'Preparing data for Gephi with Python or R

I am currently trying to prepare my data for network analysis in Gephi. I have data in the following format:

Raw Data

SHIPMENTCOUNT   US Port Foreign Initial Port    Ultimate Port
1   BALTIMORE   ANTWERP DORTMUND
1   BALTIMORE   ANTWERP MUMBAI
1   BALTIMORE   ANTWERP SPIJKENISSE
1   BALTIMORE   BEILUN  XIAOLAN
1   BALTIMORE   BREMERHAVEN ETTRINGEN
1   BALTIMORE   BREMERHAVEN HILTER
1   BALTIMORE   CAUCEDO SANTOS
1   BALTIMORE   CRISTOBAL   SANTOS
1   BALTIMORE   FELIXSTOWE  MILTON KEYNES
1   BALTIMORE   FELIXSTOWE  WALTHAM CROSS
1   BALTIMORE   FELIXSTOWE  WISBECH
1   BALTIMORE   GENOA   CARPENEDOLO
1   BALTIMORE   GENOA   PONTE SAN MAR
1   BALTIMORE   HALIFAX TORONTO
1   BALTIMORE   HAMBURG HORST
1   BALTIMORE   HAMBURG VILSHOFEN
1   BALTIMORE   HONG KONG   NHAVA SHEVA
1   BALTIMORE   HONG KONG   ZHANGJIAGANG
1   BALTIMORE   LE HAVRE    GREENOCK
1   BALTIMORE   LE HAVRE    MAURON
1   BALTIMORE   LE HAVRE    WORCESTER
1   BALTIMORE   LEGHORN FIESSO UMBERI
1   BALTIMORE   LIVERPOOL   GOLDTHORPE
1   BALTIMORE   NAVEGANTES  BETANIA
1   BALTIMORE   ROTTERDAM   EEMNES

And I need data in this format (The network is directed and goes from the ultimate port to the foreign initial port, and from the foreign initial port to the US port. The weights are the same for the values in the same row):

Formatted data

Source  Target  Type    Weight
DORTMUND    ANTWERP Directed    1
MUMBAI  ANTWERP Directed    1
SPIJKENISSE ANTWERP Directed    1
XIAOLAN BEILUN  Directed    1
ETTRINGEN   BREMERHAVEN Directed    1
HILTER  BREMERHAVEN Directed    1
SANTOS  CAUCEDO Directed    1
SANTOS  CRISTOBAL   Directed    1
MILTON KEYNES   FELIXSTOWE  Directed    1
WALTHAM CROSS   FELIXSTOWE  Directed    1
WISBECH FELIXSTOWE  Directed    1
CARPENEDOLO GENOA   Directed    1
PONTE SAN MAR   GENOA   Directed    1
TORONTO HALIFAX Directed    1
HORST   HAMBURG Directed    1
VILSHOFEN   HAMBURG Directed    1
NHAVA SHEVA HONG KONG   Directed    1
ZHANGJIAGANG    HONG KONG   Directed    1
GREENOCK    LE HAVRE    Directed    1
MAURON  LE HAVRE    Directed    1
WORCESTER   LE HAVRE    Directed    1
FIESSO UMBERI   LEGHORN Directed    1
GOLDTHORPE  LIVERPOOL   Directed    1
BETANIA NAVEGANTES  Directed    1
EEMNES  ROTTERDAM   Directed    1


Solution 1:[1]

assuming df is your initial table (please DO provide example data by using dput: see below)

library(dplyr)

get_unique_edges <- function(two_columns){
    two_columns %>% unlist %>% combn(., m = 2) %>%  t %>% unique
}


edges <- 
    structure(
        .Data = data.frame(rbind(
            get_unique_edges(df[4:3]),
            get_unique_edges(df[3:2])
        ), 'Directed', 1),
        .Names = c('Source', 'Target', 'Type', 'Weight')
    ) %>%
    filter(Source != Target) %>%
    arrange(Source, Target)

output:

edges %>% head

## > edges %>% head
##    Source      Target     Type Weight
## 1 ANTWERP   BALTIMORE Directed      1
## 2 ANTWERP      BEILUN Directed      1
## 3 ANTWERP      BEILUN Directed      1
## 4 ANTWERP BREMERHAVEN Directed      1
## 5 ANTWERP BREMERHAVEN Directed      1
## 6 ANTWERP     CAUCEDO Directed      1

example data:

## output of dput(df):
structure(list(SHIPMENTCOUNT = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L), US.Port = c("BALTIMORE", "BALTIMORE", "BALTIMORE", "BALTIMORE", 
"BALTIMORE", "BALTIMORE", "BALTIMORE", "BALTIMORE"), Foreign.Initial.Port = c("ANTWERP", 
"ANTWERP", "ANTWERP", "BEILUN", "BREMERHAVEN", "BREMERHAVEN", 
"CAUCEDO", "CRISTOBAL"), Ultimate.Port = c("DORTMUND", "MUMBAI", 
"SPIJKENISSE", "XIAOLAN", "ETTRINGEN", "HILTER", "SANTOS", "SANTOS"
)), class = "data.frame", row.names = c(NA, 8L))

Solution 2:[2]

Here is a Python version, using networkx.

import pandas as pd
import networkx as nx
from io import StringIO

# The input as in the question in a string to be read by pandas
input_str = """SHIPMENTCOUNT    US Port Foreign Initial Port    Ultimate Port
1   BALTIMORE   ANTWERP DORTMUND
1   BALTIMORE   ANTWERP MUMBAI
1   BALTIMORE   ANTWERP SPIJKENISSE
1   BALTIMORE   BEILUN  XIAOLAN
1   BALTIMORE   BREMERHAVEN ETTRINGEN
1   BALTIMORE   BREMERHAVEN HILTER
1   BALTIMORE   CAUCEDO SANTOS
1   BALTIMORE   CRISTOBAL   SANTOS
1   BALTIMORE   FELIXSTOWE  MILTON KEYNES
1   BALTIMORE   FELIXSTOWE  WALTHAM CROSS
1   BALTIMORE   FELIXSTOWE  WISBECH
1   BALTIMORE   GENOA   CARPENEDOLO
1   BALTIMORE   GENOA   PONTE SAN MAR
1   BALTIMORE   HALIFAX TORONTO
1   BALTIMORE   HAMBURG HORST
1   BALTIMORE   HAMBURG VILSHOFEN
1   BALTIMORE   HONG KONG   NHAVA SHEVA
1   BALTIMORE   HONG KONG   ZHANGJIAGANG
1   BALTIMORE   LE HAVRE    GREENOCK
1   BALTIMORE   LE HAVRE    MAURON
1   BALTIMORE   LE HAVRE    WORCESTER
1   BALTIMORE   LEGHORN FIESSO UMBERI
1   BALTIMORE   LIVERPOOL   GOLDTHORPE
1   BALTIMORE   NAVEGANTES  BETANIA
1   BALTIMORE   ROTTERDAM   EEMNES
"""

# Read the above
df = pd.read_csv(StringIO(input_str), sep='\t')
df['Weight'] = 1.0

# Note: 'Weight' is added but Gephi puts its own 'Weight' which is again 1.0
G = nx.from_pandas_edgelist(df, source='Ultimate Port', target='Foreign Initial Port', edge_attr=['Weight'], create_using=nx.DiGraph)

# Output
nx.write_gexf(G, 'output.gexf')

As it is written as comment in the code, a 'Weight' edge-attribute is added but Gephi keeps its own weight, which is 1.0, and keeps the weight values in a separate column. One can always use Gephi's Copy data to another column from Data Laboratory, to copy values between columns.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 I_O
Solution 2 Yannis P.