'Preparing data for Gephi with Python or R
I am currently trying to prepare my data for network analysis in Gephi. I have data in the following format:
Raw Data
SHIPMENTCOUNT US Port Foreign Initial Port Ultimate Port
1 BALTIMORE ANTWERP DORTMUND
1 BALTIMORE ANTWERP MUMBAI
1 BALTIMORE ANTWERP SPIJKENISSE
1 BALTIMORE BEILUN XIAOLAN
1 BALTIMORE BREMERHAVEN ETTRINGEN
1 BALTIMORE BREMERHAVEN HILTER
1 BALTIMORE CAUCEDO SANTOS
1 BALTIMORE CRISTOBAL SANTOS
1 BALTIMORE FELIXSTOWE MILTON KEYNES
1 BALTIMORE FELIXSTOWE WALTHAM CROSS
1 BALTIMORE FELIXSTOWE WISBECH
1 BALTIMORE GENOA CARPENEDOLO
1 BALTIMORE GENOA PONTE SAN MAR
1 BALTIMORE HALIFAX TORONTO
1 BALTIMORE HAMBURG HORST
1 BALTIMORE HAMBURG VILSHOFEN
1 BALTIMORE HONG KONG NHAVA SHEVA
1 BALTIMORE HONG KONG ZHANGJIAGANG
1 BALTIMORE LE HAVRE GREENOCK
1 BALTIMORE LE HAVRE MAURON
1 BALTIMORE LE HAVRE WORCESTER
1 BALTIMORE LEGHORN FIESSO UMBERI
1 BALTIMORE LIVERPOOL GOLDTHORPE
1 BALTIMORE NAVEGANTES BETANIA
1 BALTIMORE ROTTERDAM EEMNES
And I need data in this format (The network is directed and goes from the ultimate port to the foreign initial port, and from the foreign initial port to the US port. The weights are the same for the values in the same row):
Formatted data
Source Target Type Weight
DORTMUND ANTWERP Directed 1
MUMBAI ANTWERP Directed 1
SPIJKENISSE ANTWERP Directed 1
XIAOLAN BEILUN Directed 1
ETTRINGEN BREMERHAVEN Directed 1
HILTER BREMERHAVEN Directed 1
SANTOS CAUCEDO Directed 1
SANTOS CRISTOBAL Directed 1
MILTON KEYNES FELIXSTOWE Directed 1
WALTHAM CROSS FELIXSTOWE Directed 1
WISBECH FELIXSTOWE Directed 1
CARPENEDOLO GENOA Directed 1
PONTE SAN MAR GENOA Directed 1
TORONTO HALIFAX Directed 1
HORST HAMBURG Directed 1
VILSHOFEN HAMBURG Directed 1
NHAVA SHEVA HONG KONG Directed 1
ZHANGJIAGANG HONG KONG Directed 1
GREENOCK LE HAVRE Directed 1
MAURON LE HAVRE Directed 1
WORCESTER LE HAVRE Directed 1
FIESSO UMBERI LEGHORN Directed 1
GOLDTHORPE LIVERPOOL Directed 1
BETANIA NAVEGANTES Directed 1
EEMNES ROTTERDAM Directed 1
Solution 1:[1]
assuming df is your initial table (please DO provide example data by using dput
: see below)
library(dplyr)
get_unique_edges <- function(two_columns){
two_columns %>% unlist %>% combn(., m = 2) %>% t %>% unique
}
edges <-
structure(
.Data = data.frame(rbind(
get_unique_edges(df[4:3]),
get_unique_edges(df[3:2])
), 'Directed', 1),
.Names = c('Source', 'Target', 'Type', 'Weight')
) %>%
filter(Source != Target) %>%
arrange(Source, Target)
output:
edges %>% head
## > edges %>% head
## Source Target Type Weight
## 1 ANTWERP BALTIMORE Directed 1
## 2 ANTWERP BEILUN Directed 1
## 3 ANTWERP BEILUN Directed 1
## 4 ANTWERP BREMERHAVEN Directed 1
## 5 ANTWERP BREMERHAVEN Directed 1
## 6 ANTWERP CAUCEDO Directed 1
example data:
## output of dput(df):
structure(list(SHIPMENTCOUNT = c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), US.Port = c("BALTIMORE", "BALTIMORE", "BALTIMORE", "BALTIMORE",
"BALTIMORE", "BALTIMORE", "BALTIMORE", "BALTIMORE"), Foreign.Initial.Port = c("ANTWERP",
"ANTWERP", "ANTWERP", "BEILUN", "BREMERHAVEN", "BREMERHAVEN",
"CAUCEDO", "CRISTOBAL"), Ultimate.Port = c("DORTMUND", "MUMBAI",
"SPIJKENISSE", "XIAOLAN", "ETTRINGEN", "HILTER", "SANTOS", "SANTOS"
)), class = "data.frame", row.names = c(NA, 8L))
Solution 2:[2]
Here is a Python version, using networkx
.
import pandas as pd
import networkx as nx
from io import StringIO
# The input as in the question in a string to be read by pandas
input_str = """SHIPMENTCOUNT US Port Foreign Initial Port Ultimate Port
1 BALTIMORE ANTWERP DORTMUND
1 BALTIMORE ANTWERP MUMBAI
1 BALTIMORE ANTWERP SPIJKENISSE
1 BALTIMORE BEILUN XIAOLAN
1 BALTIMORE BREMERHAVEN ETTRINGEN
1 BALTIMORE BREMERHAVEN HILTER
1 BALTIMORE CAUCEDO SANTOS
1 BALTIMORE CRISTOBAL SANTOS
1 BALTIMORE FELIXSTOWE MILTON KEYNES
1 BALTIMORE FELIXSTOWE WALTHAM CROSS
1 BALTIMORE FELIXSTOWE WISBECH
1 BALTIMORE GENOA CARPENEDOLO
1 BALTIMORE GENOA PONTE SAN MAR
1 BALTIMORE HALIFAX TORONTO
1 BALTIMORE HAMBURG HORST
1 BALTIMORE HAMBURG VILSHOFEN
1 BALTIMORE HONG KONG NHAVA SHEVA
1 BALTIMORE HONG KONG ZHANGJIAGANG
1 BALTIMORE LE HAVRE GREENOCK
1 BALTIMORE LE HAVRE MAURON
1 BALTIMORE LE HAVRE WORCESTER
1 BALTIMORE LEGHORN FIESSO UMBERI
1 BALTIMORE LIVERPOOL GOLDTHORPE
1 BALTIMORE NAVEGANTES BETANIA
1 BALTIMORE ROTTERDAM EEMNES
"""
# Read the above
df = pd.read_csv(StringIO(input_str), sep='\t')
df['Weight'] = 1.0
# Note: 'Weight' is added but Gephi puts its own 'Weight' which is again 1.0
G = nx.from_pandas_edgelist(df, source='Ultimate Port', target='Foreign Initial Port', edge_attr=['Weight'], create_using=nx.DiGraph)
# Output
nx.write_gexf(G, 'output.gexf')
As it is written as comment in the code, a 'Weight' edge-attribute is added but Gephi keeps its own weight, which is 1.0
, and keeps the weight values in a separate column. One can always use Gephi's Copy data to another column from Data Laboratory, to copy values between columns.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | I_O |
Solution 2 | Yannis P. |