How to split a string of author names by comma into a data frame and generate an edgelist to plot a network?

I have a long list of publications saved as a single column in a data frame. I'd like to generate a network of a short subset of co-authors that have contributed to these publications (ignoring the remaining authors). I'm wondering how to extract the edge list for the subset of co-authors in order to generate a network using igraph or cytoscape.

I've read in the publication list and saved the authors to a new dataframe in a single column.

head(pubs)
[1] "Darwin C, Mendel G, Guy R. This is the title of the paper. Super high impact Journal. 1866. Oct 19;16(1):229."
[2] "Franklin R, Watson J, Dawkins R, Mendel G, Darwin C. The use of time travel for writing scientific articles. Soc for Time Trav Sc. 2019. Aug 14;1(1):1."

I then removed the unnecessary info from the rows (e.g. pub date, title, journal, etc.) using the following code:

my_colleagues <- c("Darwin C", "Mendel G", "Franklin R", "Dawkins R") 
authors <- as.data.frame(gsub("\\..*","",pubs$V1))
colnames(authors) <- "Authors"
authors_split <- data.frame(do.call('rbind', strsplit(as.character(authors$Authors),', ',fixed=FALSE)))

I expect this to return a data frame where all of the author names are separated out into new columns. While I am able to split the names, it repeats the author names, in sequence, to fill all of the columns for the longest string of author names in the publication list (i.e. the longest author list consists of 23 names, so there are 23 columns in all rows, even if a publication has <23 authors). Instead of repeating the names I would like these columns to be blank or contain NA.
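The repeated names come from `rbind` recycling the shorter author vectors to the length of the longest one. A minimal sketch of padding with `NA` instead, assuming the author strings have already been isolated (the `author_strings` vector below is a hypothetical stand-in for `authors$Authors`):

```r
# Hypothetical stand-in for authors$Authors
author_strings <- c("Darwin C, Mendel G, Guy R",
                    "Franklin R, Watson J, Dawkins R, Mendel G, Darwin C")

# One character vector of names per publication
author_list <- strsplit(author_strings, ", ", fixed = TRUE)

# Extending a vector with `length<-` fills the new slots with NA,
# so every row ends up with the same number of columns
max_len <- max(lengths(author_list))
padded <- lapply(author_list, function(x) {
  length(x) <- max_len
  x
})

authors_split <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
```

Because every vector now has the same length, `rbind` has nothing left to recycle and the short rows end in `NA`.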

Beyond that, however, I'm not sure how to test for co-authorship (i.e. connections between nodes) among my short list of my_colleagues, or how to create an undirected edgelist to plot my network. Ultimately I would like an 'undirected edgelist' (essentially a two-column data frame) that looks like the following:

head(edgelist)

[1] "Darwin C" "Mendel G"
[2] "Franklin R" "Watson J"
[3] "Franklin R" "Dawkins R"
[4] "Franklin R" "Mendel G"
[5] "Franklin R" "Darwin C"
[6] "Watson J" "Dawkins R"
[7] "Watson J" "Mendel G"
[8] "Watson J" "Darwin C"
[9] "Dawkins R" "Mendel G"
[10] "Dawkins R" "Darwin C"


Solution 1:[1]

Here's a solution that uses lists instead of a data frame and produces a regular character vector as the resulting edgelist. I don't know if you have any other requirements, but this seemed to do the job:

# libraries
library(igraph)

# example data
books <- c("Darwin C, Mendel G, Guy R. This is the title of the paper. Super high impact Journal. 1866. Oct 19;16(1):229.",
            "Franklin R, Watson J, Dawkins R, Mendel G, Darwin C. The use of time travel for writing scientific articles. Soc for Time Trav Sc. 2019. Aug 14;1(1):1.")

# splitting textlines at periods
Split <- strsplit(books, split = ".", fixed = TRUE)

# getting the first element of each list entry (everything before the first period is the author names)
authors <- unlist(lapply(Split,"[[",1))

# splitting at commas to get the different names
SplitAuthors <- sapply(authors, strsplit, split = ",", fixed = TRUE)

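# getting all combinations of authors to get all connections between them
# (lapply always returns a list; sapply could simplify to a matrix when every paper has the same number of authors)
AuthorCombinations <- lapply(SplitAuthors,function(x){combn(unlist(x),m = 2)})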

# unlisting the matrices of combinations of authors into an edgelist (+ deleting automatically generated list names)
AuthorEdges <- rapply(AuthorCombinations,unlist)
names(AuthorEdges) <- NULL

# removing leading and trailing whitespace from author names
AuthorEdges <- trimws(AuthorEdges)

# creating graph (make_graph() is igraph's current name for this constructor)
AuthorGraph <- make_graph(AuthorEdges, directed = FALSE)

# plotting graph
plot(AuthorGraph)
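If a two-column edgelist like the one in the question is needed (for example to export to Cytoscape), it can be built with base R alone. This is only a sketch under assumptions: `author_lists` is a hypothetical name for the split and trimmed author names, and the column names `from` and `to` are arbitrary.

```r
# Hypothetical input: one character vector of trimmed author names per publication
author_lists <- list(
  c("Darwin C", "Mendel G", "Guy R"),
  c("Franklin R", "Watson J", "Dawkins R", "Mendel G", "Darwin C")
)

# Single-author publications have no co-author pairs (and combn() would fail on them)
author_lists <- author_lists[lengths(author_lists) >= 2]

# combn() returns a 2 x k matrix; transpose it so each row is one pair.
# Sorting first puts every undirected pair in the same order across papers.
pair_list <- lapply(author_lists, function(x) t(combn(sort(x), 2)))

# Stack all pairs and drop co-authorships that occur more than once
edgelist <- unique(do.call(rbind, pair_list))
colnames(edgelist) <- c("from", "to")
edgelist <- as.data.frame(edgelist, stringsAsFactors = FALSE)
```

Dropping duplicates discards how often two people published together; keeping the repeated rows and counting them would give an edge weight instead.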

EDIT: I just saw that you only want to look at a subgraph of specific authors. If your data is not extremely big, you can simply use the code above to generate the whole network and then easily look into different subgraphs by specifying authors like this:

Excerpt <- induced_subgraph(AuthorGraph,c("Darwin C", "Mendel G","Franklin R"))
plot(Excerpt)
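As far as I know, `induced_subgraph` raises an error if a requested name is not a vertex of the graph, so with a longer colleague list it may be safer to keep only the names that actually occur. A small sketch on a toy graph (the graph `g` and the name "Hypothetical X" are made up for illustration):

```r
library(igraph)

# Toy undirected graph; edges are given as consecutive pairs of names
g <- make_graph(c("Darwin C", "Mendel G",
                  "Franklin R", "Darwin C",
                  "Franklin R", "Dawkins R",
                  "Watson J", "Mendel G"),
                directed = FALSE)

# Colleague short list, including one name that is not in the graph
my_colleagues <- c("Darwin C", "Mendel G", "Franklin R", "Dawkins R", "Hypothetical X")

# Keep only the colleagues that are present as vertices
present <- intersect(my_colleagues, V(g)$name)

# Subgraph of the colleagues and its edges as a two-column matrix of names
sub_g <- induced_subgraph(g, present)
colleague_edges <- as_edgelist(sub_g)
```

`colleague_edges` then holds only the connections where both authors are on the short list.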

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1