igraph and structured text exploration
I am slowly developing a package, qdap, to bridge structured text formats (e.g., classroom transcripts) with the many great R packages that visualize and analyze quantitative data. If you care to play with a rough build of the package, see: https://github.com/trinker/qdap. One of the packages qdap will bridge to is igraph.
A while back I came across a blog post on igraph and word statistics (LINK). It inspired me to learn a little about graphing and the igraph package, and it provided a nice introduction. As I play with this terrific package, I feel it is my duty to share my experiences with others who are also just starting out with igraph. The following post is a script, along with the plots it creates, built from a word frequency matrix (similar to a term document matrix from the tm package) and igraph:
Build a word frequency matrix and convert to an adjacency matrix
set.seed(10)
X <- matrix(rpois(100, 1), 10, 10)
colnames(X) <- paste0("Guy_", 1:10)
rownames(X) <- c('The', 'quick', 'brown', 'fox', 'jumps', 'over', 'a', 'bot', 'named', 'Dason')
X #word frequency matrix
Y <- X >= 1
Y <- apply(Y, 2, as, "numeric") #boolean matrix
rownames(Y) <- rownames(X)
Z <- t(Y) %*% Y #adjacency matrix
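To see what the adjacency matrix holds, here is a minimal base R sketch on a tiny made-up matrix (the names `X_toy`, `Y_toy`, and `Z_toy` are hypothetical and not part of the script above). Each off-diagonal cell counts the words two speakers share, and each diagonal cell counts the distinct words one speaker used.

```r
# Toy data (hypothetical): 3 words by 2 speakers, filled column by column.
# Speaker A says "fox" twice and "over" once; B says "fox" once and "bot" 3 times.
X_toy <- matrix(c(2, 0, 1,
                  1, 3, 0),
                nrow = 3,
                dimnames = list(c("fox", "bot", "over"), c("A", "B")))

Y_toy <- (X_toy >= 1) * 1   # numeric 0/1 matrix; dimnames are kept
Z_toy <- crossprod(Y_toy)   # same result as t(Y_toy) %*% Y_toy

Z_toy["A", "B"]             # words shared by A and B ("fox" only)
diag(Z_toy)                 # distinct words per speaker
```

Multiplying by 1 is one way to turn the logical matrix into numbers without losing row and column names, which is why the toy version does not need to reassign them.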
Build a graph from the above matrix
library(igraph) #load igraph before calling any of its functions
g <- graph.adjacency(Z, weighted=TRUE, mode ='undirected')
# remove loops
g <- simplify(g)
# set labels and degrees of vertices
V(g)$label <- V(g)$name
V(g)$degree <- degree(g)
#Plot a Graph
set.seed(3952)
layout1 <- layout.auto(g)
#for more on layout see:
browseURL("http://finzi.psych.upenn.edu/R/library/igraph/html/layout.html")
opar <- par()$mar; par(mar=rep(0, 4)) #Give the graph lots of room
plot(g, layout=layout1)
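Newer igraph releases use underscore-style names for the same steps. The sketch below assumes igraph 1.0 or later, where `graph_from_adjacency_matrix()` and `layout_nicely()` are the counterparts of `graph.adjacency()` and `layout.auto()`. The small matrix `Z_toy` is made up for illustration.

```r
library(igraph)

# Hypothetical 3 x 3 co-occurrence matrix (symmetric, non-zero diagonal).
Z_toy <- matrix(c(2, 1, 0,
                  1, 3, 2,
                  0, 2, 1), 3, 3,
                dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

g_toy <- graph_from_adjacency_matrix(Z_toy, mode = "undirected", weighted = TRUE)
g_toy <- simplify(g_toy)          # drops the self-loops created by the diagonal
V(g_toy)$label <- V(g_toy)$name   # label vertices with the matrix dimnames

set.seed(3952)
lay <- layout_nicely(g_toy)       # one row of x/y coordinates per vertex
# plot(g_toy, layout = lay)       # uncomment to draw the graph
```

Only A-B and B-C remain as edges after `simplify()`, because the A-C cell is zero and the diagonal entries become loops that are removed.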
Alter widths of edges based on similarity of people’s dialogue
#adjust the widths of the edges and add distance measure labels
#use 1 - binary (?dist) a proportion distance of two vectors
#1 is perfect and 0 is no overlap (using 1 - binary)
edge.weight <- 7 #a maximizing thickness constant
z1 <- edge.weight*(1-dist(t(X), method="binary"))
E(g)$width <- c(z1)[c(z1) != 0] #remove 0s: these won't have an edge
z2 <- round(1-dist(t(X), method="binary"), 2)
E(g)$label <- c(z2)[c(z2) != 0]
plot(g, layout=layout1) #check it out!
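For a feel of what `1 - dist(..., method="binary")` measures, the following base R sketch compares it with a hand computation on two made-up speakers (`X_toy`, `shared`, and `either` are hypothetical names). The value is the share of words used by both speakers among the words used by at least one of them.

```r
# Hypothetical word counts for two speakers over five words (rows = words).
X_toy <- cbind(A = c(2, 0, 1, 1, 0),
               B = c(1, 3, 0, 1, 0))

# dist() works on rows, so transpose to compare speakers.
sim <- 1 - dist(t(X_toy), method = "binary")

# The same quantity by hand.
shared <- sum(X_toy[, "A"] > 0 & X_toy[, "B"] > 0)  # words both speakers used
either <- sum(X_toy[, "A"] > 0 | X_toy[, "B"] > 0)  # words at least one used
by_hand <- shared / either

c(from_dist = as.numeric(sim), by_hand = by_hand)
```

Words that neither speaker used are ignored, so adding rows of zeros does not change the similarity.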
Scale the label cex based on word counts
SUMS <- diag(Z) #distinct words used per person (same as colSums(Y), not colSums(X))
label.size <- .5 #a maximizing label size constant
V(g)$label.cex <- (log(SUMS)/max(log(SUMS))) + label.size
plot(g, layout=layout1) #check it out!
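The scaling formula is easier to read on a handful of numbers. This base R sketch uses made-up counts (`counts` is a hypothetical stand-in for `SUMS`) to show that the resulting sizes run from `label.size` up to `label.size + 1`.

```r
counts <- c(1, 2, 5, 10)   # hypothetical per-vertex counts
label.size <- 0.5          # baseline added to every label

# Log-scale the counts to [0, 1], then shift by the baseline.
cex <- log(counts) / max(log(counts)) + label.size

range(cex)                 # smallest and largest label sizes
```

Two edge cases are worth guarding against if you reuse this: a count of 0 gives `-Inf`, and a vector in which every count is 1 gives `NaN` because the denominator is 0.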
Add vertex coloring based on a factor
#add factor information via vertex color
set.seed(15)
V(g)$gender <- rbinom(10, 1, .4)
V(g)$color <- ifelse(V(g)$gender==0, "pink", "lightblue")
plot(g, layout=layout1) #check it out!
plot(g, layout=layout1, edge.curved = TRUE) #curve it up
par(mar=opar) #reset margins
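With more than two groups, `ifelse()` gets awkward. One alternative, sketched here in base R with made-up codes, is a named lookup vector (`palette_lookup` is a hypothetical name), which extends to any number of levels by adding entries.

```r
gender <- c(0, 1, 1, 0, 1)   # hypothetical group codes, one per vertex

# Names are the group codes as strings; values are the colors.
palette_lookup <- c("0" = "pink", "1" = "lightblue")

# Index by name, then drop the names to leave a plain character vector.
vertex_colors <- unname(palette_lookup[as.character(gender)])
vertex_colors
```

The result could be assigned to `V(g)$color` in the same way as the `ifelse()` version, provided it has one entry per vertex.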
Try it interactively with tkplot
#interactive version
tkplot(g) #an interactive version of the graph
tkplot(g, edge.curved =TRUE)
This is just scratching the surface of igraph’s capabilities. Click here for a link to more igraph documentation.
This post is me toying with different ideas and concepts. If you see a way to improve the code or the thinking, please leave a comment.
For a .txt version of this demonstration click here