The Emacs package Org-Roam provides a powerful tool to take notes following the idea of the Zettelkasten method. You can write notes with all the power that Emacs provides while linking your thoughts to each other and with your bibliography. This article shows how to analyse and visualise Org-Roam knowledge networks with the iGraph package and the R language.
The image below shows the current state of my personal knowledge network as visualised by iGraph. The Org-Roam community is working hard to create a visual user interface for Org-Roam with network diagrams. While this user interface is a great tool to explore your network, Emacs users tend to prefer writing code instead of clicking buttons. This article adds to this concept by showing how to apply some mathematical analysis and delve deep into your Zettelkasten. The beauty of iGraph is that it is available in R, Python, Mathematica and C/C++.
You can download the code in this article from GitHub.
Loading the Database
Org-Roam stores information about the notes in your system in an SQLite database. This database holds information about the nodes (registered files and headings) and links. The DBI package enables connecting to these databases.
The first lines open the connection. You will have to change the path of the database to match your own setup for this to work. The first query extracts the unique ID and the title for all nodes. The second query extracts the links between nodes. The Org-Roam database also stores external links, which are excluded from this analysis.
The next part cleans the titles by removing quotation marks and joins the tables. I could have done this in the query, but I know Tidyverse better than SQL.
    ## Visualise Org-Roam database

    ## Connect to database and extract nodes and links
    library(DBI)
    roam <- dbConnect(RSQLite::SQLite(), "~/Documents/org-roam/data/org-roam.db")
    nodes <- dbGetQuery(roam, "SELECT id, title FROM nodes")
    links <- dbGetQuery(roam, "SELECT source, dest FROM links WHERE type = '\"id\"'")
    dbDisconnect(roam)

    ## Clean node names and create network table
    library(tidyverse)
    nodes <- nodes %>%
      mutate(title = str_remove_all(title, "\""))
    network <- left_join(links, nodes, by = c("source" = "id")) %>%
      left_join(nodes, by = c("dest" = "id")) %>%
      select(from = title.x, to = title.y)
The result of this code is a data frame in which each row is one link, holding the titles of the two nodes it connects (from and to). Nodes without any links are dropped because the join starts from the links table. The next step is to convert this data to iGraph format and start visualising and analysing.
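For readers who would rather do the join in the query itself, the sketch below shows one way it could look. It runs against a toy in-memory database, not a real Org-Roam file: the table contents, the column aliases from_title and to_title, and the unquoted 'id' type value are assumptions made for illustration (a real Org-Roam database stores these values wrapped in quotation marks, as the query above shows).

```r
library(DBI)

# Toy in-memory database that mimics the two tables used above.
# Assumption: values are stored without the extra quotation marks
# found in a real Org-Roam database.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "nodes", data.frame(id = c("a", "b", "c"),
                                      title = c("Alpha", "Beta", "Gamma")))
dbWriteTable(con, "links", data.frame(source = c("a", "b"),
                                      dest = c("b", "c"),
                                      type = c("id", "id")))

# Join the links to the node titles inside the query
network_sql <- dbGetQuery(con, "
  SELECT s.title AS from_title, d.title AS to_title
  FROM links AS l
  LEFT JOIN nodes AS s ON l.source = s.id
  LEFT JOIN nodes AS d ON l.dest = d.id
  WHERE l.type = 'id'
  ORDER BY l.source")
dbDisconnect(con)

network_sql
```

The first two columns of the result play the same role as the from and to columns of the Tidyverse version.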
Create iGraph network
The iGraph package uses a specific data format. The graph_from_data_frame() function (called graph.data.frame() in older releases of the package) converts a data frame to an iGraph network. The simplify() function removes any multiple links and self-references.

    ## Create network diagram
    library(igraph)
    g <- graph_from_data_frame(network)
    g <- simplify(g, remove.multiple = TRUE, remove.loops = TRUE)
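To see what simplify() does, here is a minimal sketch on a made-up edge list (the node names A, B and C are purely illustrative). The list contains one duplicated link and one self-reference, both of which disappear after simplification.

```r
library(igraph)

# Toy edge list: A -> B appears twice and C links to itself
toy <- data.frame(from = c("A", "A", "B", "C"),
                  to   = c("B", "B", "C", "C"))

g_toy <- graph_from_data_frame(toy)
g_simple <- simplify(g_toy, remove.multiple = TRUE, remove.loops = TRUE)

ecount(g_toy)     # number of links before simplification
ecount(g_simple)  # number of links after simplification
```

Only the links A to B and B to C remain, while all three nodes are kept.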
Analyse the Network
The iGraph package has an extensive library of functions to analyse networks. In this section, we look at a measure for centrality and clustering.
Centrality
Not all nodes in the network are of equal importance. Centrality is a concept in graph theory that describes the importance of a node in a network. The most straightforward method of calculating the centrality of a node is to count the number of edges (connections) that it has with other nodes, which is called its degree. The example below shows a simple undirected graph and the degree of each node.
In a directed graph, like with Org-Roam, we can calculate the in-degree and the out-degree: the number of edges pointing to the node (its backlinks) and the number of edges pointing away from it (its links). The degree() function of the iGraph package calculates degree centrality. The code below generates a table with the top ten most connected nodes.
    ## Centrality
    centrality <- tibble(Node = names(degree(g)),
                         Degree = degree(g, mode = "total"),
                         Links = degree(g, mode = "out"),
                         Backlinks = degree(g, mode = "in"))

    ## Ten nodes with the highest total degree, sorted from high to low
    centrality %>%
      slice_max(Degree, n = 10)
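The difference between the three modes is easiest to see on a tiny directed graph. The sketch below uses three made-up notes in which A links to B and C, and B links to C.

```r
library(igraph)

# Toy directed graph: A -> B, A -> C, B -> C
g_deg <- graph_from_literal(A -+ B, A -+ C, B -+ C)

d_out   <- degree(g_deg, mode = "out")    # links leaving each node
d_in    <- degree(g_deg, mode = "in")     # backlinks arriving at each node
d_total <- degree(g_deg, mode = "total")  # sum of both

rbind(out = d_out, `in` = d_in, total = d_total)
```

Note A has two links and no backlinks, note C has two backlinks and no links, and every node has a total degree of two.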
Communities
While some nodes might be more connected than others, it would also be good to know which nodes are closest to each other. In other words, can we find clusters of nodes (communities) to find meta-structure in the network? A community is a sub-network with densely connected nodes, but less so with nodes outside the community. Expressed probabilistically, two random nodes are more likely to be associated when they form part of the same community than when they don't. Thus, community detection increases the parsimony of the network by identifying those groups of nodes that are most closely related to each other.
Using the example network, two communities can be visually distinguished: nodes 1–4 and nodes 5–7. Texts A, B and C belong to the first community, while texts C and D belong to the second community. Text C (nodes 2 and 5) spans the two communities of discourse. This solution is valid because each node has more connections to nodes within its own community than to nodes outside it.
The example's communities are easily detected visually, but community detection becomes more difficult as the network grows. Community detection is a mathematical process to cluster nodes into cohesive sub-networks.
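One commonly used, though not universally accepted, way to put a number on how cohesive a proposed grouping is, is modularity: it compares the share of links that fall inside the communities with the share expected if links were placed at random. The sketch below is a separate toy example (two triangles joined by a single link, not the example network above), and the grouping is assigned by hand rather than detected.

```r
library(igraph)

# Toy undirected graph: two triangles (A, B, C and D, E, F) joined by C - D
g_mod <- graph_from_literal(A - B - C - A, D - E - F - D, C - D)

# Hand-made grouping: one community per triangle
groups <- ifelse(V(g_mod)$name %in% c("A", "B", "C"), 1, 2)

# Modularity of this grouping
q_groups <- modularity(g_mod, groups)
q_groups
```

With seven links in total, three inside each triangle and one bridge, the score works out to 5/14, or roughly 0.36. Values closer to zero would indicate a grouping that is no better than chance.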
Several algorithms for community detection have been developed based on a range of mathematical principles. However, the numerical validation of community structure has not yet been satisfactorily solved, and no agreed single method exists to assess the quality of communities. In my research into knowledge networks, the Spinglass algorithm provided the most interpretable results.
    ## Community-detection
    c <- cluster_spinglass(g)
The algorithm returns a communities object whose membership vector assigns a community number to each node. Making sense of community structure requires some human interpretation because the algorithm does not analyse the texts, only the connections between them. Please note that clustering algorithms are computationally intensive and can take a while.
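One practical caveat: cluster_spinglass() only works on a connected graph, and a collection of notes can easily contain small islands that are not linked to the rest. A possible workaround, sketched here on a made-up graph, is to keep only the largest weakly connected component before clustering. The object names are illustrative, and dropping the smaller islands is an assumption about what suits your analysis.

```r
library(igraph)

# Toy directed graph with two separate islands: {A, B, C} and {D, E}
g_islands <- graph_from_literal(A -+ B, B -+ C, C -+ A, D -+ E)

# Identify the weakly connected components (link direction is ignored)
comp <- components(g_islands, mode = "weak")

# Keep only the vertices that belong to the largest component
keep <- which(comp$membership == which.max(comp$csize))
g_main <- induced_subgraph(g_islands, keep)

is_connected(g_main, mode = "weak")
```

The resulting graph g_main can then be passed to cluster_spinglass() in place of the full network.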
We can assign a name to each community by using the name of the node with the most backlinks within that community. The code below creates a table of the community memberships and joins it with the centrality table. The last part names each community after its node with the most backlinks, counts the number of members and arranges the table by community size.
    communities <- tibble(Node = c$names,
                          Community = c$membership) %>%
      left_join(centrality, by = "Node") %>%
      group_by(Community) %>%
      summarise(Name = Node[which.max(Backlinks)],
                Nodes = n()) %>%
      arrange(desc(Nodes))
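The naming step can be tried out in isolation on a small hand-made table. The note titles, community numbers and backlink counts below are invented for illustration.

```r
library(dplyr)

# Toy membership table with backlink counts
toy_members <- tibble(
  Node      = c("Graph theory", "Topology", "Emacs", "Org mode", "Lisp"),
  Community = c(1, 1, 2, 2, 2),
  Backlinks = c(5, 2, 1, 7, 3)
)

# Name each community after its most backlinked node and count its members
toy_communities <- toy_members %>%
  group_by(Community) %>%
  summarise(Name = Node[which.max(Backlinks)],
            Nodes = n()) %>%
  arrange(desc(Nodes))

toy_communities
```

Community 2 is listed first because it has three members and is named after "Org mode", its most backlinked note. Community 1 follows with two members under the name "Graph theory".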
Visualise Org-Roam
Now that we have analysed the graph, it is time to visualise the results. iGraph has extensive plotting functionality that gives full control over every aspect of the visualisation. To create the graph shown at the start of this article, use the following code:
    ## Visualise graph
    par(mar = c(0, 0, 0, 0))
    plot(g,
         layout = layout_with_fr, # Fruchterman-Reingold, formerly layout.fruchterman.reingold
         mark.border = NA,
         vertex.color = c$membership,
         vertex.label = NA,
         vertex.size = sqrt(degree(g, mode = "in") + 1),
         vertex.frame.color = NA,
         edge.arrow.size = 0,
         edge.color = "lightgrey")
The layout of graphs in this example follows the Fruchterman-Reingold method. This is a force-directed method: linked nodes attract each other while all nodes repel each other, so heavily connected nodes tend to end up near the centre of the picture. Several other methods are available to compute a layout.
The colour of each node relates to its community. The node's size is the square root of the number of backlinks (in-degree) plus one, so that large nodes don't overpower the picture and small nodes don't vanish.
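A quick calculation shows the effect of this scaling. The backlink counts below are made up to cover a note without backlinks, a typical note and a hub.

```r
# Hypothetical backlink counts: an orphan, a typical note and a hub
backlinks <- c(0, 3, 99)

# Same scaling as used for vertex.size in the plot
sizes <- sqrt(backlinks + 1)
sizes
```

The note without backlinks still gets a size of 1, while the hub with 99 backlinks is drawn at size 10 instead of 99.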
The remainder of the options improve the readability of the graph. This image is aesthetically pleasing, but it is not very informative. Adding text to the nodes would make it unreadable.
Subnetworks
When a network is fairly large, it might be more informative to visualise only a section. The code below looks up the index of the node of interest, in this case the note titled "Topology". The V() function returns all vertices (the nodes) of the network together with their properties. The make_ego_graph() function extracts the subnetwork around a given node (the ego) up to a given order, which is the maximum number of steps away from the ego. This function returns a list of networks because you can enter more than one node.
    ## Subnetwork
    node <- which(names(V(g)) == "Topology")
    g1 <- make_ego_graph(g, order = 2, nodes = node)
    c1 <- cluster_walktrap(g1[[1]])
    plot(g1[[1]],
         layout = layout_with_kk, # Kamada-Kawai, formerly layout.kamada.kawai
         mark.border = NA,
         vertex.size = sqrt(degree(g1[[1]])) + 2,
         vertex.color = c1$membership,
         vertex.frame.color = NA,
         vertex.label.family = "sans",
         vertex.label.color = "black",
         edge.arrow.size = .3)
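The meaning of the order argument becomes clear on a toy chain of five made-up notes, each linking to the next. An ego graph of order 2 around the first note contains only the notes that can be reached within two steps.

```r
library(igraph)

# Toy chain of notes: A -> B -> C -> D -> E
chain <- data.frame(from = c("A", "B", "C", "D"),
                    to   = c("B", "C", "D", "E"))
g_chain <- graph_from_data_frame(chain)

# Ego graph of order 2 around note A (the result is a list with one graph)
ego_index <- which(V(g_chain)$name == "A")
ego_list <- make_ego_graph(g_chain, order = 2, nodes = ego_index)
ego_a <- ego_list[[1]]

V(ego_a)$name
```

Notes D and E are more than two steps away from A, so they are left out of the subnetwork.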
Interactive visualisation
The networkD3 package can create interactive graphs, as implemented in this code snippet. The first step converts the iGraph object to a D3 object. Then, the htmlwidgets package lets you save this object as an interactive HTML file.
The code below adds the number of backlinks to the node table. Then, it defines the legend using the community names described above.
    ## Interactive visualisation
    library(networkD3)
    library(htmlwidgets)

    ## Convert to object suitable for networkD3
    ## The communities table is sorted by size, so look up names by community number
    group_names <- communities$Name[match(c$membership, communities$Community)]
    d3 <- igraph_to_networkD3(g, group = group_names)
    d3$nodes$backlinks <- degree(g, mode = "in")

    ## Create force directed network plot
    p <- forceNetwork(Links = d3$links, Nodes = d3$nodes,
                      Source = "source", Target = "target",
                      NodeID = "name", Group = "group",
                      Nodesize = "backlinks",
                      zoom = TRUE, legend = TRUE)
    saveWidget(p, file = file.path(getwd(), "org-roam.html"))
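A detail to watch when attaching community names to nodes: the communities table is sorted by size, so its row order does not have to follow the community numbers. Looking names up by row position can then mislabel groups, while match() finds the row that belongs to each community number. The toy table and membership vector below are invented to show the difference.

```r
# Toy community table, sorted by size rather than by community number
community_table <- data.frame(Community = c(3, 1, 2),
                              Name = c("Statistics", "Emacs", "Topology"))

# Community number of each of six hypothetical nodes
node_membership <- c(1, 1, 2, 3, 3, 3)

# Lookup by row position: uses the community number as a row index
by_position <- community_table$Name[node_membership]

# Lookup by community number: finds the matching row first
by_match <- community_table$Name[match(node_membership, community_table$Community)]

data.frame(node_membership, by_position, by_match)
```

In this toy case the positional lookup labels the nodes of community 1 as "Statistics", whereas the lookup with match() labels them "Emacs" as intended.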
Further Analysis