From JSON to Tables

nivangio

7 years ago

[This article was first published on My Data Atelier » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

“First things first”. Tautological, always true. However, sometimes some data scientists seem to ignore this: you can think of using the most sophisticated and trendy algorithm, come up with brilliant ideas, imagine the most creative visualizations but, if you do not know how to get the data and handle it in the exact way you need it, all of this becomes worthless. In other words, first things first.

In my professional experience, I have heard thousands of potenitally brilliant ideas which couldn’t be even tested because the data scientist in question did not know how to handle data according to his/her needs. This has become particularly problematic with the popularisation of JSON: despite the undeniable advantages that this data structure has in terms of data storage, replication, etc, it presents a challenge for data scientist, as most algorithms require that the input data is passed in a tabular form. I have myself faced to this problem and, after a couple of times of creating some temporary solution, felt it was high time to come up to a general one: of course, the solution I have come up to is just one approach.

In order to understand it, two things from JSONs structure must be considered: Firstly, JSON objects are basically an array of objects. This means that, unlike traditional tabulated Databases there is no indication what they don´t have. This might seem trivial but if you want to tabulate JSONs it becomes certainly something to be solved. Secondly, this array of objects can contain objects which are arrays themselves. This means that, when transforming this to a table, that value has to be transformed somehow to an appropiate format for a table. The function in question contemplates these two scenarios: it is intended to be used not only to tabulate objects with identical structure but also for objects whose internal structure is different

Before deep diving into the code of the function, let us propose a scenario where this could be useful. Imagine we would like to tabulate certain information from professional football players. In this mini-example, we have chosen Lionel Messi, Cristiano Ronaldo and George Best. The information is retrieved from dbpedia, which exposes the corresponding JSONs of their wikipedia’s entries. In the code below, the first steps:


library(rjson)
library(RCurl)

players <- c("Lionel_Messi","George_Best","Cristiano_Ronaldo")

players.info <- lapply(players, function(pl) fromJSON(getURL(paste0("http://dbpedia.org/data/",pl,".json"))))

players.info <- lapply(1:length(players), function(x) players.info[[x]][[paste0("http://dbpedia.org/resource/",players[x])]])

players.unlist <- unlist(players.info)

I have to admit that this is not the simplest example as the JSONs retrieved are extremely complex and need much more pre-processing than usual (for example, than most API calls). The problem is that nowadays most API calls need an API Key and so the example becomes less reproducible.

Back to the example, firstly the data for each player is retrieved using getURL from RCurl and then converted to list by using fromJSON() from rjson, which converts json strings or files to lists. All the call is done inside a lapply statement, so the object returned is basically a list of JSONs converted to list, i.e., a list of lists. After that, from each of the sub-lists( in this case, 3, as there were three players) we are selecting only the piece of information that refers to the player himself; in a few words, a bit of cleanup. Finally, the file is unlisted. unlist() is a function which turns a list (a list of lists is a list itself) into a named vector of a certain type, to which all the elements can be coerced to; in this case, characters.

This is the point of manual intervention, where the variable selection has to be done. At this stage, it is highly recommended to have at least one example loaded in some JSON parser (JSON Viewer, or any browser’s JSONView add-in). The names of the vector generated by unlist() (“players.unlist” in the example) are a concatenation of the names of the lists that value was in, separated by “.”.

The elements in the character vector are ordered in the same way they appeared in the lists. So, the first thing to do is to establish where each of the cases start. In order to do so, the challenge now is to find a common pattern between the vector names which can identify where each of the elements start. This element does not necessarily have to be inside the final table. In this case I chose “surname\.value$”. However, I would recommend, in most of the cases, to choose the first element that appears in the object.

Similarly, the challenge now is to find common patterns in the names of the vector elements that will define each of the variables. In order to do that, the wisest would be to takle a look at the names that you have. In this case:


> unique(names(players.unlist))
  [1] "http://www.w3.org/1999/02/22-rdf-syntax-ns#type.type"   
  [2] "http://www.w3.org/1999/02/22-rdf-syntax-ns#type.value"  
  [3] "http://www.w3.org/2002/07/owl#sameAs.type"              
  [4] "http://www.w3.org/2002/07/owl#sameAs.value"             
  [5] "http://www.w3.org/2000/01/rdf-schema#label.type"        
  [6] "http://www.w3.org/2000/01/rdf-schema#label.value"       
  [7] "http://www.w3.org/2000/01/rdf-schema#label.lang"        
  [8] "http://purl.org/dc/terms/subject.type"                  
  [9] "http://purl.org/dc/terms/subject.value"                 
 [10] "http://xmlns.com/foaf/0.1/homepage.type"                
 [11] "http://xmlns.com/foaf/0.1/homepage.value"               
 [12] "http://xmlns.com/foaf/0.1/depiction.type"               
 [13] "http://xmlns.com/foaf/0.1/depiction.value"              
 [14] "http://purl.org/dc/elements/1.1/description.type"       
 [15] "http://purl.org/dc/elements/1.1/description.value"      
 [16] "http://purl.org/dc/elements/1.1/description.lang"       
 [17] "http://xmlns.com/foaf/0.1/givenName.type"               
 [18] "http://xmlns.com/foaf/0.1/givenName.value"              
 [19] "http://xmlns.com/foaf/0.1/givenName.lang"               
 [20] "http://xmlns.com/foaf/0.1/name.type"

By taking a look at the JSON of any of these examples, we can see that “type” refers to the type of value of “value”. So, in this case, we know that we are going to need only the ones that end with “.value”. However, the JSONs are too large to do a complete scan of all its elements, so the wisest thing would be to look for the desired value manually. For example, if we would like to take the date of birth:


> grep("(birth|Birth).*\.value",unique(names(players.unlist)),value=T)
[1] "http://dbpedia.org/ontology/birthName.value"   
[2] "http://dbpedia.org/property/birthDate.value"   
[3] "http://dbpedia.org/property/birthPlace.value"  
[4] "http://dbpedia.org/property/dateOfBirth.value" 
[5] "http://dbpedia.org/property/placeOfBirth.value"
[6] "http://dbpedia.org/ontology/birthDate.value"   
[7] "http://dbpedia.org/ontology/birthPlace.value"  
[8] "http://dbpedia.org/ontology/birthYear.value"   
[9] "http://dbpedia.org/property/birthName.value"

Now we know that there are several different birth dates fields. In this case, we should take a look manually and based upon that choose the one that better fits our needs.In this example, I chose 6 arbitrary variables, extracted its pattern and chose suitable variable names for them. Please notice that, for example, date of death should be empty for Messi and Ronaldo while number is empty in the case of George Best (players didn’t use to have a fixed number in the past). Apart from that, “goals” has multiple entries per player, as the json has one value per club. This is the end of the script:

st.obj <- "^.+surname\.value$"

columns <- c("fullname\.value","ontology/height\.value",
             "dateOfBirth\.value","dateOfDeath\.value",
             "ontology/number\.value","property/goals.value$")

colnames <- c("full.name","height","date.of.birth","date.of.death","number","goals")

players.table <- tabulateJSON(players.unlist,st.obj,columns,colnames)

And this is the final result


> players.table
     full.name                             height  date.of.birth      date.of.death     
[1,] "Lionel Andrés Messi"                 "1.69"  "1987-06-24+02:00" NA                
[2,] "George Best"                         "1.524" "1946-05-22+02:00" "2005-11-25+02:00"
[3,] "Cristiano Ronaldo dos Santos Aveiro" "1.85"  "1985-02-05+02:00" NA                
     number goals                        
[1,] "10"   "6;5;242"                    
[2,] NA     "6;2;3;0;1;15;12;8;33;137;21"
[3,] "7"    "3;84;176"

Finally, we get to the function. tabulateJSON() expects four parameters: an unlisted json (or whatever character vector that has similar characteristics as the ones produced by unlisting a JSON), a string that represents the name of the starting positions of the elements, a vector of characters (normally regex patterns) with the element names to be saught and finally the names to assign to those columns generated.

Now let’s take on how look tabulateJSON() works. Below, the entire code of the function:


tabulateJSON <- function (json.un, start.obj, columns, colnames) 
{
  if (length(columns) != length(colnames)) {
    stop("'columns' and 'colnames' must be the same length")
  }
  start.ind <- grep(start.obj, names(json.un))
  
  col.indexes <- lapply(columns, grep, names(json.un))
  col.position <- lapply(1:length(columns), function(x) findInterval(col.indexes[[x]], start.ind))
  
  
    temp.frames <- lapply(1:length(columns), function(x) data.frame(pos = col.position[[x]], ind = json.un[col.indexes[[x]]], stringsAsFactors = F))
    
    collapse.cols <- which(sapply(temp.frames, nrow) > length(start.ind))
  
  if(length(collapse.cols) > 0){
    temp.frames[collapse.cols] <- lapply(temp.frames[collapse.cols], function(x) 
      ddply(.data = x, .(pos), summarise, value = paste0(ind, collapse = ";")))
    
  }
  
  matr <- Reduce(function(...) merge(...,all=T,by="pos"),temp.frames)
  matr$pos <- NULL
  names(matr) <- colnames
  matr <- as.matrix(matr)
  colnames(matr) <- colnames
  return(matr)
}

How does it work? Firstly, it looks for the items that define a the start of an object based upon the pattern passed and return their indexes. These will be the delimiters.

  start.ind <- grep(start.obj, names(json.un))

After that, it will look for the names value that match the pattern passed in “columns” and return the indexes. Those indexes, have to be assigned to a position (i.e., a row number), which is done with findInterval(). This function maps a number to an interval given cut points.

For each of the variables, a data frame with 2 variables is created, where the first one is the position and the second one the value itself, which is obtained by accessing the values by the column indexes in the character vector. As it will be seen later, this is done in this way because it might be possible that for a sought pattern (that is converted to a variable) there might be more than one match in a row. Of course, that becomes problematic when trying to generate a tabulated dataset. For that reason, it is possible that temp.frames contains data frames with different number of rows.

  col.indexes <- lapply(columns, grep, names(json.un))
  col.position <- lapply(1:length(columns), function(x) findInterval(col.indexes[[x]], start.ind))
    temp.frames <- lapply(1:length(columns), function(x) data.frame(pos = col.position[[x]], ind = json.un[col.indexes[[x]]], stringsAsFactors = F))

After this, it is necessary to collapse those variables which contain multiple values per row. Firstly, it checks whether there is any of the elemnts in the list whose amount of rows is greater than the amount of delimiters (which define the amount of rows). In the case there is any, the function uses plyr’s ddply() to collapse by row specifier (pos). After this process all the data frames in temp.frames will be of equal length:


    collapse.cols <- which(sapply(temp.frames, nrow) > length(start.ind))
  
  if(length(collapse.cols) > 0){
    temp.frames[collapse.cols] <- lapply(temp.frames[collapse.cols], function(x) 
      ddply(.data = x, .(pos), summarise, value = paste0(ind, collapse = ";")))
    
  }

Finally, all the elements in temp.frames are merged one with another, the column name is assigned and “pos” is erased, as it does not belong to the dataset to be done:


  matr <- Reduce(function(...) merge(...,all=T,by="pos"),temp.frames)
  matr$pos <- NULL
  names(matr) <- colnames
  matr <- as.matrix(matr)
  colnames(matr) <- colnames
  return(matr)

Of course, using this function straightforward is a bit “uncomfortable”. For that reason, and depending your particular needs, you can include tabulateJSON() as part of a higher-level function. In this particular example, the function could receive the names of the and a list of high level variables, which will be mapped to a particular pattern, for example:


getPlayerInfo <- function(players,variables){
  
  players.info <- lapply(players, function(pl) fromJSON(getURL(paste0("http://dbpedia.org/data/",pl,".json"))))
  
  players.info <- lapply(1:length(players), function(x) players.info[[x]][[paste0("http://dbpedia.org/resource/",players[x])]])
  
  players.unlist <- unlist(players.info)
  
  st.obj <- "^.+surname\.value$"
  
  columns.to.grep <- paste0(variables,"\.value$")
  
  #Check if there is a multiple match with different types
  
  col.grep <- lapply(columns.to.grep, grep, x=unique(names(players.unlist)))
  
  columns <- sapply(col.grep, function(x) unique(names(players.unlist))[x[1]])
  
  #Convert names to a regex
  
  columns <- gsub("\.","\\\.",columns)
  columns <- paste0("^",columns,"$")
  
  players.table <- tabulateJSON(players.unlist,st.obj,columns,variables)
  
  return(players.table)
  
}

So, the call will be:


getPlayerInfo(c("Lionel_Messi","David_Beckham","Zinedine_Zidane","George_Best","George_Weah"),c("fullname","height","dateOfBirth","dateOfDeath","number","goals"))
     fullname                                  height   dateOfBirth       
[1,] "Lionel Andrés Messi"                     "1.69"   "1987-06-24+02:00"
[2,] "David Robert Joseph Beckham"             "1.8288" "1975-05-02+02:00"
[3,] "Zinedine Yazid Zidane"                   "1.85"   "1972-06-23+02:00"
[4,] "George Best"                             "1.524"  "1946-05-22+02:00"
[5,] "George Tawlon Manneh;Oppong Ousman Weah" "1.84"   "1966-10-01+02:00"
     dateOfDeath        number goals                        
[1,] NA                 "10"   "6;5;242"                    
[2,] NA                 NA     "62;2;0;13;18"               
[3,] NA                 NA     "6;37;28;24"                 
[4,] "2005-11-25+02:00" NA     "6;2;3;0;1;15;12;8;33;137;21"
[5,] NA                 NA     "46;47;32;24;14;13;7;5;3;1"

I hope you enjoyed it and/or found it useful at least 😉

As usual, if you have any comments, suggestions, critics, please drop me a line

To leave a comment for the author, please follow the link and comment on their blog: My Data Atelier » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.