Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
library(rjson) library(RCurl) players <- c("Lionel_Messi","George_Best","Cristiano_Ronaldo") players.info <- lapply(players, function(pl) fromJSON(getURL(paste0("http://dbpedia.org/data/",pl,".json")))) players.info <- lapply(1:length(players), function(x) players.info[[x]][[paste0("http://dbpedia.org/resource/",players[x])]]) players.unlist <- unlist(players.info)
This is the point of manual intervention, where the variable selection has to be done. At this stage, it is highly recommended to have at least one example loaded in some JSON parser (JSON Viewer, or any browser’s JSONView add-in). The names of the vector generated by unlist() (“players.unlist” in the example) are a concatenation of the names of the lists that value was in, separated by “.”.
The elements in the character vector are ordered in the same way they appeared in the lists. So, the first thing to do is to establish where each of the cases start. In order to do so, the challenge now is to find a common pattern between the vector names which can identify where each of the elements start. This element does not necessarily have to be inside the final table. In this case I chose “surname\.value$”. However, I would recommend, in most of the cases, to choose the first element that appears in the object.
Similarly, the challenge now is to find common patterns in the names of the vector elements that will define each of the variables. In order to do that, the wisest would be to takle a look at the names that you have. In this case:
> unique(names(players.unlist)) [1] "http://www.w3.org/1999/02/22-rdf-syntax-ns#type.type" [2] "http://www.w3.org/1999/02/22-rdf-syntax-ns#type.value" [3] "http://www.w3.org/2002/07/owl#sameAs.type" [4] "http://www.w3.org/2002/07/owl#sameAs.value" [5] "http://www.w3.org/2000/01/rdf-schema#label.type" [6] "http://www.w3.org/2000/01/rdf-schema#label.value" [7] "http://www.w3.org/2000/01/rdf-schema#label.lang" [8] "http://purl.org/dc/terms/subject.type" [9] "http://purl.org/dc/terms/subject.value" [10] "http://xmlns.com/foaf/0.1/homepage.type" [11] "http://xmlns.com/foaf/0.1/homepage.value" [12] "http://xmlns.com/foaf/0.1/depiction.type" [13] "http://xmlns.com/foaf/0.1/depiction.value" [14] "http://purl.org/dc/elements/1.1/description.type" [15] "http://purl.org/dc/elements/1.1/description.value" [16] "http://purl.org/dc/elements/1.1/description.lang" [17] "http://xmlns.com/foaf/0.1/givenName.type" [18] "http://xmlns.com/foaf/0.1/givenName.value" [19] "http://xmlns.com/foaf/0.1/givenName.lang" [20] "http://xmlns.com/foaf/0.1/name.type"
> grep("(birth|Birth).*\.value",unique(names(players.unlist)),value=T) [1] "http://dbpedia.org/ontology/birthName.value" [2] "http://dbpedia.org/property/birthDate.value" [3] "http://dbpedia.org/property/birthPlace.value" [4] "http://dbpedia.org/property/dateOfBirth.value" [5] "http://dbpedia.org/property/placeOfBirth.value" [6] "http://dbpedia.org/ontology/birthDate.value" [7] "http://dbpedia.org/ontology/birthPlace.value" [8] "http://dbpedia.org/ontology/birthYear.value" [9] "http://dbpedia.org/property/birthName.value"
Now we know that there are several different birth dates fields. In this case, we should take a look manually and based upon that choose the one that better fits our needs.In this example, I chose 6 arbitrary variables, extracted its pattern and chose suitable variable names for them. Please notice that, for example, date of death should be empty for Messi and Ronaldo while number is empty in the case of George Best (players didn’t use to have a fixed number in the past). Apart from that, “goals” has multiple entries per player, as the json has one value per club. This is the end of the script:
st.obj <- "^.+surname\.value$" columns <- c("fullname\.value","ontology/height\.value", "dateOfBirth\.value","dateOfDeath\.value", "ontology/number\.value","property/goals.value$") colnames <- c("full.name","height","date.of.birth","date.of.death","number","goals") players.table <- tabulateJSON(players.unlist,st.obj,columns,colnames)
And this is the final result
> players.table full.name height date.of.birth date.of.death [1,] "Lionel Andrés Messi" "1.69" "1987-06-24+02:00" NA [2,] "George Best" "1.524" "1946-05-22+02:00" "2005-11-25+02:00" [3,] "Cristiano Ronaldo dos Santos Aveiro" "1.85" "1985-02-05+02:00" NA number goals [1,] "10" "6;5;242" [2,] NA "6;2;3;0;1;15;12;8;33;137;21" [3,] "7" "3;84;176"
Now let’s take on how look tabulateJSON() works. Below, the entire code of the function:
tabulateJSON <- function (json.un, start.obj, columns, colnames) { if (length(columns) != length(colnames)) { stop("'columns' and 'colnames' must be the same length") } start.ind <- grep(start.obj, names(json.un)) col.indexes <- lapply(columns, grep, names(json.un)) col.position <- lapply(1:length(columns), function(x) findInterval(col.indexes[[x]], start.ind)) temp.frames <- lapply(1:length(columns), function(x) data.frame(pos = col.position[[x]], ind = json.un[col.indexes[[x]]], stringsAsFactors = F)) collapse.cols <- which(sapply(temp.frames, nrow) > length(start.ind)) if(length(collapse.cols) > 0){ temp.frames[collapse.cols] <- lapply(temp.frames[collapse.cols], function(x) ddply(.data = x, .(pos), summarise, value = paste0(ind, collapse = ";"))) } matr <- Reduce(function(...) merge(...,all=T,by="pos"),temp.frames) matr$pos <- NULL names(matr) <- colnames matr <- as.matrix(matr) colnames(matr) <- colnames return(matr) }
How does it work? Firstly, it looks for the items that define a the start of an object based upon the pattern passed and return their indexes. These will be the delimiters.
start.ind <- grep(start.obj, names(json.un))
After that, it will look for the names value that match the pattern passed in “columns” and return the indexes. Those indexes, have to be assigned to a position (i.e., a row number), which is done with findInterval(). This function maps a number to an interval given cut points.
For each of the variables, a data frame with 2 variables is created, where the first one is the position and the second one the value itself, which is obtained by accessing the values by the column indexes in the character vector. As it will be seen later, this is done in this way because it might be possible that for a sought pattern (that is converted to a variable) there might be more than one match in a row. Of course, that becomes problematic when trying to generate a tabulated dataset. For that reason, it is possible that temp.frames contains data frames with different number of rows.
col.indexes <- lapply(columns, grep, names(json.un)) col.position <- lapply(1:length(columns), function(x) findInterval(col.indexes[[x]], start.ind)) temp.frames <- lapply(1:length(columns), function(x) data.frame(pos = col.position[[x]], ind = json.un[col.indexes[[x]]], stringsAsFactors = F))
collapse.cols <- which(sapply(temp.frames, nrow) > length(start.ind)) if(length(collapse.cols) > 0){ temp.frames[collapse.cols] <- lapply(temp.frames[collapse.cols], function(x) ddply(.data = x, .(pos), summarise, value = paste0(ind, collapse = ";"))) }
matr <- Reduce(function(...) merge(...,all=T,by="pos"),temp.frames) matr$pos <- NULL names(matr) <- colnames matr <- as.matrix(matr) colnames(matr) <- colnames return(matr)
getPlayerInfo <- function(players,variables){ players.info <- lapply(players, function(pl) fromJSON(getURL(paste0("http://dbpedia.org/data/",pl,".json")))) players.info <- lapply(1:length(players), function(x) players.info[[x]][[paste0("http://dbpedia.org/resource/",players[x])]]) players.unlist <- unlist(players.info) st.obj <- "^.+surname\.value$" columns.to.grep <- paste0(variables,"\.value$") #Check if there is a multiple match with different types col.grep <- lapply(columns.to.grep, grep, x=unique(names(players.unlist))) columns <- sapply(col.grep, function(x) unique(names(players.unlist))[x[1]]) #Convert names to a regex columns <- gsub("\.","\\\.",columns) columns <- paste0("^",columns,"$") players.table <- tabulateJSON(players.unlist,st.obj,columns,variables) return(players.table) }
So, the call will be:
getPlayerInfo(c("Lionel_Messi","David_Beckham","Zinedine_Zidane","George_Best","George_Weah"),c("fullname","height","dateOfBirth","dateOfDeath","number","goals")) fullname height dateOfBirth [1,] "Lionel Andrés Messi" "1.69" "1987-06-24+02:00" [2,] "David Robert Joseph Beckham" "1.8288" "1975-05-02+02:00" [3,] "Zinedine Yazid Zidane" "1.85" "1972-06-23+02:00" [4,] "George Best" "1.524" "1946-05-22+02:00" [5,] "George Tawlon Manneh;Oppong Ousman Weah" "1.84" "1966-10-01+02:00" dateOfDeath number goals [1,] NA "10" "6;5;242" [2,] NA NA "62;2;0;13;18" [3,] NA NA "6;37;28;24" [4,] "2005-11-25+02:00" NA "6;2;3;0;1;15;12;8;33;137;21" [5,] NA NA "46;47;32;24;14;13;7;5;3;1"
I hope you enjoyed it and/or found it useful at least 😉
As usual, if you have any comments, suggestions, critics, please drop me a line
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.