Importing Large NDJSON Files into R

Matt

4 years ago

[This article was first published on RLang.io | R Language Programming, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I ran into this problem recently when trying to import the data my twitter scraper produced and thought this might make a worthwhile post.

The file I was trying to import was ~30GB, which is absolutely monsterous. This was in part do to all of the fields I didn’t bother dropping before writing them to my data.json file.

The Process

The first thing I needed to do was figure out a managable size. Thankfully the ndjson format keeps the entire record on one line, so I could split the lines into an undetermined amount of files based on a known number of records my system was able to process with my memory (RAM) limit. I decided on 50,000 records, knowing my system could handle about 800,000 before filling up my RAM and paging file and that I planned on parallizing the process (16 threads) to speed it up quite dramatically.

I made sure I had an empty folder to write the split file segments to, and ran this command from my working directory in Terminal.

split -l 50000 data.json ./import/tweets_

Simply, right? Now we will probably want to see the variables (technically properties since these are javascript objects).

head -1 import/tweets_da | grep -oP '"([a-zA-Z0-9\-_]+)"\:'

This gives you an output similar to this

"id":
"text":
"source":
"truncated":
"user":
"id":
"name":
"location":
"url":
"description":
"protected":
"verified":
"lang":
"following":
"notifications":
"geo":
"coordinates":
"place":
"contributors":
"id":
"text":
"source":
"truncated":
"user":
"id":
"name":
"location":
"url":
"description":
"protected":
"verified":
...

Regular expressions are the best, arent they? Now for the R code which makes this buildup actually worthwhile.

library("data.table")
library("parallel")
library("jsonlite")
#Parallize this process on 16 threads
cluster <- makeCluster(16)
#Export the jsonlite function stream_in to the cluster
clusterExport(cluster,list("stream_in"))
#Create an empty list for the dataframe for each file
import <- list()
#Run this function on every file in the ./import directory
import <- parLapply(cluster,list.files(path = "./import"),function(file) {
    #jsonlite function to convert the ndjson file to a dataframe
  df <- stream_in(file(paste0("./import/",file)))
    #select which columns to keep
  df <- df[,c("text","created_at","lat","lng","id_str")]
  return(df)
})
#function called from the data.table library
df <- rbindlist(import)
#Now you can stop the cluster
stopCluster(cluster)

Now the system won’t bonk since it only is keeping in 5 variables! You will notice your RAM fluctuate quite a bit while reading in files, since the initial stream_in() loads all of the properties into the dataframe (sometimes with nesting). Once the columns are omitted the memory is freed up. Happy programming

To leave a comment for the author, please follow the link and comment on their blog: RLang.io | R Language Programming.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.