Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I have a challenge: extracting data from an enormous JSON file. The structure of the file is very challenging: it’s a mapping at the top level, which means that for most standard approaches the entire document needs to be loaded before it can be processed. It would have been so much easier if the top level structure was an array. But, alas. It’s almost as if the purveyors of the data have made it intentionally inaccessible.
Getting the Data
The data come from Transparency in Coverage and are in the form of a compressed JSON file.
wget -O united-healthcare-rates.json.gz https://tinyurl.com/united-healthcare-rates
Let’s see how big the file is.
ls -lh united-healthcare-rates.json.gz -rw-rw-r-- 1 ubuntu ubuntu 4.0G Jul 24 20:27 united-healthcare-rates.json.gz
The compressed JSON is 4 GiB. But that’s just the beginning. Now we need to decompress.
gunzip united-healthcare-rates.json.gz
How big is it now?
ls -lh united-healthcare-rates.json -rw-rw-r-- 1 ubuntu ubuntu 122G Aug 3 05:36 united-healthcare-rates.json
Whoah! 122 GiB. That’s a serious chunk of data.
The data schema is described here. However, that schema is of little use if we cannot load and parse the data.
Initial Attempts
My immediate reaction was: okay, I’ll process this in Python. Wrong! The file was far too big to read in one chunk (obviously!). What about using an iterative or streaming approach with something like the ijson
library? Seems like a good idea and would have worked if the JSON file was an array, but since it’s a mappin streaming is not a viable option.
What about loading the data into a document database like MongoDB and then using NoSQL to investigate the contents? Again, this seems like a reasonable idea. However, there are constraints on the size of documents which you can load into a document database. Another dead end.
Simple Shell Solution
There’s a handy shell command for working with JSON data: jq
. Install it if you don’t already have it.
sudo apt install jq
Document Keys
Let’s take a look at the mapping keys.
jq 'keys' united-healthcare-rates.json [ "in_network", "last_updated_on", "provider_references", "reporting_entity_name", "reporting_entity_type", "version" ]
That takes a while to run (because there’s a lot of data to parse), but we can immediately see the number and names of the key/value pairs.
Scalar Components
A few of the top level components of the document are simple scalars ("version"
, "last_updated_on"
, "reporting_entity_name"
and "reporting_entity_type"
). I can access these fairly easily using head
and grep
.
Array Components
The interesting data is in the "in_network"
and "provider_references"
components. I decided to unpack those into separate files so that I could deal with each individually.
jq ".in_network" united-healthcare-rates.json >in-network.json jq ".provider_references" united-healthcare-rates.json >provider-references.json
I now have two JSON files to work with. I extracted the first few records from one of them to show the next steps in the analysis.
For the purpose of this post I’m going to use R. You could equally use Python (or one of any other languages able to work with JSON).
library(tidyverse) library(jsonlite) PROVIDER_REFERENCES <- "provider-references-sample.json" IN_NETWORK <- "in-network-sample.json.bz2"
Let’s take a look at the data.
[ { "provider_groups": [ { "npi": [ 1720734973 ], "tin": { "type": "ein", "value": "870324719" } }, { "npi": [ 1770627234 ], "tin": { "type": "ein", "value": "870569774" } } ], "provider_group_id": 0 }, { "provider_groups": [ { "npi": [ 1386945947, 1588908388 ], "tin": { "type": "ein", "value": "371705906" } } ], "provider_group_id": 1 } ]
Now load a sample of the provider references data and do some rectangling to convert it into a tidy data frame.
unpack_group <- function(group) { with( group, tibble( provider_group_id, provider_groups ) ) %>% unnest_wider(provider_groups) %>% unnest_wider(tin, names_sep = "_") %>% unnest_longer(npi) %>% select(provider_group_id, tin_type, tin_value, npi) } read_json(PROVIDER_REFERENCES) %>% map_dfr(unpack_group) # A tibble: 4 × 4 provider_group_id tin_type tin_value npi <int> <chr> <chr> <int> 1 0 ein 870324719 1720734973 2 0 ein 870569774 1770627234 3 1 ein 371705906 1386945947 4 1 ein 371705906 1588908388
The structure of the in network data is a little more complicated but can be attacked using a similar approach. Here’s a sample of a subset of the fields.
# A tibble: 3 × 5 arrangement code code_type code_type_version name <chr> <chr> <chr> <chr> <chr> 1 ffs 0001A CPT 2022 IMM ADMN SARSCOV2 30MCG/0.3ML D… 2 ffs 0001U CPT 2022 RBC DNA HEA 35 AG 11 BLD GRP WH… 3 ffs 0001 MS-DRG 2022 Heart Transplant or Implant of …
Streaming
Using the read_json()
function to load the JSON data will not be practical with larger files. However, since we now have JSON arrays to deal with we can use a streaming approach.
First convert the data to NDJSON format.
jq -c '.[]' provider-references-sample.json >provider-references-sample-streaming.json
Now handle the document one line at a time using stream_in()
. This completely avoids having to load the entire document. The code below could be a lot more efficient, but it’s just to illustrate the approach.
Take a look at the resulting CSV file.
cat "provider-references-sample.csv" 0,ein,870324719,1720734973 0,ein,870569774,1770627234 1,ein,371705906,1386945947 1,ein,371705906,1588908388
Conclusion
The key to cracking this problem was using the jq
tool to break down the enormous JSON document into smaller parts. The resulting parts were still big, but had a different structure, which meant that I could use streaming techniques to process them.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.