Querying Zenodo.org repository with R
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Zenodo
Zenodo is a repository that allows anyone to deposit any type of research output, free of charge, in all disciplines of science.
EFSA is piloting its use for creating a knowledge base on all types of food safety related evidence (data, documents, models).
Zenodo has an API and can be queried using the standard OAI-PMH protocol, which makes it possible to harvest the metadata of all deposits.
‘oai’ package
R has a package available to query any OAI-PMH repository, including Zenodo. It can be installed from CRAN like this:
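The original command is not preserved here. The standard way to install a released package from CRAN, which is presumably what was meant, is:

```r
# One-time setup: install the released version of 'oai' from CRAN.
install.packages("oai")
```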
The development version is available on GitHub at https://github.com/ropensci/oai
The libraries I use in this tutorial are:
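The original list of libraries is not preserved here. Judging from the functions named later in the text (map() and reduce(), read_xml(), bind_rows(), and the tidyr helpers), a plausible reconstruction is the following. Treat it as an assumption, not the author's exact list.

```r
# Packages inferred from the functions mentioned in this tutorial (an assumption).
library(oai)    # OAI-PMH client: list_records(), get_records()
library(xml2)   # read_xml(), xml_find_all(), xml_text()
library(purrr)  # map(), reduce()
library(dplyr)  # bind_rows(), joins, sorting
library(tidyr)  # clean-up helpers
```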
Retrieve records from Zenodo
The oai package can retrieve all records of a given Zenodo community, in this case the EFSA pilot community. The following code shows all records of the community with their digital object identifier (DOI) and title.
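A minimal sketch of such a query, since the original code is not preserved. The endpoint URL `https://zenodo.org/oai2d` and the set name `user-efsa-pilot` are assumptions, as is the helper name `list_community_records`. The column `identifier.3` is taken from the table header below; which column holds the DOI may depend on the response.

```r
# Assumed values: Zenodo's OAI-PMH endpoint and the set name of the community.
zenodo_oai <- "https://zenodo.org/oai2d"
efsa_set   <- "user-efsa-pilot"

# Hypothetical helper: list all records of one community (Dublin Core metadata)
# and keep only the DOI column and the title.
list_community_records <- function(url = zenodo_oai, set = efsa_set) {
  records <- oai::list_records(url, prefix = "oai_dc", set = set)
  records[, c("identifier.3", "title")]
}

# The query needs network access, so it only runs in an interactive session.
if (interactive()) {
  print(list_community_records())
}
```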
identifier.3 | title |
---|---|
10.5281/zenodo.57132 | EFSA Source Attribution Model (EFSA_SAM) |
10.5281/zenodo.57017 | PRIMo rev.1 – Pesticide Residue Intake Model |
10.5281/zenodo.56662 | Bee-Tool V.1 |
10.5281/zenodo.56668 | Bee-Tool V.2 |
10.5281/zenodo.154720 | Egg Pooling Module |
10.5281/zenodo.161300 | GMOANALYSIS VERSION 2.1.0 – 10 JULY 2014 |
10.5281/zenodo.159163 | Pesticide Residues Overview File: PROFile (3.0) |
10.5281/zenodo.154725 | Food Additives Intake Model (FAIM) – Version 1.1 – July 2013 |
10.5281/zenodo.163080 | Modelling continental-scale spread of Schmallenberg virus in Europe |
10.5281/zenodo.57079 | C-TSEMM – Cattle TSE Monitoring Model |
10.5281/zenodo.57505 | TSEi – TSE Infectivity Model |
10.5281/zenodo.159414 | Dietary Exposure Calculator Smoke Flavouring |
10.5281/zenodo.159890 | CHIP: Commodity based Hazard Identification Tool |
10.5281/zenodo.56287 | PRIMo rev.2 – Pesticide Residue Intake Model |
10.5281/zenodo.56669 | Bee-Tool V.3 |
10.5281/zenodo.161298 | Exposure of operators, workers, residents and bystanders in risk assessment for plant protection products calculator (Version 30MAR2015) |
10.5281/zenodo.163026 | Within farms transmission model for Schmallenberg Virus |
10.5281/zenodo.154724 | User-friendly interface version of the QMRA model for Salmonella in pigs |
Currently there are 18 records available.
Statistics on keywords
Query records from Zenodo
I was also interested in the current distribution of the keywords each record was tagged with. Zenodo supports two types of keywords: simple free-text keywords and ‘subjects’. Subjects need to come from a controlled vocabulary in which each topic has a URI.
EFSA uses the GACS vocabulary, so the topic ‘salmonella’, for example, is represented by the URI ‘http://browser.agrisemantics.org/gacs/en/page/C2225’.
For subjects, the API therefore returns only the URI, which is unique and unambiguous but not user-friendly as a label. Additional information is available at the URI of each ‘subject’.
The following code retrieves all records and extracts all their subjects (which have an XPath of //d3:subject). The current oai package has some problems with some Zenodo-specific metadata, so I parse the raw XML by hand.
The OAI-PMH standard and the oai::get_records function allow the client to select the metadata format in which it wants to receive the metadata. Here I have selected ‘oai-datacite’, because it is recommended by the Zenodo API documentation and should contain all metadata Zenodo supports, while other metadata formats might only support a smaller subset.
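The retrieval step could look roughly like this. It is a sketch under assumptions: the endpoint URL, the set name, the metadata prefix spelled `oai_datacite`, and the helper name `fetch_raw_records` are not taken from the original code, and the exact shape of the value returned with `as = "raw"` may differ between versions of the oai package.

```r
# Assumed values: Zenodo's OAI-PMH endpoint and the set name of the community.
zenodo_oai <- "https://zenodo.org/oai2d"
efsa_set   <- "user-efsa-pilot"

# Hypothetical helper: look up the record identifiers of a community, then
# fetch each record as unparsed XML in the requested metadata format.
fetch_raw_records <- function(url = zenodo_oai, set = efsa_set,
                              prefix = "oai_datacite") {
  ids <- oai::list_identifiers(url, prefix = prefix, set = set)
  oai::get_records(ids$identifier, prefix = prefix, url = url, as = "raw")
}

# The query needs network access, so it only runs in an interactive session.
if (interactive()) {
  raw_xml <- fetch_raw_records()
  length(raw_xml)
}
```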
. | n |
---|---|
food additives | 1 |
food additives intake model | 1 |
food composition difference testing | 1 |
http://id.agrisemantics.org/gacs/C22070 | 2 |
http://id.agrisemantics.org/gacs/C22092 | 1 |
http://id.agrisemantics.org/gacs/C2225 | 3 |
I use the ‘map’ function from the ‘purrr’ package to apply a number of transformations to every element of the result (each of which is initially an XML string):
- read_xml() – to convert from string to class xml_document
- xml_find_all() – to find all xml nodes given by xpath expression
- xml_text() – get the text from the xml node
Then I combine all this via c() and the reduce() function to obtain a single list of all subjects.
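The steps above can be sketched offline as follows. The two XML strings are hand-made stand-ins for real Zenodo responses (heavily simplified, and produced by a hypothetical `make_record` helper). They nest three default namespaces, which xml2 names d1, d2 and d3 in order of appearance, so the DataCite `subject` elements are reachable as //d3:subject.

```r
library(xml2)
library(purrr)

# Hypothetical helper: build a simplified stand-in for one raw OAI-PMH record.
make_record <- function(subjects) {
  paste0(
    '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">',
    '<oai_datacite xmlns="http://schema.datacite.org/oai/oai-1.0/">',
    '<resource xmlns="http://datacite.org/schema/kernel-3">',
    '<subjects>',
    paste0('<subject>', subjects, '</subject>', collapse = ''),
    '</subjects></resource></oai_datacite></OAI-PMH>'
  )
}

# Two fake records: one with a free-text keyword and a GACS URI, one with a URI only.
raw_xml <- list(
  make_record(c("food additives", "http://id.agrisemantics.org/gacs/C2225")),
  make_record("http://id.agrisemantics.org/gacs/C2225")
)

subjects <- raw_xml %>%
  map(read_xml) %>%                                        # string -> xml_document
  map(function(doc) xml_find_all(doc, "//d3:subject")) %>% # all subject nodes
  map(xml_text) %>%                                        # node -> text
  reduce(c)                                                # one character vector

print(subjects)
```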
The API returns both types of subjects, the generic keywords and the terms referring to a controlled vocabulary.
The table() command then produces a frequency table for them, of which I show a subset here. This table contains some entries with an English label and some with a GACS URI.
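As a small illustration of the counting step, here is table() applied to a made-up vector of subjects. The values are illustrative only and are not the real query result.

```r
# Illustrative input: a mix of free-text keywords and GACS URIs.
subjects <- c(
  "food additives",
  "http://id.agrisemantics.org/gacs/C2225",
  "http://id.agrisemantics.org/gacs/C22070",
  "http://id.agrisemantics.org/gacs/C2225",
  "http://id.agrisemantics.org/gacs/C2225"
)

# table() counts how often each distinct value occurs;
# as.data.frame() turns the result into columns 'subjects' and 'Freq'.
freq <- as.data.frame(table(subjects), stringsAsFactors = FALSE)
print(freq)
```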
Add human readable label to GACS topics
To add a human-readable label to each GACS URI, I use the GACS API, which can be queried for information on each topic. I call the API for each URI and build a table in which each row contains a list of (URI, label). This is then converted into a table with bind_rows().
I again use the ‘map’ function with an anonymous function, which makes the call to the GACS API. GACS uses the [Skosmos](https://github.com/NatLibFi/Skosmos) software, so it has an [API](http://api.finto.fi/doc/) to query the vocabulary.
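A sketch of the lookup, assuming the GACS installation exposes the Skosmos REST `label` method (which returns a JSON object with a `prefLabel` field) under the base URL shown. That base URL is an assumption, as are the helper names. To keep the example runnable without network access, the HTTP call is injectable and replaced here by a stand-in that returns two labels taken from the final table of this post.

```r
library(purrr)
library(dplyr)

# Assumed base URL of the Skosmos REST API serving GACS (not confirmed by the text).
gacs_api <- "http://browser.agrisemantics.org/rest/v1/gacs"

# Hypothetical helper: ask the 'label' method for one concept URI and return
# list(uri, label). 'fetch' defaults to a real HTTP + JSON call.
get_gacs_label <- function(uri, base = gacs_api, lang = "en",
                           fetch = jsonlite::fromJSON) {
  request <- paste0(base, "/label?lang=", lang,
                    "&uri=", utils::URLencode(uri, reserved = TRUE))
  response <- fetch(request)
  list(uri = uri, label = response$prefLabel)
}

# Offline stand-in for the API, keyed by the concept id at the end of the request.
fake_fetch <- function(request) {
  known <- c(C2225 = "Salmonella", C1470 = "risk assessment")
  id <- sub(".*%2F", "", request)  # text after the last URL-encoded slash
  list(prefLabel = unname(known[id]))
}

uris <- c("http://id.agrisemantics.org/gacs/C2225",
          "http://id.agrisemantics.org/gacs/C1470")

# One list(uri, label) per URI, stacked into a two-column table.
labels <- uris %>%
  map(function(u) get_gacs_label(u, fetch = fake_fetch)) %>%
  bind_rows()

print(labels)
```

Dropping the `fetch = fake_fetch` argument would make the helper call the live service instead, provided the assumed base URL is correct and the jsonlite package is installed.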
uri | label |
---|---|
http://id.agrisemantics.org/gacs/C10152 | Bayesian theory |
http://id.agrisemantics.org/gacs/C10826 | commodities |
http://id.agrisemantics.org/gacs/C12237 | flavourings |
http://id.agrisemantics.org/gacs/C1263 | screening |
http://id.agrisemantics.org/gacs/C14046 | emerging infectious diseases |
Distributions of labels in efsa-pilot community
To get the final table, I join the label-GACS pairs with the former table and do some clean-up with functions from the tidyr package.
The table is then sorted by frequency and shown on the screen.
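A minimal sketch of this join and clean-up on a three-row example. The column names and the use of unite() are assumptions about how the original code worked, and `na.rm = TRUE` in unite() requires tidyr 1.0.0 or later. The counts and labels are copied from the final table for illustration.

```r
library(dplyr)
library(tidyr)

# Frequency table: GACS URIs and free-text keywords with their counts.
freq <- tibble(
  subject = c("http://id.agrisemantics.org/gacs/C2225",
              "Epidemiology",
              "http://id.agrisemantics.org/gacs/C1470"),
  n = c(3, 2, 8)
)

# Label lookup table as returned by the vocabulary API.
labels <- tibble(
  uri   = c("http://id.agrisemantics.org/gacs/C2225",
            "http://id.agrisemantics.org/gacs/C1470"),
  label = c("Salmonella", "risk assessment")
)

result <- freq %>%
  left_join(labels, by = c("subject" = "uri")) %>%  # free-text keywords get label NA
  unite("label", label, subject, sep = " - ", na.rm = TRUE) %>%  # "label - URI", or the keyword alone
  select(label, count = n) %>%
  arrange(desc(count))                              # most frequent first

print(result)
```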
As we can see, the most frequent terms include ‘risk assessment’ and ‘exposure assessment’, which is no surprise as these are at the core of EFSA’s scientific work.
label | count |
---|---|
risk assessment – http://id.agrisemantics.org/gacs/C1470 | 8 |
quantitative analysis – http://id.agrisemantics.org/gacs/C603 | 7 |
exposure assessment – http://id.agrisemantics.org/gacs/C29232 | 6 |
population – http://id.agrisemantics.org/gacs/C2955 | 5 |
prion diseases – http://id.agrisemantics.org/gacs/C18728 | 4 |
pesticides – http://id.agrisemantics.org/gacs/C284 | 4 |
Apoidea – http://id.agrisemantics.org/gacs/C1932 | 3 |
Salmonella – http://id.agrisemantics.org/gacs/C2225 | 3 |
pesticide residues – http://id.agrisemantics.org/gacs/C3009 | 3 |
linear models – http://id.agrisemantics.org/gacs/C3504 | 3 |
model validation – http://id.agrisemantics.org/gacs/C4332 | 3 |
time – http://id.agrisemantics.org/gacs/C4525 | 3 |
pollinators – http://id.agrisemantics.org/gacs/C5325 | 3 |
decision support systems – http://id.agrisemantics.org/gacs/C8154 | 3 |
acute risk assesment | 2 |
chronic risk assesment | 2 |
Epidemiology | 2 |
exposure assessment | 2 |
bovine spongiform encephalopathy – http://id.agrisemantics.org/gacs/C14182 | 2 |
calculation – http://id.agrisemantics.org/gacs/C15337 | 2 |
Monitoring this distribution regularly can help to keep the list of keywords clean and, eventually, to propose additional subjects for the GACS vocabulary.
Session info