Metadata! Metadata is very cool. It’s super hot right now – everybody is talking about it. Okay, maybe not everyone, but it’s an important part of archiving scholarly work.
We are working on a GitHub repo, rmetadata, to be a one-stop shop for querying metadata from around the web. Several repos we have already started on GitHub (rpmc, rdatacite, rdryad, rpensoft, rhindawi) will be folded, at least in part, into rmetadata.
As a start, we are writing functions to hit any metadata service that uses the OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) framework. OAI-PMH has six methods (or verbs, as they are called) for data harvesting that are the same across different metadata providers:
GetRecord
Identify
ListIdentifiers
ListMetadataFormats
ListRecords
ListSets
The Open Archives Initiative maintains a regularly updated list of data providers, which we can easily use to get the base URLs for their data. Then we just use one of the six verbs above to query their metadata.
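To make that concrete: an OAI-PMH request is just the provider's base URL plus a verb query parameter and any verb-specific arguments. The sketch below builds such URLs with base R only. Note that oai_request_url() is a hypothetical helper written for this illustration, not a function from rmetadata, and the two base URLs are the ones DataCite and Pensoft report for themselves.

```r
# Hypothetical helper (NOT part of rmetadata): build an OAI-PMH request URL
# from a base URL, one of the six verbs, and optional named arguments.
oai_request_url <- function(base_url, verb, ...) {
  verbs <- c("GetRecord", "Identify", "ListIdentifiers",
             "ListMetadataFormats", "ListRecords", "ListSets")
  if (!verb %in% verbs) stop("unknown OAI-PMH verb: ", verb)

  args <- list(...)
  query <- paste0("verb=", verb)
  if (length(args) > 0) {
    # Percent-encode argument values so characters like "/" survive the trip
    values <- vapply(args,
                     function(x) URLencode(as.character(x), reserved = TRUE),
                     character(1))
    query <- paste(c(query, paste0(names(args), "=", values)), collapse = "&")
  }
  paste0(base_url, "?", query)
}

# Identify takes no further arguments
oai_request_url("http://oai.datacite.org/oai", "Identify")
# "http://oai.datacite.org/oai?verb=Identify"

# GetRecord requires an identifier and a metadataPrefix
oai_request_url("http://oai.pensoft.eu", "GetRecord",
                identifier = "10.3897/zookeys.1.10",
                metadataPrefix = "oai_dc")
```

Pasting either URL into a browser should return the raw XML response, assuming the provider is still serving OAI-PMH at that address.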
Let’s install rmetadata first.
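library(devtools)  # install_github() comes from devtools
# newer devtools versions expect install_github("ropensci/rmetadata") instead
install_github("rmetadata", "ropensci")
library(rmetadata)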
The most basic thing you can do with OAI-PMH is identify the data provider and get its basic information, using the Identify verb.
# one provider
md_identify(provider = "datacite")

  repositoryName                     baseURL protocolVersion
1   DataCite MDS http://oai.datacite.org/oai             2.0
          adminEmail    earliestDatestamp deletedRecord
1 admin@datacite.org 2011-01-01T00:00:00Z            no
           granularity compression compression.1
1 YYYY-MM-DDThh:mm:ssZ        gzip       deflate
  description
1 oai, oai.datacite.org, :, oai:oai.datacite.org:12425, http://www.openarchives.org/OAI/2.0/oai-identifier http://www.openarchives.org/OAI/2.0/oai-identifier.xsd
# many providers
md_identify(provider = c("datacite", "pensoft"))

      repositoryName                     baseURL protocolVersion
1       DataCite MDS http://oai.datacite.org/oai             2.0
2 Pensoft Publishers       http://oai.pensoft.eu             2.0
          adminEmail    earliestDatestamp deletedRecord
1 admin@datacite.org 2011-01-01T00:00:00Z            no
2               NULL           2008-07-04            no
           granularity compression compression.1
1 YYYY-MM-DDThh:mm:ssZ        gzip       deflate
2           YYYY-MM-DD        NULL          NULL
  description
1 oai, oai.datacite.org, :, oai:oai.datacite.org:12425, http://www.openarchives.org/OAI/2.0/oai-identifier http://www.openarchives.org/OAI/2.0/oai-identifier.xsd
2 NULL
# no match for one, two matches for other
md_identify(provider = c("harvard", "journal"))

$harvard
               x
1 no match found

$journal
                                               repo_name
1       Hrcak - Portal of scientific journals of Croatia
2 International journal of Power Electronics Engineering
# let's pick one from the second
md_identify(provider = "Hrcak")

                                    repositoryName
1 Hrcak - Portal of scientific journals of Croatia
                    baseURL protocolVersion    adminEmail
1 http://hrcak.srce.hr/oai/             2.0 hrcak@srce.hr
  earliestDatestamp deletedRecord granularity
1        2005-12-01            no  YYYY-MM-DD
  description
1 oai, hrcak.srce.hr, :, oai:hrcak.srce.hr:anIdentifier, http://www.openarchives.org/OAI/2.0/oai-identifier http://www.openarchives.org/OAI/2.0/oai-identifier.xsd
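One practical use of the Identify response is the granularity field: it tells you how precise the date arguments you send to a provider (such as from) are allowed to be. Here is a minimal sketch of such a check in base R. The function matches_granularity() is a hypothetical helper, not part of rmetadata, and it assumes only the two granularities seen above. Under OAI-PMH, a repository that supports second-level datestamps also has to accept day-level ones, which is why the second pattern makes the time part optional.

```r
# Hypothetical helper (NOT part of rmetadata): check whether a datestamp is
# acceptable for the granularity a provider reports via Identify.
matches_granularity <- function(datestamp, granularity) {
  pattern <- switch(
    granularity,
    # day granularity: dates only
    "YYYY-MM-DD" = "^[0-9]{4}-[0-9]{2}-[0-9]{2}$",
    # seconds granularity: full UTC timestamps, plus plain dates
    "YYYY-MM-DDThh:mm:ssZ" =
      "^[0-9]{4}-[0-9]{2}-[0-9]{2}(T[0-9]{2}:[0-9]{2}:[0-9]{2}Z)?$",
    stop("unknown granularity: ", granularity)
  )
  grepl(pattern, datestamp)
}

# Pensoft and Hrcak report day granularity, so a timestamp is too precise
matches_granularity("2012-07-15", "YYYY-MM-DD")            # TRUE
matches_granularity("2011-01-01T00:00:00Z", "YYYY-MM-DD")  # FALSE

# DataCite reports seconds granularity, so both forms pass
matches_granularity("2011-01-01T00:00:00Z", "YYYY-MM-DDThh:mm:ssZ")  # TRUE
matches_granularity("2012-08-21", "YYYY-MM-DDThh:mm:ssZ")            # TRUE
```

This only checks the shape of the string, not whether it is a real calendar date.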
There are a variety of metadata formats, depending on the data provider – list them with the ListMetadataFormats verb.
# List metadata formats for a provider
md_listmetadataformats(provider = "dryad")

  metadataPrefix
1         oai_dc
2            rdf
3            ore
4           mets
                                                       schema
1              http://www.openarchives.org/OAI/2.0/oai_dc.xsd
2                 http://www.openarchives.org/OAI/2.0/rdf.xsd
3 http://tweety.lanl.gov/public/schemas/2008-06/atom-tron.sch
4                  http://www.loc.gov/standards/mets/mets.xsd
                            metadataNamespace
1 http://www.openarchives.org/OAI/2.0/oai_dc/
2    http://www.openarchives.org/OAI/2.0/rdf/
3                 http://www.w3.org/2005/Atom
4                    http://www.loc.gov/METS/
# List metadata formats for a specific identifier for a provider
md_listmetadataformats(provider = "pensoft", identifier = "10.3897/zookeys.1.10")

            identifier metadataPrefix
1 10.3897/zookeys.1.10         oai_dc
2 10.3897/zookeys.1.10           mods
                                             schema
1    http://www.openarchives.org/OAI/2.0/oai_dc.xsd
2 http://www.loc.gov/standards/mods/v3/mods-3-1.xsd
                            metadataNamespace
1 http://www.openarchives.org/OAI/2.0/oai_dc/
2                  http://www.loc.gov/mods/v3
The ListRecords verb is used to harvest records from a repository.
head(md_listrecords(provider = "datacite")[[1]][, 2:4])

                  identifier            datestamp setSpec
1 oai:oai.datacite.org:32153 2011-06-08T08:57:11Z     TIB
2 oai:oai.datacite.org:32200 2011-06-20T08:11:08Z     TIB
3 oai:oai.datacite.org:32220 2011-06-28T14:11:08Z     TIB
4 oai:oai.datacite.org:32241 2011-06-30T13:24:45Z     TIB
5 oai:oai.datacite.org:32255 2011-07-01T12:09:24Z     TIB
6 oai:oai.datacite.org:32282 2011-07-05T09:08:10Z     TIB
ListIdentifiers is an abbreviated form of ListRecords, retrieving only headers rather than records.
# Single provider
md_listidentifiers(provider = "datacite", set = "REFQUALITY")[[1]][1:10]

 [1] "oai:oai.datacite.org:32426" "oai:oai.datacite.org:32152"
 [3] "oai:oai.datacite.org:25453" "oai:oai.datacite.org:25452"
 [5] "oai:oai.datacite.org:25451" "oai:oai.datacite.org:25450"
 [7] "oai:oai.datacite.org:25449" "oai:oai.datacite.org:25407"
 [9] "oai:oai.datacite.org:48328" "oai:oai.datacite.org:48439"
md_listidentifiers(provider = "dryad", from = "2012-07-15")[[1]][1:10]

 [1] "oai:datadryad.org:10255/dryad.9106"
 [2] "oai:datadryad.org:10255/dryad.33780"
 [3] "oai:datadryad.org:10255/dryad.33901"
 [4] "oai:datadryad.org:10255/dryad.33902"
 [5] "oai:datadryad.org:10255/dryad.34472"
 [6] "oai:datadryad.org:10255/dryad.34558"
 [7] "oai:datadryad.org:10255/dryad.39975"
 [8] "oai:datadryad.org:10255/dryad.35065"
 [9] "oai:datadryad.org:10255/dryad.35081"
[10] "oai:datadryad.org:10255/dryad.35082"
# Many providers
library(plyr)  # llply() comes from plyr
out <- md_listidentifiers(provider = c("datacite", "pensoft"), from = "2012-08-21")
llply(out, function(x) x[1:10])  # display just a few of them

[[1]]
 [1] "oai:oai.datacite.org:1099317" "oai:oai.datacite.org:1099572"
 [3] "oai:oai.datacite.org:1099824" "oai:oai.datacite.org:1099695"
 [5] "oai:oai.datacite.org:1088239" "oai:oai.datacite.org:1088122"
 [7] "oai:oai.datacite.org:1088190" "oai:oai.datacite.org:1175749"
 [9] "oai:oai.datacite.org:1175288" "oai:oai.datacite.org:1115603"

[[2]]
 [1] "10.3897/phytokeys.16.2884" "10.3897/phytokeys.16.3602"
 [3] "10.3897/phytokeys.16.3186" "10.3897/zookeys.216.3407"
 [5] "10.3897/zookeys.216.3332"  "10.3897/zookeys.216.3224"
 [7] "10.3897/zookeys.216.3769"  "10.3897/zookeys.216.3360"
 [9] "10.3897/zookeys.216.3646"  "10.3897/neobiota.14.3140"
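Notice that DataCite and Dryad return identifiers of the form oai:repository:local-id (the same pieces the Identify description lists), while Pensoft returns bare DOIs. If you want to pull those pieces apart, something like the following base R sketch should work. The function parse_oai_id() is a hypothetical helper for illustration, not an rmetadata function, and it assumes the three-part colon-delimited layout; anything that does not fit comes back as NA.

```r
# Hypothetical helper (NOT part of rmetadata): split identifiers of the form
# "oai:<repository>:<local id>" into their parts; non-matching ids give NA.
parse_oai_id <- function(ids) {
  m <- regmatches(ids, regexec("^([^:]+):([^:]+):(.+)$", ids))
  # element 1 of each match is the full string; 2 to 4 are the captured groups
  pick <- function(i) vapply(m, function(p) p[i], character(1))
  data.frame(id = ids,
             scheme = pick(2),
             repository = pick(3),
             local_id = pick(4),
             stringsAsFactors = FALSE)
}

ids <- c("oai:oai.datacite.org:32153",
         "oai:datadryad.org:10255/dryad.9106",
         "10.3897/phytokeys.16.2884")  # a bare DOI, as Pensoft returns
parse_oai_id(ids)
```

The first two rows split into scheme, repository, and local id; the Pensoft DOI has no colons, so its parsed columns are NA.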
With ListSets you can retrieve the set structure of a repository.
# arXiv, returns a data.frame
head(md_listsets(provider = "arXiv")[[1]])

             setName          setSpec
1   Computer Science               cs
2        Mathematics             math
3 Nonlinear Sciences             nlin
4            Physics          physics
5       Astrophysics physics:astro-ph
6   Condensed Matter physics:cond-mat
# many providers, returns a list
md_listsets(provider = c("pensoft", "arXiv"))

[[1]]
                                  setName            setSpec
1                                 ZooKeys            zookeys
2                                 BioRisk            biorisk
3                               PhytoKeys          phytokeys
4                                NeoBiota           neobiota
5         Journal of Hymenoptera Research                jhr
6  International Journal of Myriapodology                ijm
7                Comparative Cytogenetics        compcytogen
8                    Subterranean Biology           subtbiol
9                     Nature Conservation natureconservation
10                               MycoKeys           mycokeys

[[2]]
                                    setName          setSpec
1                          Computer Science               cs
2                               Mathematics             math
3                        Nonlinear Sciences             nlin
4                                   Physics          physics
5                              Astrophysics physics:astro-ph
6                          Condensed Matter physics:cond-mat
7  General Relativity and Quantum Cosmology    physics:gr-qc
8          High Energy Physics - Experiment   physics:hep-ex
9             High Energy Physics - Lattice  physics:hep-lat
10      High Energy Physics - Phenomenology   physics:hep-ph
11             High Energy Physics - Theory   physics:hep-th
12                     Mathematical Physics  physics:math-ph
13                       Nuclear Experiment  physics:nucl-ex
14                           Nuclear Theory  physics:nucl-th
15                          Physics (Other)  physics:physics
16                          Quantum Physics physics:quant-ph
17                     Quantitative Biology            q-bio
18                     Quantitative Finance            q-fin
19                               Statistics             stat
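The colons in arXiv's setSpec values encode a hierarchy: physics:astro-ph is the astro-ph set nested under physics. As a small sketch of how you might work with that, the base R snippet below splits each setSpec into its levels. The data frame is typed in by hand from a few rows of the output above rather than fetched, and the top_level and depth columns are names made up for this example.

```r
# A few rows copied by hand from the arXiv ListSets output (not fetched live)
sets <- data.frame(
  setName = c("Physics", "Astrophysics", "Condensed Matter", "Statistics"),
  setSpec = c("physics", "physics:astro-ph", "physics:cond-mat", "stat"),
  stringsAsFactors = FALSE
)

# Split each setSpec on ":" to recover its path from the top of the hierarchy
parts <- strsplit(sets$setSpec, ":", fixed = TRUE)
sets$top_level <- vapply(parts, function(p) p[1], character(1))
sets$depth <- lengths(parts)  # 1 = top-level set, 2 = nested one level down

# All sets that sit under physics, including physics itself
sets[sets$top_level == "physics", ]
```

A value from top_level or setSpec is what you would pass as the set argument when you only want to harvest part of a repository.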
Retrieve an individual metadata record from a repository using the GetRecord verb.
# Single provider, one identifier
md_getrecord(provider = "pensoft", identifier = "10.3897/zookeys.1.10")

            identifier  datestamp
1 10.3897/zookeys.1.10 2008-07-04
  dc.title
1 A new candidate for a Gondwanaland distribution in the Zodariidae (Araneae): Australutica in Africa
   dc.creator  dc.subject dc.subject.1 dc.subject.2 dc.subject.3
1 Jocqué,Rudy new species Gondwanaland Soutpansberg      Araneae
         dc.source
1 ZooKeys 1: 59-66
  dc.description
1 Two new species of Australutica Jocqué, 1995, a genus formerly only known from Australia, are described from South Africa: A. africana n. sp. from Soutpansberg and A. normanlarseni n. sp. from the Cape Peninsula. The taxonomic position of the new species is discussed and a key to the species of Australutica is provided.
        dc.publisher dc.date          dc.type dc.format
1 Pensoft Publishers    2008 Research Article text/html
                           dc.identifier
1 http://dx.doi.org/10.3897/zookeys.1.10
                                      dc.identifier.1 dc.language
1 http://www.pensoft.net/journals/zookeys/article/10/          en
  dc..attrs
1 http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd
# Single provider, multiple identifiers
md_getrecord(provider = "pensoft", identifier = c("10.3897/zookeys.1.10", "10.3897/zookeys.4.57"))

            identifier  datestamp
1 10.3897/zookeys.1.10 2008-07-04
2 10.3897/zookeys.4.57 2008-12-17
  dc.title
1 A new candidate for a Gondwanaland distribution in the Zodariidae (Araneae): Australutica in Africa
2 Studies of Tiger Beetles. CLXXVIII. A new Lophyra (Lophyra) from Somaliland (Coleoptera, Cicindelidae)
     dc.creator    dc.subject dc.subject.1 dc.subject.2 dc.subject.3
1   Jocqué,Rudy   new species Gondwanaland Soutpansberg      Araneae
2 Cassola,Fabio Tiger Beetles Cicindelidae      Lophyra   Somaliland
         dc.source
1 ZooKeys 1: 59-66
2 ZooKeys 4: 65-69
  dc.description
1 Two new species of Australutica Jocqué, 1995, a genus formerly only known from Australia, are described from South Africa: A. africana n. sp. from Soutpansberg and A. normanlarseni n. sp. from the Cape Peninsula. The taxonomic position of the new species is discussed and a key to the species of Australutica is provided.
2 A new tiger beetle species, Lophyra (Lophyra) praetermissa n. sp. (Coleoptera, Cicindelidae), obviously a close relative of L. (L.) histrio (Tschitschérine, 1903), is described from the environs of Erigavo, Somaliland (northern Somalia). Its discovery thus brings up to 73 the number of the species of this genus presently known worldwide (39 species of which - 29 from Africa - belong to the typonominal subgenus).
        dc.publisher dc.date          dc.type dc.format
1 Pensoft Publishers    2008 Research Article text/html
2 Pensoft Publishers    2008 Research Article text/html
                           dc.identifier
1 http://dx.doi.org/10.3897/zookeys.1.10
2 http://dx.doi.org/10.3897/zookeys.4.57
                                      dc.identifier.1 dc.language
1 http://www.pensoft.net/journals/zookeys/article/10/          en
2 http://www.pensoft.net/journals/zookeys/article/57/          en
  dc..attrs
1 http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd
2 http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd
Cool, so I hope people find this post and package useful. Let me know what you think in the comments below, or if you have code-specific comments or additions, go to the GitHub repo for rmetadata. In an upcoming post I will show an example of what you can do with rmetadata in terms of an actual research question.
Get the .Rmd file used to create this post at my GitHub account, or the .md file.
Written in Markdown, with help from knitr, and nice knitr highlighting/etc. in RStudio.