Microsoft Office Metadata with R
[This article was first published on Joe's Data Diner, and kindly contributed to R-bloggers].
Sometimes I need to retrieve various items of metadata from Microsoft Office files. For the ‘old-style’ (i.e. ‘.doc’ and ‘.xls’) files, a solution in Python such as hachoir was perhaps the best way to extract this data from the OLE2 file format, although perhaps it was always possible in R too? When I started digging around for a similar solution for the ‘new-style’ (i.e. ‘.xlsx’ and ‘.docx’) files, I was pleasantly surprised to find the file structure is much more open; indeed, it is called Office Open XML. I am by no means an expert, but basically it is a zipped set of XML files. This makes getting at the metadata so much easier. I found a simple example in Python by zeekay on Stack Overflow. My code below is an unashamed replication of this in R.
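To see the "zipped set of XML files" structure for yourself, here is a minimal sketch that lists the entries inside an Office file without extracting anything. The function name list_ooxml_parts is my own invention for illustration, and 'test.docx' is just a placeholder file name.

```r
# List the entries of an Office Open XML file without extracting anything.
# 'test.docx' is a placeholder; substitute any .docx or .xlsx file.
# list_ooxml_parts is a hypothetical helper name, not part of any package.
list_ooxml_parts <- function(path) {
  # With list = TRUE, unzip() returns a data frame (Name, Length, Date)
  # describing the archive contents instead of extracting them.
  entries <- unzip(path, list = TRUE)
  entries$Name
}

if (file.exists('test.docx')) {
  print(list_ooxml_parts('test.docx'))
}
```

For a Word document the listing typically includes entries such as docProps/core.xml, docProps/app.xml and word/document.xml, although the exact set depends on the file.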
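library(XML)
# Use R's built-in unzip() function, knowing that the required metadata is in docProps/core.xml.
# unzip() extracts the file and returns its path; exdir keeps the extracted copy out of the working directory.
core_path <- unzip('test.docx', files = 'docProps/core.xml', exdir = tempdir())
doc <- xmlInternalTreeParse(core_path)
# Define the Dublin Core namespace used by the creator element
ns <- c(dc = 'http://purl.org/dc/elements/1.1/')
# Extract the author using an XPath query
author <- xmlValue(getNodeSet(doc, '/*/dc:creator', namespaces = ns)[[1]])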
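The same approach extends to other fields in docProps/core.xml. The sketch below is one way it might be generalised, not a definitive implementation: the function names read_core_xml and office_core_metadata are hypothetical, and the choice of fields assumes the file declares the usual cp, dc and dcterms namespaces. Fields that are missing come back as NA rather than raising an error.

```r
library(XML)

# Namespaces commonly declared in docProps/core.xml (assumed here).
core_ns <- c(
  cp      = 'http://schemas.openxmlformats.org/package/2006/metadata/core-properties',
  dc      = 'http://purl.org/dc/elements/1.1/',
  dcterms = 'http://purl.org/dc/terms/'
)

# Hypothetical helper: parse an already-extracted core.xml file and
# return a named character vector of selected fields (NA when absent).
read_core_xml <- function(core_path) {
  doc <- xmlInternalTreeParse(core_path)
  fields <- c(title            = 'dc:title',
              creator          = 'dc:creator',
              last_modified_by = 'cp:lastModifiedBy',
              created          = 'dcterms:created',
              modified         = 'dcterms:modified')
  sapply(fields, function(tag) {
    nodes <- getNodeSet(doc, paste0('/*/', tag), namespaces = core_ns)
    if (length(nodes) == 0) NA_character_ else xmlValue(nodes[[1]])
  })
}

# Hypothetical wrapper: extract core.xml into a temporary directory, then parse it.
office_core_metadata <- function(path) {
  core_path <- unzip(path, files = 'docProps/core.xml', exdir = tempdir())
  if (length(core_path) == 0) stop('docProps/core.xml not found in ', path)
  read_core_xml(core_path)
}
```

Calling office_core_metadata('test.docx') should then return the title, creator, last editor and timestamps in one go, assuming the file exists and contains those elements.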