New Package “docxtractr” – Easily Extract Tables From Microsoft Word Docs
This is a follow-up to yesterday’s post. The hack and function in that post worked, but they were limited to uniform tables and made you do more work than you had to. So, there’s now a devtools-installable package on GitHub that makes it way easier to get information about the tables in a Word document and extract them, uniform or not.
There are plenty of examples in the GitHub README and also in the package examples. But, I will show the basic functionality here.
The package ships with four example Word documents, but we’ll work with the last one: complex.docx. It has five tables, and the last two have varying numbers of columns and rows.
Let’s read those two in:
complx <- read_docx(system.file("examples/complex.docx", package="docxtractr"))

docx_tbl_count(complx)
#> [1] 5

docx_describe_tbls(complx)
#> Word document [/Library/Frameworks/R.framework/Versions/3.2/Resources/library/docxtractr/examples/complex.docx]
#>
#> Table 1
#>   total cells: 16
#>   row count  : 4
#>   uniform    : likely!
#>   has header : likely! => possibly [This, Is, A, Column]
#>
#> Table 2
#>   total cells: 12
#>   row count  : 4
#>   uniform    : likely!
#>   has header : likely! => possibly [Foo, Bar, Baz]
#>
#> Table 3
#>   total cells: 14
#>   row count  : 7
#>   uniform    : likely!
#>   has header : likely! => possibly [Foo, Bar]
#>
#> Table 4
#>   total cells: 11
#>   row count  : 4
#>   uniform    : unlikely => found differing cell counts (3, 2) across some rows
#>   has header : likely! => possibly [Foo, Bar, Baz]
#>
#> Table 5
#>   total cells: 21
#>   row count  : 7
#>   uniform    : likely!
#>   has header : unlikely

docx_extract_tbl(complx, 4, header=TRUE)
#> Source: local data frame [3 x 3]
#>
#>   Foo  Bar Baz
#> 1  Aa BbCc  NA
#> 2  Dd   Ee  Ff
#> 3  Gg   Hh  ii

docx_extract_tbl(complx, 5, header=TRUE)
#> Source: local data frame [6 x 3]
#>
#>    Foo Bar Baz
#> 1   Aa  Bb  Cc
#> 2   Dd  Ee  Ff
#> 3   Gg  Hh  Ii
#> 4 Jj88  Kk  Ll
#> 5       Uu  Ii
#> 6   Hh  Ii   h
It reads in “uniform” tables properly and will warn you if there is a header marked in Word but not asked for in the extraction.
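If you want every table in one go, here is a minimal sketch that loops over the table count and collects the results in a list. It only uses the functions shown above. The `tbls` name is mine, and passing `header = TRUE` for every table is an assumption that won't suit documents where some tables lack a header row.

```r
library(docxtractr)

# Read the example document that ships with the package
complx <- read_docx(system.file("examples/complex.docx", package = "docxtractr"))

# Extract each table in turn; seq_len() also copes with a document
# that contains no tables (it yields an empty sequence)
tbls <- lapply(seq_len(docx_tbl_count(complx)), function(i) {
  docx_extract_tbl(complx, i, header = TRUE)
})

# One list element per table in the document
length(tbls)
```

Each element can then be inspected on its own, e.g. `tbls[[4]]` for the fourth table.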
Next steps are to allow specifying column types, to try to guess column types (readr has some nice functions for this) and perhaps to return more metadata (if possible).
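To give a rough idea of what type guessing could look like, readr's `type_convert()` re-guesses the types of character columns in a data frame. The sketch below is not part of docxtractr. The `raw` data frame is made up to stand in for an extracted table whose cells all came back as text.

```r
library(readr)

# Hypothetical stand-in for an extracted table: every column is character
raw <- data.frame(
  Foo   = c("Aa", "Dd", "Gg"),
  Count = c("1", "22", "333"),
  Flag  = c("TRUE", "FALSE", "TRUE"),
  stringsAsFactors = FALSE
)

# type_convert() inspects each character column and converts it
# to the type it guesses from the values
typed <- type_convert(raw)

# Show the class of each column after conversion
sapply(typed, class)
```

Columns that don't look like numbers, logicals or dates are left as character, so running this over an extracted table should be low-risk.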
Feature requests & bug reports are most welcome on GitHub.