Site icon R-bloggers

Extracting datasets from excel files in a zipped folder

[This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The title of the post is a bit long, but that’s the problem I was facing this morning: importing dataset from files, online. I mean, it was not a “problem” (since I can always download, and extract manually the files), more a challenge (I should be able to do it in R, directly). The files are located on ressources-actuarielles.net, in a zip file. Those are mortality tables used in French speaking African countries, and I guess that one problem came from special characters, such as “é” or “è”… When you open the zip file, you see a folder

and in that folder, several files that I would like to import

My first code was quite standard, unfortunately, it did not work

> library(xlsx)
> loc.url <- "http://ressources-a...CIMA.zip"
> temp <- tempfile()
> download.file(loc.url,temp)
trying URL 'http://ressources-a...CIMA.zip'
Content type 'application/x-zip' length 503786 bytes (491 Kb)
opened URL
downloaded 491 Kb

> list.files <- unzip(temp,list=TRUE)
> data <- read.xlsx(unz(temp,
+ list.files$Name[1]),sheetIndex=1)
Error in path.expand(file) : invalid 'path' argument

Indeed, there was a problem with the path,

> list.files$Name[1]
[1] "Tables de mortalit‚ CIMA/CIMA F.xlsx"

Hopefully, as usual, @3wen came to rescue me, and suggested the following,

> Sys.setlocale("LC_ALL", "C")
[1] "C"
> loc.url <- "http://ressources-a...CIMA.zip"
> td <- tempdir() 
> tf <- tempfile(tmpdir=td, fileext=".zip") 
> download.file(loc.url, tf) 
trying URL 'http://ressources-a...CIMA.zip' Content type 'application/x-zip' length 503786 bytes (491 Kb) opened URL 
============================================= downloaded 491 Kb   

> fname <- unzip(tf, list=TRUE)$Name[1] 
> unzip(tf, files=fname, exdir=td,
+ overwrite=TRUE) 
> fpath <- file.path(td, fname)

That was it… then @3wen suggested another package to read the xlsx file (that works well on my linux laptop, not on Windows)

> data <- read.xls(fpath, sheet = 1)
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
	LANGUAGE = "fr_CA:fr",
	LC_ALL = (unset),
	LANG = "fr_CA.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
> Sys.setlocale()
[1] ""

> head(data)
  Age Lx..CIMA.F. qx..CIMA.F. dx..CIMA.F.
1   0   1000000.0     0.2849%   2848.5034
2   1    997151.5     0.0387%    385.8644
3   2    996765.6     0.0296%    295.2717
4   3    996470.4     0.0252%    251.5163
5   4    996218.8     0.0216%    214.9381
6   5    996003.9     0.0197%    195.9929

That was more subtle than expected…

To leave a comment for the author, please follow the link and comment on their blog: Freakonometrics » R-english.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.