Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
A few days ago, I published a post entitled extracting datasets from excel files in a zipped folder, because I wanted to use datasets that were online, in some (zipped) excel format. The first difficult part was the folder with a non-standard character (the French é). Because next week I should be using those dataset in a crash course in Gabon (in Africa), I wanted to make sure that everthing will go fine when we will run the code. And discussing @3wen‘s trick on Day 1 was maybe not the best way to explain that R is a very simple tool that should be used for data analysis…
To make thing easier, I did upload the xlsx files on my webpage. I wanted to use the xlsx R package. Unfortunately, on my linux laptop, I have troubles installing that package.
> install.packages("xlsx") Installing package into '/home/arthur/R/x86_64-pc-linux-gnu-library/3.1' (as 'lib' is unspecified) also installing the dependencies 'rJava', 'xlsxjars' trying URL 'http://cran.parentingamerica.com/src/contrib/rJava_0.9-6.tar.gz' Content type 'application/x-gzip' length 567515 bytes (554 Kb) opened URL ================================================== downloaded 554 Kb trying URL 'http://cran.parentingamerica.com/src/contrib/xlsxjars_0.6.1.tar.gz' Content type 'application/x-gzip' length 9477071 bytes (9.0 Mb) opened URL ================================================== downloaded 9.0 Mb trying URL 'http://cran.parentingamerica.com/src/contrib/xlsx_0.5.7.tar.gz' Content type 'application/x-gzip' length 312839 bytes (305 Kb) opened URL ================================================== downloaded 305 Kb ERROR: configuration failed for package 'rJava' Warning messages: 1: In install.packages("xlsx") : installation of package 'rJava' had non-zero exit status 2: In install.packages("xlsx") : installation of package 'xlsxjars' had non-zero exit status 3: In install.packages("xlsx") : installation of package 'xlsx' had non-zero exit status
I did experience similar problems with other packages,
> library(gdata) perl: warning: Setting locale failed. perl: warning: Please check that your locale settings: LANGUAGE = "fr_CA:fr", LC_ALL = (unset), LANG = "fr_CA.UTF-8" are supported and installed on your system. perl: warning: Falling back to the standard locale ("C").
Just to make sure that everything will work just fine on Monday morning, I thought that it might be good to save the dataset in a – standard – csv format.
Here we go,
> CIMAH<-read.table( + "http://freakonometrics.free.fr/CIMAH.csv", + header=TRUE,sep=";") > head(CIMAH) Age Lx..CIMA.H. qx..CIMA.H. dx..CIMA.H. 1 0 1,000,000 0,5368% 5,368 2 1 994,632 0,0730% 726 3 2 993,906 0,0559% 555 4 3 993,351 0,0476% 473 5 4 992,878 0,0407% 404 6 5 992,474 0,0371% 368
Great, I can read the file. But I cannot use it. Commas are used here as decimal and thousands separators (yes, both).I knew about the French decimal separator (which is usually the comma), but usually, French people use a space as a thousands separator (not a comma).
The first step was to drop the comma used as a thousand separator (and convert also that factor as a number)
> dropcomma=Vectorize(function(x){ + as.numeric(paste(unlist(strsplit( + as.character(x),",")),collapse="")) })
The second step was to play with the percentage, to drop the percentage sign, and convert the comma as a decimal separator (the standard dot). I did that in two steps
> commatodot=function(x){ + return(as.numeric(paste(unlist(strsplit( + as.character(x),",")),collapse="."))) } > droppercent=Vectorize(function(x){ + comatodot(paste(unlist(strsplit( + as.character(x),"%")),collapse=""))/100 })
Then, we should use those functions on the three columns,
> CIMAH[,2]=dropcomma(CIMAH[,2]) > CIMAH[,3]=droppercent(CIMAH[,3]) > CIMAH[,4]=dropcomma(CIMAH[,4])
and indeed, it worked,
> head(CIMAH) Age Lx..CIMA.H. qx..CIMA.H. dx..CIMA.H. 1 0 1000000 0.005368 5368 2 1 994632 0.000730 726 3 2 993906 0.000559 555 4 3 993351 0.000476 473 5 4 992878 0.000407 404 6 5 992474 0.000371 368
Numbers are actual numbers, so I can plot some graphs, such that the mortality rate, per age,
> plot(CIMAH$Age,CIMAH$qx..CIMA.H.,log="y")
I could not imagine that it would be that long to read this excel spreadsheet and to get a (simple) graph…
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.