
How can you do a smart job of getting data from the internet?


I’d like to explore my statistical packages’ capabilities for getting data online and loading it into memory, instead of downloading each dataset by hand. In the end, the task turned out to be pretty easy, but it still kept me out of bed for a night trying to find the most efficient way to loop across the files and store them the right way. So, let’s start. You can find one file here, with a list of web addresses where each file we are about to download is located. These files contain all the registered details about the revenues and expenditures of each candidate in the last election in Brazil. That means more than 22 thousand .csv2 files, where each file represents a candidate (i). For this task, I’ll use just the revenue data. Finally, I’m going to show the same steps using R.

require(xlsx)
web <- read.xlsx(file.choose(), 1)  # spreadsheet with the list of URLs
mysites <- web$web
rm(web)  # remove it because I need a lot of memory

# Run this code and relax for three or four hours
big.data <- NULL
for (i in mysites) {
  base <- NULL  # reset, so a failed download isn't counted twice
  try(base <- read.table(i, sep = ";", header = TRUE, as.is = TRUE,
                         fileEncoding = "windows-1252"), TRUE)
  if (!is.null(base)) big.data <- rbind(big.data, base)
}

# ... half a day later
names(big.data)
head(big.data, 10)
tail(big.data, 10)
str(big.data)
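As an aside, rbind() inside a loop copies the growing data frame on every iteration, which gets slow with 22 thousand files. A common alternative is to accumulate each file in a list and bind everything once at the end. Here is a minimal sketch of that variant, assuming the same mysites vector and file format as above:

# Sketch: collect each file in a list, then bind once at the end.
# Assumes `mysites` is the vector of URLs built above.
pieces <- vector("list", length(mysites))
for (k in seq_along(mysites)) {
  base <- try(read.table(mysites[k], sep = ";", header = TRUE, as.is = TRUE,
                         fileEncoding = "windows-1252"), silent = TRUE)
  if (!inherits(base, "try-error")) pieces[[k]] <- base
}
big.data <- do.call(rbind, pieces)  # one bind instead of thousands of copies

Failed downloads simply leave a NULL slot in the list, which rbind() skips over, so the result is the same as in the loop above, just without re-copying the accumulated data on every iteration.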

