Big data problems
I have big data problems.
I need to analyze hundreds of millions of rows of data, and I spent two weeks trying hard to see whether R can handle this. My assessment so far, based on those experiments:
1) R is best for data that fits in a computer's RAM (so get more RAM if you can).
2) R can be used for datasets that don't fit into RAM via the bigmemory and ff packages, which keep the data on disk. However, this technique works well only for datasets smaller than about 15 GB. This is in line with the excellent analysis done by Ryan Rosario (see References); a short bigmemory sketch follows this list.
3) If we need to analyze datasets larger than 15 GB, then SAS, MapReduce, and an RDBMS 🙁 seem like the only options, since they store data on the file system and access it as needed (a sketch of the database route appears below).
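To make point 2 concrete, here is a minimal sketch of the file-backed approach with the bigmemory package. The file names, column type, and summary call are my own assumptions for illustration, not details from the experiments above:

library(bigmemory)    # file-backed matrices
library(biganalytics) # summary functions for big.matrix objects

# Read a large CSV into a file-backed big.matrix: the values live on
# disk in sales.bin, and the R session holds only a small pointer.
x <- read.big.matrix("sales.csv", header = TRUE, type = "double",
                     backingfile = "sales.bin",
                     descriptorfile = "sales.desc")

# Column summaries stream over the disk-backed data instead of
# loading the whole matrix into RAM at once.
colmean(x)

Because the big.matrix object is just a descriptor, the same backing file can be reattached in a later session without re-reading the CSV.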
Since MapReduce implementations are still clumsy and not yet business-friendly, I wonder if it's time to explore commercial analytics tools like SAS for big data analytics.
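The RDBMS option from point 3 can also be driven from R itself. A minimal sketch, assuming the rows have already been loaded into a hypothetical SQLite table called events (the database file, table, and column names are all made up for illustration):

library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), "events.sqlite")

# Push the heavy aggregation into the database; R only ever sees
# the small grouped summary, not the raw rows.
daily <- dbGetQuery(con, "
  SELECT day, COUNT(*) AS n, AVG(amount) AS avg_amount
  FROM events
  GROUP BY day
")

dbDisconnect(con)
head(daily)

Since the GROUP BY runs inside the database engine, R's memory footprint stays small regardless of how large the table grows.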
Can Stata, MATLAB, or Revolution R analyze datasets in the 50–100 GB range effectively?
References
Ryan Rosario, "Taking R to the Limit, Part I: Parallelization in R": http://www.bytemining.com/2010/07/taking-r-to-the-limit-part-i-parallelization-in-r/
Image credit: http://www.austinacl.blogspot.com