Quick notes from Strata NYC 2012
[This article was first published on Revolutions, and kindly contributed to R-bloggers.]
The O'Reilly Strata conferences are always great fun to attend, and this latest installment in New York City is no exception. This one is super-busy, though; the conference has been sold out for weeks, and not just marketing-sold-out: it's fire-department-sold-out. It's non-stop conversations and presentations, and it's tough to move through the hallways in between.
Nonetheless, I thought I'd pause for a couple of minutes and share some of the highlights for me so far.
- Ed Kohlwey and Stephanie Beben gave a three-hour tutorial on the RHadoop project, showing the packed room how to crunch big data. They shared how consulting firm Booz Allen Hamilton uses R and Hadoop for data exploration; to run many tasks in parallel; and to sort, sample and join data. They've also created a very handy VirtualBox VM including R, Hadoop, RHadoop and RStudio (along with demonstration script files), which I hope to be able to post a download link for soon. (A sketch of this style of RHadoop job appears after this list.)
- Stan Humphries from Zillow gave a presentation on how data and statistical analysis drive Zillow's home valuation service. One fascinating tidbit: while Zillow has long used R to fit their valuation model, until recently they recoded the model-scoring algorithm in C++ for use on the production site. The process of re-implementing a new version of the model, validating it, and deploying it used to take nine months. But now that they run R in production via the Amazon cloud, without the need to recode the model in another language, the deployment time for new valuation models is just four weeks.
- Mike Driscoll from Metamarkets shared the technology behind their data stack: node.js and D3 for visualization; R and Scala for analytics; Druid as the data store; and Hadoop and Kafka for ETL. Druid is Metamarkets' home-grown, high-performance data store, which they announced today is now available as open-source software.
- In a similar vein, Cloudera announced the release of Impala, an open-source project two years in the making to bring high-performance real-time analytics to Hadoop.
- And there were even more announcements: Kaggle launched a partnership with EMC to give Greenplum users direct access to Kaggle's roster of data science competitors.
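For readers who haven't tried RHadoop, here is a minimal sketch of the style of job covered in the tutorial mentioned above: a group-and-sum aggregation written as a MapReduce job entirely in R. This is my own illustrative example, not the presenters' code, and it assumes the rmr2 package from the RHadoop project is installed.

```r
library(rmr2)   # MapReduce bindings from the RHadoop project

# Use the local backend for experimentation; on a real cluster you would
# switch to rmr.options(backend = "hadoop").
rmr.options(backend = "local")

# Write a small set of key/value pairs to the (local) DFS:
# 100 random values tagged with group ids 1 through 5.
groups <- sample(1:5, 100, replace = TRUE)
input  <- to.dfs(keyval(groups, rnorm(100)))

# The map step passes the pairs through; the reduce step sums the values
# within each group -- a miniature "group by and aggregate".
sums <- mapreduce(
  input  = input,
  map    = function(k, v) keyval(k, v),
  reduce = function(k, vv) keyval(k, sum(vv))
)

from.dfs(sums)   # bring the per-group sums back into the R session
```

The same script runs unchanged against a real cluster once the backend option points at Hadoop, which is what makes a pre-configured VM like the one Booz Allen Hamilton built such a convenient sandbox.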
It's been a great conference so far, and this is only day one! Looking forward to more great talks and conversations tomorrow.