Amazon AWS Summit 2013
I was fortunate enough to attend the Amazon AWS Summit in NYC and to hear Werner Vogels give the keynote. Here are a few of my thoughts and take-aways from the 2013 Summit. I attended sessions that focused on two products in particular: Redshift and DynamoDB. Amazon AWS seems to be a good option for projects that need a lot of disk space (e.g. up to 1.6 petabytes) or that need to quickly increase system resources (e.g. RAM or CPU) to handle a lot of database writes or on-demand data analysis.
Redshift
This is a new Amazon product that was announced earlier this month, and if it can do what Amazon says it can, it looks like a great option for data warehousing. It will be interesting to see whether industries with strict regulations (e.g. HIPAA, PCI compliance) move over to Amazon. However, some of the Virtual Private Cloud options and the encryption that is provided may be able to satisfy those regulatory requirements.
I have done a fair amount of work on data warehousing but have generally only used Oracle for that work. Some of the benchmarks for Redshift are quite impressive. As seen in the photo I took of one of the presentation slides, they were able to read 2 billion rows in about 6 seconds (12 seconds for aggregate summaries). Compare that to SQL Server, which was manually stopped after 6 hours, and Hive, which took about half an hour. Not too long ago I personally ran ~3 billion rows through a local MySQL server. I don’t have specific benchmarks, but scrubbing and transforming the data took roughly 3 days. Needless to say, after that I moved over to some of the Amazon products and was able to quickly scale up by using more Amazon instances.
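To put those slide figures in perspective, here is a rough back-of-the-envelope sketch in Python. It uses only the approximate numbers quoted above and assumes every system ran the same 2-billion-row read, which the slide as I recall it does not confirm. The variable and function names are my own, and the SQL Server figure is only a lower bound because that run was stopped before it finished.

```python
# Approximate figures quoted from the presentation slide (not my own measurements).
ROWS = 2_000_000_000
REDSHIFT_SCAN_S = 6                 # "about 6 seconds"
REDSHIFT_AGG_S = 12                 # aggregate summaries
HIVE_S = 30 * 60                    # "about half an hour"
SQL_SERVER_MIN_S = 6 * 60 * 60      # stopped after 6 hours: a lower bound only


def rows_per_second(rows, seconds):
    """Average throughput implied by a row count and a wall-clock time."""
    return rows / seconds


def speedup(slow_seconds, fast_seconds):
    """How many times faster the quicker run is than the slower one."""
    return slow_seconds / fast_seconds


if __name__ == "__main__":
    print(f"Redshift scan: {rows_per_second(ROWS, REDSHIFT_SCAN_S):,.0f} rows/s")
    print(f"Redshift aggregate: {rows_per_second(ROWS, REDSHIFT_AGG_S):,.0f} rows/s")
    print(f"Hive: {rows_per_second(ROWS, HIVE_S):,.0f} rows/s")
    print(f"Hive vs. Redshift scan: {speedup(HIVE_S, REDSHIFT_SCAN_S):.0f}x")
    # SQL Server never finished, so this ratio is "at least", not an exact figure.
    print(f"SQL Server vs. Redshift scan: at least {speedup(SQL_SERVER_MIN_S, REDSHIFT_SCAN_S):.0f}x")
```

Taken at face value, the quoted times put Redshift a few hundred times faster than Hive on this workload, but without knowing the cluster sizes and queries involved it is only a rough guide.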
Amazon DynamoDB
I haven’t had the opportunity to use this product, but it looks very promising. It appears to be a good NoSQL alternative to relational databases and a strong competitor to some of the other NoSQL databases. It is proprietary software, but I would be interested in comparing it to Cassandra or other Hadoop-style systems. Some of the libraries, mappers, and mocks are available at http://j.mp/dynamodb-libs.
Summary
From personal experience, I have been able to use R and Hadoop as well as some PHP scripts and Java programs on Amazon instances. The price for using any of these products is very good and is generally a whole lot cheaper than purchasing in-house hardware. It also provides the flexibility to use a wide range of software. It will be interesting to see how well Redshift and DynamoDB perform. Redshift in particular looks very promising.
As an aside, I’m in no way associated with Amazon; I’m simply an occasional user of their products.