Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
This week I had the opportunity to trek up north to Silicon Valley to attend Yahoo’s Hadoop Summit 2010. I love Silicon Valley. The few times I’ve been there the weather was perfect (often warmer than LA), with little to no traffic, no road rage, and people who overall seem friendly and happy. Not to mention there are so many trees it looks like a forest!
The venue was the Hyatt Regency Great America which seemed like a very posh business hotel. Walking into the lobby and seeing a huge crowd of enthusiasts and an usher every two steps was overwhelming!
After being welcomed by the sound of the vuvuzela, we heard some statistics about the growth of Hadoop over the years. Apparently in 2008 there were 600 attendees at Hadoop Summit, and in 2010 attendance grew to 1000. The conference started with three hours of keynotes and sessions that everyone attended in one room. The tone was somewhat corporate in nature, but there were also several gems in there about how Yahoo uses Hadoop:
- A user’s click streams are mapped to latent interests, similar to mapping words to interests in topic models like LSA and LDA. Yahoo uses Hadoop to recompute a user’s profile every five minutes.
- Yahoo Mail receives 40% less spam than Hotmail and 55% less spam than Gmail (this surprised me). Hadoop is used to train the spam detection model.
- Most common tweet: “Yahoo now stores 70PB of data in Hadoop, processing 3PB per day and ingesting 100 billion events of 120 TB.”
Yahoo also discussed Hadoop Security, a feature that integrates Hadoop with Kerberos to provide secure access to, and processing of, business-sensitive data. Hadoop Security is currently in beta in the Yahoo distribution of Hadoop. Perhaps the biggest impact of Hadoop Security is that businesses can now colocate business-sensitive data without worrying about unauthorized prying eyes. Yahoo’s distribution of Hadoop with Security can be downloaded from GitHub.
Yahoo also introduced Oozie, an open-source workflow system for managing Hadoop jobs including HDFS, Pig and MapReduce. Oozie seems similar to Chris Wensel’s (@cwensel) Cascading system. Oozie can be downloaded from GitHub.
Yahoo’s transparency and contributions to the Hadoop and open-source community are very empowering. This quote sums it up well: “Hadoop is the technology behind every click on Yahoo!” In addition to Hadoop, Yahoo has open-sourced a project from the Inktomi days called Traffic Server and the Subversion repository is here. Yahoo also suggested that they will be moving on to a new venture: public cloud computing and storage, which would be a competitor to Amazon Web Services as well as Google Storage and a Google cloud computing solution that may or may not have already been announced (there were plans or rumors of a Google cloud computing service, but I cannot find any references to it).
Surprisingly, most of the focus on cloud computing, particularly Amazon Web Services, was on Elastic MapReduce rather than EC2. Amazon pre-announced several new features.
- Bootstrapping actions allow the user to specify certain operations to occur before the map/reduce process begins. If you want to install some Python package that does not exist on Elastic MapReduce, use a bootstrap action. The speaker used installing a newer version of R as a use case! Bootstrap actions are not new, but new enough to be mentioned.
- Users will be able to shrink or expand the size of the cluster or the HDFS in Elastic MapReduce, making it truly, well, elastic.
- Newer versions of Hadoop (0.20), Pig and Hive will be, or have been, installed.
- Users will now be able to bid for extra capacity at a lower cost using spot instances, which have been available for EC2 for some time.
Cloudera had announcements of its own. Cloudera Desktop, described by its developers as a unified user interface for users and operators of Hadoop clusters, has been open-sourced and renamed HUE. You can download HUE from Cloudera’s GitHub repository. Cloudera will also offer monitoring and configuration support to its enterprise users via Cloudera Enterprise.
Over lunch I met up with Jakob Homan (@blueboxtraveler) from Yahoo and then with Charlie Glommen (@radarcg) from TouchCommerce. It is always great to meet fellow tweeters in person. I must say the lunch was awesome – the best risotto I have ever tasted! Anyway, we then broke off into three tracks: developer, applications, and research. I spent most of my time in the research track. The first three sessions were about working with large graphs, and I wished I had seen them BEFORE struggling so much with a large graph in my Master’s thesis.
First up was Jimmy Lin (@lintool) from University of Maryland. He introduced some large graph design patterns (slides) that can decrease running time by 70%. His method uses message passing, where computations are performed at each vertex and partial results are passed along the edges as messages. The first design pattern was in-mapper combining, which makes local aggregation more efficient. The second design pattern, smarter partitioning, creates more opportunities for local aggregation by using range partitioning instead of hash partitioning. The final design pattern was called “Schimmy”, which was named after the developers and is essentially a merge join between messages and graph structure. More information can be found in Schatz and Lin’s paper. Lin also commented on the fact that current Algorithms courses focus on serial algorithms and that there are many things that one must take into account when developing parallel algorithms. Lin has written a book to introduce students and other users to the MapReduce way of thinking. It is available online for free.
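To make the in-mapper combining idea concrete, here is a minimal sketch in plain Python rather than Hadoop. It is not Lin's code: the toy graph, the PageRank-style "share of rank" message, and all function names are assumptions made for illustration. The point is only that aggregating messages per destination inside the mapper sends fewer messages across the network while the reducer still sees the same totals.

```python
from collections import defaultdict

# Toy adjacency list (hypothetical data): vertex -> list of out-neighbours.
GRAPH = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}


def naive_map(vertices, rank):
    """Emit one message per edge: (destination, share of the source's rank)."""
    for v in vertices:
        share = rank[v] / len(GRAPH[v])
        for dest in GRAPH[v]:
            yield dest, share


def in_mapper_combining_map(vertices, rank):
    """Aggregate messages per destination inside the mapper, then emit once."""
    partial = defaultdict(float)
    for v in vertices:
        share = rank[v] / len(GRAPH[v])
        for dest in GRAPH[v]:
            partial[dest] += share
    yield from partial.items()


def reduce_sum(messages):
    """Sum all messages that arrive at each destination vertex."""
    totals = defaultdict(float)
    for dest, value in messages:
        totals[dest] += value
    return dict(totals)


rank = {v: 1.0 for v in GRAPH}
split = ["a", "b", "c", "d"]  # the input split handled by one mapper

naive_msgs = list(naive_map(split, rank))
combined_msgs = list(in_mapper_combining_map(split, rank))

print(len(naive_msgs), "messages without combining")
print(len(combined_msgs), "messages with in-mapper combining")
print(reduce_sum(naive_msgs) == reduce_sum(combined_msgs))
```

The saving grows with how many edges in a split point at the same destination, which is presumably why range partitioning (keeping nearby vertices in the same split) helps.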
Next up was Christos from the famous Faloutsos family of computer science. His talk was more theoretical (slides) and had less emphasis on Hadoop, but was very interesting. He introduced a design pattern method called EigenSpokes for detecting triangles and connected structure of large graphs. He also discussed some findings from his PEGASUS system. Finally, Sergei Vassilvitskii from Yahoo presented the appropriately titled “XXL Graph Algorithms” talk (slides). He suggested the following approach for working with large graphs:
- Find the connected components.
- Partition the graph (the first map operation).
- Summarize the connectivity in each partition (first reduce operation).
- Combine all of the small summaries (the second map operation).
- Perform second reduce operation.
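The steps above can be sketched in plain Python, taking connected components as the example problem. This is my own illustration and not code from the talk: using a union-find spanning forest as the per-partition "summary" is an assumption on my part, as are the function names and the round-robin partitioning.

```python
from collections import defaultdict


def find(parent, x):
    """Union-find lookup with path halving; unseen vertices become their own root."""
    while parent.setdefault(x, x) != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def spanning_forest(edges):
    """Summary of a partition: keep only edges that connect two separate trees."""
    parent, forest = {}, []
    for u, v in edges:
        ru, rv = find(parent, u), find(parent, v)
        if ru != rv:
            parent[ru] = rv
            forest.append((u, v))
    return forest


def components(edges):
    """Connected components of the graph described by an edge list."""
    parent = {}
    for u, v in edges:
        ru, rv = find(parent, u), find(parent, v)
        if ru != rv:
            parent[ru] = rv
    groups = defaultdict(set)
    for x in list(parent):
        groups[find(parent, x)].add(x)
    return sorted(sorted(g) for g in groups.values())


def two_round_components(edges, num_partitions):
    # Map 1: partition the edge list (round-robin here, purely for illustration).
    partitions = defaultdict(list)
    for i, edge in enumerate(edges):
        partitions[i % num_partitions].append(edge)
    # Reduce 1: summarize the connectivity inside each partition.
    summaries = [spanning_forest(p) for p in partitions.values()]
    # Map 2: send every small summary to a single reducer.
    combined = [edge for summary in summaries for edge in summary]
    # Reduce 2: solve the problem on the (much smaller) combined summary.
    return components(combined)


edges = [(1, 2), (2, 3), (4, 5), (1, 3), (6, 7), (5, 6), (8, 9)]
print(two_round_components(edges, num_partitions=3))
```

A spanning forest has fewer edges than vertices, so each summary stays small no matter how dense its partition is, and that is what lets the second round fit on one machine.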
I also attended a talk on set similarity joins by Chen Li (UC Irvine) where he suggested that minhash does not find all similar items because it is probabilistic (I somewhat disagree, but that’s just the facts). The materials for his research, including source code, are available here. I also attended “Exact Inference in Bayesian Networks using MapReduce” (slides) by Alex Kozlov of Cloudera. At this point, my brain was becoming more inefficient at absorbing knowledge! During this session I finally got to meet Flip Kromer (@mrflip) from Infochimps, Anand Kishore from Yahoo (@semanticvoid) and Pete Skomoroch (@peteskomoroch) from LinkedIn, which led to me also meeting Florian Leibert (@floleibert) from Twitter.
I then decided to switch tracks and I attended a presentation (slides) of Cascalog by developer Nathan Marz (@nathanmarz). Cascalog is a very expressive querying language for Hadoop. It is based on Clojure, a Lisp dialect written on top of the Java Virtual Machine, and its syntax is very similar to Datalog (or Prolog). In some ways, the ease of importing a file and running a job on it seems to parallel R. I can’t wait to dive into this some more! I also attended Jerome Boulon’s (Netflix) talk on Honu, which is a large scale streaming data collection and processing pipeline (slides).
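For readers who have not met minhash, here is a small sketch of why it is called probabilistic. It is not Chen Li's code; the hashing scheme (salted SHA-1 from Python's standard library), the number of hash functions, and the example sets are all assumptions of mine. The fraction of matching signature positions only estimates the Jaccard similarity, so a pair sitting near a similarity threshold can land on the wrong side of it.

```python
import hashlib


def _hash(seed, item):
    """One member of a family of hash functions, selected by an integer seed."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def minhash_signature(items, num_hashes=64):
    """For each hash function, keep the smallest hash value over the set."""
    return [min(_hash(seed, x) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two sets agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity: |intersection| / |union|."""
    return len(a & b) / len(a | b)


a = {"hadoop", "pig", "hive", "oozie"}
b = {"hadoop", "pig", "hive", "hue"}

print("exact:   ", exact_jaccard(a, b))
print("estimate:", estimate_jaccard(minhash_signature(a), minhash_signature(b)))
```

More hash functions tighten the estimate at the cost of longer signatures, but the result remains an estimate, whereas an exact set similarity join guarantees every qualifying pair is returned.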
After a long day, I really enjoyed the free reception: pizza, cornbread, and really good quesadilla slices. I finished the day off eating dinner at Redbrick Pizza while watching the pathetic UCLA vs. U. South Carolina game and wishing I could stay in Silicon Valley longer.
The slides for Hadoop Summit 2010 tracks are now available. Videos should be available in a week.
