ACM Data Mining Camp 2011: Report

Joseph Rickert

11 years ago

[This article was first published on Revolutions, and kindly contributed to R-bloggers.]


(By Joseph Rickert.) In San Jose, topics like big data, map reduce, predictive models, mobile analytics and crowdsourcing draw a crowd even on a Saturday. So it turned out that the ACM Data Mining Camp and "un-conference" was a very "happening" way to spend a Saturday. Over 500 people attended the event at the eBay "Town Hall" on North First Street, and a good number stayed for the entire eleven-hour day (the food was pretty good).

The day started off with Mike Bowles delivering a very accessible two-hour lecture on using map reduce for machine learning, which he and Patricia Hoffman abstracted from classes they teach at the Hacker Dojo and elsewhere. This was the only part of the day for which there was a fee ($35 that went to the ACM); the rest of the day was free (including the food).

In his lecture, Mike went through a number of popular data mining algorithms (canopy clustering, k-means, OLS, support vector machines, etc.), all examples of a class of models called the Statistical Query Model, and showed how they may be implemented as map reduce algorithms for Hadoop.
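To make the idea concrete, here is a minimal sketch of why a model like OLS fits the map reduce pattern: the fit depends on the data only through sums, so each mapper can summarize its own shard and a reducer can add the summaries together. This is not code from the lecture and it does not use Hadoop; the function names (`map_partial_sums`, `reduce_sums`, `fit_line`) and the toy data are assumptions made for illustration, and the example is limited to a single predictor.

```python
from functools import reduce

def map_partial_sums(shard):
    """Mapper (hypothetical): summarize one shard of (x, y) pairs as partial sums."""
    n = sx = sy = sxx = sxy = 0.0
    for x, y in shard:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return (n, sx, sy, sxx, sxy)

def reduce_sums(a, b):
    """Reducer (hypothetical): add two tuples of partial sums element by element."""
    return tuple(u + v for u, v in zip(a, b))

def fit_line(shards):
    """Fit y = slope * x + intercept by least squares from the combined sums."""
    zero = (0.0, 0.0, 0.0, 0.0, 0.0)
    n, sx, sy, sxx, sxy = reduce(reduce_sums, map(map_partial_sums, shards), zero)
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("need at least two distinct x values")
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n
    return slope, intercept

# Toy data on the line y = 2x + 1, split across three "nodes" (illustrative only).
shards = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0)],
    [(3.0, 7.0), (4.0, 9.0)],
]
slope, intercept = fit_line(shards)
print(slope, intercept)
```

Because the reducer only adds numbers, the shards can be processed in any order and on any number of machines; the same structure carries over to the other algorithms on the list whenever their updates can be written as sums over the data.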

Michael Reece, VP of Modeling and Optimization at Quantcast, gave the keynote address: Machine Learning on Big Data for Personalized Internet Advertising. This was a dynamic, high-energy talk in which Michael moved seamlessly between the business of Internet advertising, quantitative techniques and insider observations like:

"Most ads are being shown to the wrong person ... The good news is that the glass is 1% full"

"When you set up a wish list, you will have someone advertising it to you until you buy"

"When you get a free credit rating, your credit score will be stapled to your cookie. Delete your cookie immediately!"

The afternoon was devoted to the "un-conference". In an un-conference anyone who is motivated can get 3 minutes or so to propose a session to the group. There is an immediate show of hands to gauge interest, and the organizers decide which topics will make the cut based on the number of rooms available to hold parallel sessions. Then they schedule the sessions on the fly, doing the best they can to avoid conflicts. If nothing else, an un-conference is a great way to find out what topics are hot. As might be expected, Hadoop was pretty hot with this crowd. Quite a few people attended Antonio Piccolboni's session on integrating R and Hadoop. Antonio described (slides here) the open source package, rmr, for writing Hadoop map reduce functions in R. Other sessions included panel discussions on the context of setting up big data systems (Context is Everything) and the benefits of software platforms distributed over large numbers of processors (Software as a Platform), Bayesian techniques, Mobile Analytics and several more, including one that I proposed on the $20K prize at inside-R for applying R to business applications.

A session on hiring attracted quite a few people. In fact, a minor theme of the camp was the buzz around the number of companies that have multiple open reqs for data scientists. Nearly all of the sponsors who gave five-minute pitches on their respective companies before the keynote speech announced that they had several positions to fill, and at least one vendor referenced Drew Conway's Venn Diagram showing the skills needed.

I enjoyed the entire day, but if I had to pick a favorite session it would be the one eBay's Brian Johnson led on crowdsourcing. Brian started out by locating crowdsourcing in the context of the changing nature of work. He went on to describe the key idea of breaking up complex jobs into micro tasks that anyone can do, gave several examples of the kinds of tasks that people can do better than machines (e.g. matching photographs) and explained the economics behind projects like Amazon's Mechanical Turk. Brian told a touching story about crowdsourcing having provided some opportunity to people living in the hopeless conditions of refugee camps, but he also talked about the potential for a dark side to crowdsourcing.

I am looking forward to Data Mining Camp 2012.

