Apache Spark for Big Analytics
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
by Thomas Dinsmore, Director of Product Management at Revolution Analytics
The emergence of Apache Spark is a key development for Big Analytics in 2013. Spark, an Apache incubator project, is an open source distributed computing framework for advanced analytics in Hadoop. Originally developed as a research project at UC Berkeley's AMPLab, the project achieved incubator status in Apache in June 2013.
Spark seeks to address the critical challenges for advanced analytics in Hadoop. First, Spark is designed to support in-memory processing, so developers can write iterative algorithms without writing out a result set after each pass through the data. This enables true high performance advanced analytics; for techniques like logistic regression, project sponsors report runtimes in Spark 100X faster than what they are able to achieve with MapReduce.
Second, Spark offers an integrated framework for advanced analytics, including a machine learning library (MLLib); a graph engine (GraphX); a streaming analytics engine (Spark Streaming) and a fast interactive query tool (Shark). This eliminates the need to support multiple point solutions, such as Giraph, GraphLab and Tez for graph engines; Storm and S3 for streaming; or Hive and Impala for interactive queries. A single platform simplifies integration, and ensures that users can produce consistent results across different types of analysis.
At Spark's core is an abstraction layer called Resilient Distributed Datasets, or RDDs. RDDs are read-only partitioned collections of records created through deterministic operations on stable data or other RDDs. RDDs include information about data lineage together with instructions for data transformation and (optional) instructions for persistence. They are designed to be fault tolerant, so that if an operation fails it can be reconstructed.
For data sources, Spark works with any file stored in HDFS, or any other storage system supported by Hadoop (including local file systems, Amazon S3, Hypertable and HBase). Hadoop supports text files, SequenceFiles and any other Hadoop InputFormat.
Spark's machine learning library, MLLib, is rapidly growing. In the latest release it includes linear support vector machines and logistic regression for binary classification; linear regression; k-means clustering; and alternating least squares for collaborative filtering. Linear regression, logistic regression and support vector machines are all based on a gradient descent optimization algorithm, with options for L1 and L2 regularization. MLLib is part of a larger machine learning project (MLBase), which includes an API for feature extraction and an optimizer (currently in development with planned release in 2014).
GraphX, Spark's graph engine, is currently in beta. GraphX combines the advantages of data-parallel and graph-parallel systems by efficiently expressing graph computation within the Spark framework. It enables users to interactively load, transform, and compute on massive graphs. Project sponsors report performance comparable to Apache Giraph, but in a fault tolerant environment that is readily integrated with other advanced analytics.
Spark Streaming offers an additional abstraction called discretized streams, or DStreams. DStreams are a continuous sequence of RDDs representing a stream of data; they are created from live incoming data or generated by transforming other DStreams. Spark receives data, divides it into batches, then replicates the batches for fault tolerance and persists them in memory where they are available for mathematical operations.
Currently, Spark supports programming interfaces for Scala, Java and Python. For R users, there is good news: an R interface is in the works and under development by the team at AMPLab; our sources tell us this is expected to be released in the first half of 2014.
There is an active and growing developer community for Spark, as shown in the chart below. In the past six months, more than one hundred developers have contributed to Spark; in the same time period, just three developers actively contributed to Mahout.
Spark developers contributed almost five thousand commits in the past six months, more than all of the other projects shown in the chart combined. In 2013, the Spark project has published seven double-dot releases, including Spark 0.8.1 published on December 19; this latest release includes YARN 2.2 support, high availability mode for cluster management, performance optimizations and improvements to the machine learning library and Python interface.
In a nod to Spark's rapid progress, Cloudera recently announced that it plans to distribute Spark in CDH5, and will partner with Databricks to provide commercial support. This announcement resonated in the analyst community; Derrick Harris of GigaOm, for example, wrote that “Spark is a really big deal for Big Data, and Cloudera gets it.”
Recently, the first Spark Summit attracted more than 450 participants from more than 180 companies. Presentations covered a range of applications such as neuroscience, audience expansion, real-time network optimization and real-time data center management, together with a range of technical topics.
In 2014, Spark's organizers expect to achieve top-level status in Apache. Developers expect to continue adding machine learning features and to simplify implementation. Together with an R interface and commercial support, we can expect continued interest and application for Spark.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.