SparkR: Distributed data frames with Spark and R


R is now integrated with Apache Spark, the open-source cluster computing framework. The Databricks blog announced this week that yesterday's release of Spark 1.4 would include SparkR, “an R package that allows data scientists to analyze large datasets and interactively run jobs on them from the R shell”. The SparkR 1.4 announcement led with the news:  

Spark 1.4 introduces SparkR, an R API for Spark and Spark’s first new language API since PySpark was added in 2012. SparkR is based on Spark’s parallel DataFrame abstraction. Users can create SparkR DataFrames from “local” R data frames, or from any Spark data source such as Hive, HDFS, Parquet or JSON. SparkR DataFrames support all Spark DataFrame operations including aggregation, filtering, grouping, summary statistics, and other analytical functions. They also support mixing in SQL queries, and converting query results to and from DataFrames. Because SparkR uses Spark’s parallel engine underneath, operations take advantage of multiple cores or multiple machines, and can scale to data sizes much larger than standalone R programs.
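
To make that DataFrame/SQL round trip concrete, here is a minimal sketch against the SparkR 1.4 API. It assumes a sparkR shell (where a sqlContext is created for you) and uses R's built-in faithful dataset; the table name and query are purely illustrative:

# Promote a local R data frame (the built-in faithful dataset) to a distributed SparkR DataFrame
df <- createDataFrame(sqlContext, faithful)
# Register it as a temporary table so it can be queried with SQL
registerTempTable(df, "faithful")
# Mix in a SQL query; the result is itself a SparkR DataFrame
long_waits <- sql(sqlContext, "SELECT eruptions, waiting FROM faithful WHERE waiting > 80")
# collect() converts the (small) query result back to a local R data frame
head(collect(long_waits))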

This announcement is great news if you'd like to use the flexibility of the R language to perform fast computations on very large data sets. Unlike map-reduce in Hadoop, Spark is a distributed framework specifically designed for interactive queries and iterative algorithms. The Spark DataFrame abstraction is a tabular data object similar to R's native data.frame, but partitioned across the machines of the cluster. This conceptual similarity lends itself to elegant processing from R, with a syntax similar to dplyr's. The SparkR blog post provides this example, which should seem familiar to dplyr users:

# Download Spark 1.4 from http://spark.apache.org/downloads.html
#
# Download the nyc flights dataset as a CSV from https://s3-us-west-2.amazonaws.com/sparkr-data/nycflights13.csv
# Launch SparkR using
# ./bin/sparkR --packages com.databricks:spark-csv_2.10:1.0.3
# The SparkSQL context should already be created for you as sqlContext
sqlContext
# Java ref type org.apache.spark.sql.SQLContext id 1
# Load the flights CSV file using `read.df`. Note that we use the CSV reader Spark package here.
flights <- read.df(sqlContext, "./nycflights13.csv", "com.databricks.spark.csv", header="true")
# Print the first few rows
head(flights)
# Run a query to print the top 5 most frequent destinations from JFK
jfk_flights <- filter(flights, flights$origin == "JFK")
# Group the flights by destination and aggregate by the number of flights
dest_flights <- agg(group_by(jfk_flights, jfk_flights$dest), count = n(jfk_flights$dest))
# Now sort by the `count` column and print the first few rows
head(arrange(dest_flights, desc(dest_flights$count)))
##   dest count
## 1  LAX 11262
## 2  SFO  8204
## 3  BOS  5898
# Combine the whole query into a single pipeline using magrittr
library(magrittr)
dest_flights <- filter(flights, flights$origin == "JFK") %>%
  group_by(flights$dest) %>%
  summarize(count = n(flights$dest))
arrange(dest_flights, desc(dest_flights$count)) %>% head
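
Note that filter, group_by and arrange above are SparkR's own distributed operations, and head pulls back only the first few rows. To bring a full (and presumably small) result back into an ordinary R data.frame, use collect:

# Bring the aggregated result back to the driver as a plain R data.frame
local_counts <- collect(dest_flights)
class(local_counts)
# [1] "data.frame"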

This new R package will give R users access to many of the benefits of the Spark framework, including the ability to import data from many sources into Spark, where it can be analyzed using the optimized Spark DataFrame system. Spark computations are automatically distributed across all the cores and machines available on the Spark cluster, so this package can be used to analyze terabytes of data using R.
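
Switching data sources is mostly a matter of changing the source argument to read.df. As a sketch of the same pattern against a JSON source (the people.json path below is the small sample file that ships with the Spark distribution; substitute your own HDFS or local path):

# Load a JSON data source into a distributed DataFrame
people <- read.df(sqlContext, "examples/src/main/resources/people.json", "json")
# Inspect the schema Spark inferred from the JSON records
printSchema(people)
# Aggregations run in parallel across the cluster's cores and machines
head(agg(group_by(people, people$age), count = n(people$age)))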

Databricks blog: Announcing SparkR: R on Spark 