< size="2" face="arial,helvetica,sans-serif">< >< size="2" face="arial,helvetica,sans-serif">MapReduce, the heart of Hadoop, is a programming framework that enables massive scalability across servers using data stored in the Hadoop Distributed File System (HDFS). The Oracle R Connector for Hadoop (ORCH) provides access to a Hadoop cluster from R, enabling manipulation of HDFS-resident data and the execution of MapReduce jobs.
Conceptually, MapReduce is similar to a combination of apply operations in R or GROUP BY in Oracle Database: transform elements of a list or table, compute an index, and apply a function to the specified groups. The value of MapReduce in ORCH is the extension beyond a single process to parallel processing using modern architectures: multiple cores, processes, machines, clusters, data appliances, or clouds.
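To make the analogy concrete, here is the same group-wise pattern in plain R (a toy data frame invented for illustration, no ORCH or Hadoop involved): each record is tagged with a key, and an aggregation function is applied per key, much as a reducer would be.

```r
# Toy data: per-flight arrival delays keyed by destination airport.
delays <- data.frame(
  dest  = c("SFO", "JFK", "SFO", "JFK", "SFO"),
  delay = c(10, 25, 5, 40, 12)
)

# tapply plays the role of "group by key, then reduce":
# it splits delay values by dest and applies mean() to each group.
tapply(delays$delay, delays$dest, mean)
# JFK 32.5, SFO 9
```

MapReduce distributes exactly this pattern: the grouping (shuffle and sort) happens across machines, and the per-group function (the reducer) runs in parallel.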
ORCH can be used on the Oracle Big Data Appliance or on non-Oracle Hadoop clusters. R users write mapper and reducer functions in R and execute MapReduce jobs from the R environment using a high-level interface. As such, R users are not required to learn a new language, e.g., Java, or environment, e.g., cluster software and hardware, to work with Hadoop. Moreover, functionality from open source R packages can be used when writing mapper and reducer functions. ORCH also gives R users the ability to test their MapReduce programs locally, using the same function call, before deploying on the Hadoop cluster.
< size="2" face="arial,helvetica,sans-serif">< >< size="2" face="arial,helvetica,sans-serif">
In the following example, we use the ONTIME_S data set typically installed in Oracle Database when Oracle R Enterprise is installed. ONTIME_S is a subset of the airline on-time performance data from the Research and Innovative Technology Administration (RITA), which coordinates U.S. Department of Transportation (DOT) research programs. We're providing a relatively large sample data set (220K rows), but this example could be performed in ORCH on the full data set, which contains 123 million rows and requires 12 GB of disk space. That data set is significantly larger than R can process on its own using a typical laptop with 8 GB of RAM.
< size="2" face="arial,helvetica,sans-serif">
ONTIME_S is a database-resident table with metadata on the R side, represented by an ore.frame object.
< >< size="2">< >< size="2" face="arial,helvetica,sans-serif">> class(ONTIME_S)
[1] "ore.frame"
attr(,"package")
[1] "OREbase"
ORCH includes functions for manipulating HDFS data. Users can move data between HDFS and the file system, R data frames, and Oracle Database tables and views. This next example shows one such function, hdfs.push, which accepts an ore.frame object as its first argument, followed by the name of the key column, and then the name of the file to be used within HDFS.
ontime.dfs_DB <- hdfs.push(ONTIME_S,
                           key='DEST',
                           dfs.name='ontime_DB')
The following R script example illustrates how users can attach to an existing HDFS file object, essentially getting a handle to the HDFS file. Then, using the hadoop.run function in ORCH, we specify the HDFS file handle, followed by the mapper and reducer functions. The mapper function takes the key and value as arguments, which correspond to one row of data at a time from the HDFS block assigned to the mapper. The keyval function in the mapper returns data to Hadoop for further processing by the reducer.
< size="2" face="arial,helvetica,sans-serif">The reducer function receives all the values associated with one key (resulting from the “shuffle and sort” of Hadoop processing). The result of the reducer is also returned to Hadoop using the keyval function. The results of the reducers are consolidated in an HDFS file, which can be obtained using the hdfs.get function.
The following example computes the average arrival delay for flights where the destination is San Francisco Airport (SFO). The mapper selects flights arriving at SFO, and the reducer computes the mean arrival delay.
< face="courier new,courier,monospace">dfs <- hdfs.attach("ontime_DB")< >< >
< size="2" face="arial,helvetica,sans-serif">< face="courier new,courier,monospace">res <- hadoop.run(
dfs,
mapper = function(key, value) {
if (key == ‘SFO’ & !is.na(x$ARRDELAY)) {
keyval(key, value)
}
else {
NULL
}
},
reducer = function(key, values) {
for (x in values) {
sumAD <- sumAD + x$ARRDELAY
count <- count + 1
}
res <- sumAD / count
keyval(key, res)
})
> hdfs.get(res)
  key     val1
1 SFO 17.44828
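For intuition, the same filter-and-average computation can be checked on a small in-memory data frame with base R (toy values invented here, not the actual ONTIME_S data, so the mean differs from the Hadoop output above):

```r
# Toy flight records, including an NA delay that must be filtered out.
ontime <- data.frame(
  DEST     = c("SFO", "SFO", "LAX", "SFO"),
  ARRDELAY = c(12, NA, 30, 20)
)

# Same predicate as the mapper: destination SFO, delay not missing.
sfo <- ontime[ontime$DEST == "SFO" & !is.na(ontime$ARRDELAY), ]

# Same aggregation as the reducer: the mean arrival delay.
mean(sfo$ARRDELAY)
# 16
```

The MapReduce version distributes exactly this logic: the predicate runs in the mappers over HDFS blocks, and the averaging runs in the reducer over the grouped values.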
< size="2" face="arial,helvetica,sans-serif">Oracle R Connector for Hadoop < >< size="2" face="arial,helvetica,sans-serif">is part of the Oracle Big Data Connectors software suite < >< size="2" face="arial,helvetica,sans-serif">and is supported for Oracle Big Data Appliance and Oracle R Enterprise
customers. < >< size="2" face="arial,helvetica,sans-serif">We encourage you
download Oracle software for evaluation from the Oracle Technology
Network. See these links for R-related software: Oracle R Distribution, Oracle R Enterprise, ROracle, Oracle R Connector for Hadoop. We welcome comments and questions on the Oracle R Forum.