Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
According to job-posting trend data, R's growth in the past few months is up significantly, more than an order of magnitude ahead of two close competitors that are in decline: SAS and SPSS.
Not only do internet giants like Google, Twitter, Facebook, and Bing rely on R for daily analysis of big data, but data-savvy companies of all sizes and focuses also turn to R for insight. Companies from the car-sharing startup Uber, to retail giants such as Nordstrom and Loyd’s, to automotive stalwarts like Ford, to government agencies like the FDA, and banking visionaries like Bank of America all use R for statistical analysis.
Part of R’s power lies in the fact that it has been developed by experts, is very comprehensive, and sports serious graphics capabilities. Thanks to its open-source nature, a broad group of users has contributed over 4,800 packages covering topics like econometrics, spatial analysis, bioinformatics, and more. Its adoption is also helped by the fact that it works with virtually every data store you might be interested in.
However, as discussed in a previous blog post, one of the shortcomings of R is that it is not natively equipped for big data. As part of the larger big data ecosystem, Pivotal has created its own package for R that eases its use with Pivotal products such as Greenplum Database, Pivotal HD and HAWQ.
Below, we will provide some background on PivotalR, help you install the system, load data, and explain important data manipulations.
Why PivotalR?
PivotalR lets users of R interact with the Pivotal Greenplum Database, Pivotal HD / HAWQ, and the open-source database PostgreSQL. It does so by providing an interface to operations on tables and views in the database. These operations are almost the same as those of data.frame, R’s class for storing data tables as collections of variables, similar to matrices or lists. Importantly, only a minimal amount of data is transferred between R and the database system, which increases efficiency. Users of R also do not need to learn SQL to operate on objects in the database, because PivotalR handles the translation.
PivotalR also allows users to run the functions of the open-source big-data machine learning package MADlib directly from R.
Getting Started
PivotalR is the interface to the database and MADlib. To use it, we need to install a database, the MADlib package, and PivotalR.
1. Installing the database
Right now, PivotalR supports connections to PostgreSQL, Greenplum Database (GPDB), and Pivotal HD’s HAWQ. Of these supported platforms, GPDB and HAWQ are massively parallel processing (MPP) database systems. Although we cannot run parallel computations on PostgreSQL, it is still very useful: deploying MADlib and PivotalR on PostgreSQL gives the user access to big data that cannot be fully loaded into memory, although it will not get the same speedup as in a distributed environment like GPDB or HAWQ.
2. Installing MADlib
One can get the MADlib installation package and install it on the database.
3. Installing PivotalR
One can either install PivotalR from CRAN or GitHub. The current version on CRAN is 0.1.16.12.
install.packages("PivotalR")
or install it from GitHub
install.packages("devtools")
devtools::install_github("PivotalR", "gopivotal")
The latest PivotalR on GitHub contains an introductory paper, which has been accepted by The R Journal and will appear in the next issue. To view this paper, run
vignette("pivotalr")
or view it on CRAN.
4. Start
Load the library and connect to a database.
library(PivotalR)
db.connect(port = 5333, dbname = "madlib")
## Created a connection to database with ID 2
## [1] 2
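The connection above relies on defaults for everything except the port and database name. A fuller call might look like the following sketch; the host, user, and port values here are placeholders for your own setup, not values from this walkthrough:

```r
library(PivotalR)

## Hypothetical connection parameters -- substitute your own
cid <- db.connect(host = "localhost", port = 5432,
                  dbname = "madlib", user = "gpadmin")

db.list()           # list all open connections and their IDs
db.disconnect(cid)  # close the connection when finished
```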
Loading Data for PivotalR
Before we can begin manipulating data, we need to load a data set. We will use the data set named “abalone” that is included in the package.
# delete("abalone")  # drop any existing copy of the table first
dat <- as.db.data.frame(abalone, "abalone")
## Counting and caching the data table dimension ... 0.022 sec ... done.
## An R object pointing to "abalone" in connection 1 is created !
names(dat) # column names
## [1] "id" "sex" "length" "diameter" "height" "whole"
## [7] "shucked" "viscera" "shell" "rings"
dim(dat) # dimension of the table
## [1] 60 10
dat.mem <- lk(dat, -1)
Here, the second argument of the lk function specifies how many rows are loaded, and -1 means loading the whole result.
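Since lk is the main way to pull results back into R memory, it helps to see both forms side by side. A small sketch, assuming the abalone wrapper dat from above and an open connection:

```r
preview <- lk(dat, 6)   # fetch only the first 6 rows as an R data.frame
full    <- lk(dat, -1)  # -1 fetches every row; be careful with large tables
```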
Data Manipulations in PivotalR
Understanding the Data Structure
PivotalR defines a db.data.frame class. The objects of this class are wrappers of data tables in the database. Almost all data.frame operations can be applied to these objects, which look and behave like data.frame; this is why we call the class db.data.frame. However, operations on a db.data.frame are not evaluated in the database immediately. Instead, a db.Rquery object is generated, which can be used in the same way as db.data.frame objects. One can choose to evaluate a db.Rquery object x at any time by calling lk(x) to look at its results. Both db.data.frame and db.Rquery objects can be used directly with MADlib machine learning function wrappers, making a very powerful combination.
Although a db.data.frame object behaves like a data.frame object, there is a major difference. The data table in the database is essentially a set, which does not have an order on the rows, while R’s own data.frame has an intrinsic row order. For example, in the above code, there is no guarantee that you get the same order of rows if you run the command dat.mem <- lk(dat, -1) multiple times. If you want to fix the order of rows loaded into R memory, you can use
dat.mem <- lk(sort(dat, FALSE, dat$id), -1)
where the sort function produces a db.Rquery object containing the SQL query that sorts the data, and the lk function evaluates it in the database.
1. Extraction & Replacement
PivotalR supports the extraction and replacement syntax of data.frame. For example, we can add a new column named grp, which holds the remainder of id divided by 3:
dat$grp <- dat$id %% 3
At this point, dat becomes a db.Rquery object. The new column grp is derived from the column id; its value is not evaluated yet.
2. Joins
The data.frame function that corresponds to SQL’s join is merge. PivotalR supports the same function on data table wrappers.
merge.self <- merge(dat, dat, by = "id")
names(merge.self)
## [1] "id" "sex_x" "length_x" "diameter_x" "height_x"
## [6] "whole_x" "shucked_x" "viscera_x" "shell_x" "rings_x"
## [11] "grp_x" "sex_y" "length_y" "diameter_y" "height_y"
## [16] "whole_y" "shucked_y" "viscera_y" "shell_y" "rings_y"
## [21] "grp_y"
The merge function joins two tables. If the two tables have columns with the same names, the name clash is automatically taken care of by appending suffixes such as _x and _y, as in the output above.
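When you want an explicit join key instead of the default, merge accepts the familiar by argument. A minimal sketch using the grp column created earlier (an inner self-join on grp, still lazy until lk is called):

```r
dat$grp <- dat$id %% 3            # recreate the grouping column
j <- merge(dat, dat, by = "grp")  # SQL-style inner join on grp
names(j)                          # shared columns get _x / _y suffixes
```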
3. Split-Apply-Combine
x <- by(dat, dat$sex, mean)
lk(x)
## sex id_avg sex_avg length_avg diameter_avg height_avg whole_avg
## 1 F 1996 0.6164 0.4827 0.1675 1.2255
## 2 M 2079 0.5593 0.4370 0.1568 1.0058
## 3 I 2162 0.4378 0.3336 0.1086 0.4559
## shucked_avg viscera_avg shell_avg rings_avg grp_avg
## 1 0.5082 0.27609 0.3735 11.14 0.9545
## 2 0.4370 0.21925 0.2803 10.65 0.7500
## 3 0.2005 0.09906 0.1372 7.50 0.6667
The by function groups the data using the values of dat$sex and applies the function mean to each group. Again, x is not evaluated until we apply the lk function to it to look at the real values.
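Other aggregate wrappers that PivotalR provides can take the place of mean here; sum is one example. A sketch, assuming the same dat wrapper (the result is again lazy until lk is called):

```r
y <- by(dat, dat$sex, sum)  # per-group sums instead of means
lk(y)                       # evaluates a GROUP BY query in the database
```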
4. Categorical variables
PivotalR also supports basic operations on categorical variables. The factors can be used in MADlib function wrappers.
dat$grp <- as.factor(dat$grp)
fit <- madlib.lm(rings ~ length + diameter + shell + sex + grp, data = dat)
fit
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.6312 2.3921 2.77 0.0077 **
## length -8.5634 15.5409 -0.55 0.5840
## diameter 13.3174 20.3185 0.66 0.5151
## shell 9.0381 4.6892 1.93 0.0594 .
## sex.I -1.0243 1.0017 -1.02 0.3112
## sex.M 0.5216 0.8055 0.65 0.5201
## grp.1 -0.1991 0.8155 -0.24 0.8081
## grp.2 0.0144 0.7912 0.02 0.9855
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## R-squared: 0.4505
## Condition Number: 110.5
Here both sex and grp are categorical variables, and madlib.lm is the wrapper function for MADlib’s linear regression. The function automatically takes care of the categorical variables.
5. Columns that have array values
## Create a column with array values
dat$arr <- db.array(dat[, 3:10])
lk(dat, 10)
## id sex length diameter height whole shucked viscera shell rings grp
## 1 3151 F 0.655 0.505 0.165 1.3670 0.5835 0.3515 0.3960 10 1
## 2 2026 F 0.550 0.470 0.150 0.9205 0.3810 0.2435 0.2675 10 1
## 3 3751 I 0.435 0.375 0.110 0.4155 0.1700 0.0760 0.1450 8 1
## 4 720 I 0.150 0.100 0.025 0.0150 0.0045 0.0040 0.0050 2 0
## 5 1635 F 0.575 0.470 0.155 1.1160 0.5090 0.2380 0.3400 10 0
## 6 2648 I 0.500 0.390 0.125 0.5830 0.2940 0.1320 0.1605 8 2
## 7 1796 F 0.580 0.430 0.170 1.4800 0.6535 0.3240 0.4155 10 2
## 8 209 F 0.525 0.415 0.170 0.8325 0.2755 0.1685 0.3100 13 2
## 9 1451 I 0.455 0.335 0.135 0.5010 0.2740 0.0995 0.1065 7 2
## 10 1108 I 0.510 0.380 0.115 0.5155 0.2150 0.1135 0.1660 8 1
## arr.1 arr.2 arr.3 arr.4 arr.5 arr.6 arr.7 arr.8
## 1 0.6550 0.5050 0.1650 1.3670 0.5835 0.3515 0.3960 10.0000
## 2 0.5500 0.4700 0.1500 0.9205 0.3810 0.2435 0.2675 10.0000
## 3 0.4350 0.3750 0.1100 0.4155 0.1700 0.0760 0.1450 8.0000
## 4 0.1500 0.1000 0.0250 0.0150 0.0045 0.0040 0.0050 2.0000
## 5 0.5750 0.4700 0.1550 1.1160 0.5090 0.2380 0.3400 10.0000
## 6 0.5000 0.3900 0.1250 0.5830 0.2940 0.1320 0.1605 8.0000
## 7 0.5800 0.4300 0.1700 1.4800 0.6535 0.3240 0.4155 10.0000
## 8 0.5250 0.4150 0.1700 0.8325 0.2755 0.1685 0.3100 13.0000
## 9 0.4550 0.3350 0.1350 0.5010 0.2740 0.0995 0.1065 7.0000
## 10 0.5100 0.3800 0.1150 0.5155 0.2150 0.1135 0.1660 8.0000
Sometimes it is very useful to construct a column whose value is an array. Here, we create a new column arr, whose values are arrays of numbers, and load 10 rows of the data into memory as a data.frame. This data.frame has a column arr, which is an array, and its elements are displayed as arr.1, arr.2, ..., arr.8. The equivalent SQL query for constructing arr is SELECT ARRAY[length, diameter, height, whole, shucked, viscera, shell, rings] FROM abalone. The syntax 3:10 makes the selection of multiple columns very easy. If you had 1000 columns and wanted to construct an array using them, this R syntax would be very handy.
Computations that involve dates, times, and time intervals in the database are also supported in PivotalR, as are regular expressions.
Get Started with PivotalR
Now that you know how to install PivotalR with MADlib and perform some basic functions with machine learning or basic data table manipulations, it’s time to get started and learn more.
- Attend the author’s upcoming talk at the useR! 2014 conference, on July 1st from 16:00–17:30 in the business track of Session 3.
- Install PostgreSQL, MADlib, and PivotalR
- Read other articles from this blog about PivotalR, MADlib, Pivotal Greenplum DB, Pivotal HD, or data science
- Check out more product information, documentation, or download Pivotal Greenplum DB and Pivotal HD/HAWQ