Big Data analytics with RevoScaleR Exercises-2

[This article was first published on R-exercises, and kindly contributed to R-bloggers].


In the last set of exercises, you saw the basic functionality of RevoScaleR. In this exercise set we will explore RevoScaleR further.
Get the credit card fraud data set from Revolution Analytics and let's get started.
Answers to the exercises are available here. Please check the documentation before starting this exercise set.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1
RevoScaleR provides an option to convert a data frame into an XDF file, which you might need when storing temporary data frames that you create during analysis work.
Now create an XDF file from the airquality dataset.

Exercise 2
In the previous set of exercises you saw rxHistogram briefly. Now we will see how to get meaningful information from a large dataset with a visualization.
Create a scatterplot from the XDF file for Balance vs. numTrans, restricted to rows where numTrans is greater than 50, creditline is greater than 50, and fraudrisk is 1.

Exercise 3
A good thing about RevoScaleR is that it comes with a sample data directory. Find the sample data directory using rxGetOption,
then take claims.txt and convert it into an XDF file. You have already used rxImport, which would work, but for plain text data there is a highly optimized method in RevoScaleR. Please use that.

Exercise 4
If you check the sample data directory, you can see that there are 10 CSV files named mortDefaultSmall2000.csv and so on. rxImport can create a single XDF by appending them row-wise.
Please create a new XDF that contains all the data from mortDefaultSmall2000.csv to mortDefaultSmall2009.csv.
You can write a loop or do it in a functional way. If you have followed my exercises on functional programming (1 & 2), I hope you know how to achieve this without a loop.
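
The loop-free idea can be illustrated in base R before you try it with rxImport. The sketch below is not the RevoScaleR solution: it writes three small made-up CSV files to a temporary directory (the file names and columns are hypothetical) and stacks them row-wise with lapply and do.call.

```r
# Hypothetical yearly files in a temporary directory; not the real sample data.
tmpDir <- tempfile("csvParts")
dir.create(tmpDir)
years <- 2000:2002
files <- file.path(tmpDir, sprintf("part%d.csv", years))

# Write one small CSV per year without an explicit loop.
invisible(Map(function(path, year) {
  write.csv(data.frame(year = year, value = 1:3), path, row.names = FALSE)
}, files, years))

# Read every file and stack the results row-wise.
combined <- do.call(rbind, lapply(files, read.csv))

nrow(combined)          # 3 files x 3 rows each
unique(combined$year)   # one entry per source file
```

With an XDF target the same pattern applies, except that each step appends to a file on disk instead of growing a data frame in memory.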

Exercise 5
We will see how rxDataStep can create new variables with complex calculations. On the credit card fraud XDF, create a column domTrans, which is numTrans - numIntlTrans, and create another
column, balance per domTran, which is Balance/domTrans. This exercise will show you how to use temporary variables to create more complex new columns from existing columns.

Exercise 6
An important feature of rxDataStep is transform functions. One thing to remember is that a transform function sees one chunk of data at a time, so if you need any value computed over the whole dataset, you must create it before the transform function, not within it.
Now create the z-score of balance in the credit card fraud XDF file, which you already created in the first set of exercises.
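
To see why whole-dataset values must be computed up front, here is a minimal base R sketch that imitates chunked processing. It does not use RevoScaleR: the balance vector is made-up data, the 250-row chunk size is arbitrary, and zTransform is a hypothetical helper standing in for a transform function.

```r
# Made-up stand-in for the balance column; not the real credit card data.
set.seed(42)
balance <- round(runif(1000, min = 0, max = 40000))

# Step 1: compute the whole-dataset statistics BEFORE the transform runs.
balanceMean <- mean(balance)
balanceSd <- sd(balance)

# Step 2: a transform that only ever sees one chunk at a time
# and receives the global statistics as arguments.
zTransform <- function(chunk, center, scale) {
  (chunk - center) / scale
}

# Imitate chunked processing: blocks of 250 rows, handled independently.
chunks <- split(balance, ceiling(seq_along(balance) / 250))
zCorrect <- unlist(
  lapply(chunks, zTransform, center = balanceMean, scale = balanceSd),
  use.names = FALSE
)

# The mistake: computing mean and sd inside the transform, per chunk.
zWrong <- unlist(
  lapply(chunks, function(chunk) (chunk - mean(chunk)) / sd(chunk)),
  use.names = FALSE
)

# Compare both versions against the z-score of the full column.
all.equal(zCorrect, (balance - balanceMean) / balanceSd)
isTRUE(all.equal(zWrong, zCorrect))
```

The per-chunk version standardizes each block against its own mean and standard deviation, so its values differ from the true z-scores even though the code runs without error.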

The next 3 exercises will give you a basic idea of how to split an XDF into training and test sets, and we will also look a bit deeper into rxLinMod. Please remember that the goal
of these exercises is not to predict but to make you aware of a few important details of modelling with RevoScaleR.

Exercise 7
Split the credit card fraud XDF into two random parts, with 25 percent as the test data and 75 percent as the training data.
Hint: you need to use splitByFactor and create the factor using transforms. Use a seed to make this reproducible.
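
The idea behind the hint, a random factor column that decides which part each row goes to, can be sketched in base R. This is only an illustration on a made-up data frame: the column name splitVar, the level names, and the row count are assumptions, and split() stands in for the RevoScaleR splitting step.

```r
# Fix the seed so the random assignment is reproducible.
set.seed(123)

# Made-up data frame; only the row count matters here.
n <- 10000
toy <- data.frame(id = seq_len(n))

# Assign each row to "train" or "test" with probabilities 0.75 and 0.25.
toy$splitVar <- factor(
  sample(c("train", "test"), size = n, replace = TRUE, prob = c(0.75, 0.25)),
  levels = c("train", "test")
)

# Split the data by the factor and check the resulting proportions.
parts <- split(toy, toy$splitVar)
sapply(parts, nrow) / n
```

Because each row is assigned independently, the proportions come out close to 75/25 rather than exactly 75/25.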

Exercise 8
In the last set you used rxLinMod. Use the same expression, but pass cube = TRUE as a parameter. You may need to tweak it to make it work, as the first term in the formula must be categorical when using cube. Can you see the difference between the two models?

Exercise 9
How do you define a linear regression to analyze whether fraudrisk depends only on the interaction of gender and balance? And how do you define it if you want to check fraudrisk's dependency on the interaction as well as on their individual contributions?
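
As a reminder of how base R formulas distinguish these two cases, here is a small sketch using lm on made-up data (the toy data frame and its values are hypothetical). It shows standard R formula semantics only; whether rxLinMod treats every operator identically is worth checking against the RevoScaleR formula documentation.

```r
# Made-up data with the same column names as the exercise.
set.seed(1)
toy <- data.frame(
  gender = factor(sample(c("Male", "Female"), 200, replace = TRUE)),
  balance = runif(200, min = 0, max = 40000)
)
toy$fraudRisk <- rbinom(200, size = 1, prob = 0.1)

# ":" keeps only the interaction term.
interactionOnly <- lm(fraudRisk ~ gender:balance, data = toy)

# "*" expands to both main effects plus their interaction.
fullModel <- lm(fraudRisk ~ gender * balance, data = toy)

# The same full model written out explicitly.
explicitModel <- lm(fraudRisk ~ gender + balance + gender:balance, data = toy)

attr(terms(interactionOnly), "term.labels")
attr(terms(fullModel), "term.labels")
```

Inspecting the term labels is a quick way to confirm which terms a formula actually expands to before fitting it on a large file.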

Exercise 10
You might want to check fraudrisk for different segments of balance. Create 5 different segments, 0-10k, 10k-20k and so on, and check the linear regression result.
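
One way to build such segments in base R is cut(). The sketch below uses a handful of made-up balances and assumes the five segments cover 0 to 50k with the lower bound of each interval included. The label strings are illustrative.

```r
# Made-up balances; not taken from the credit card data.
balance <- c(500, 9999, 10000, 15250, 27000, 39000, 49999)

# Five 10k-wide segments from 0 to 50k (the 50k upper bound is an assumption).
breaks <- seq(0, 50000, by = 10000)
labels <- c("0-10k", "10k-20k", "20k-30k", "30k-40k", "40k-50k")

# right = FALSE makes each interval closed on the left: [0, 10k), [10k, 20k), ...
balanceSegment <- cut(balance, breaks = breaks, labels = labels, right = FALSE)

table(balanceSegment)
```

The resulting factor can then be used as a categorical predictor in the regression formula.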

Related exercise sets:

  1. Big Data analytics with RevoScaleR Exercises
  2. Big Data Manipulation in R Exercises
  3. Density-Based Clustering Exercises
