Welcome to the last part of the series! In the previous part I discussed solutions to the questions raised in the first part. In this part, we will implement the whole scenario using R and MySQL together and see how we can process big data (a computationally heavy job) with limited resources.
Let us recall those questions and summarize their answers to get started with the implementation.
Question-1: How do we optimize our predictive analysis computation for big data when we have limited computational resources?
Answer – Modify the gradient descent algorithm so that it processes the data in chunks during the convergence process (see the equations below).
Question-2: What can be done to process a large dataset with limited memory?
Answer – Use MySQL to fetch the data in chunks from the database and feed each chunk to the optimization algorithm (gradient descent) to build the predictive model on top of it.
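To see why chunking works, write down the batch gradient descent update in the usual notation (with $h_\theta(x) = \theta^T x$ and $m$ the number of rows):

$$\theta_j := \theta_j - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

The sum over all $m$ rows decomposes over $C$ chunks,

$$\sum_{i=1}^{m} (\cdot) = \sum_{c=1}^{C} \; \sum_{i \in \text{chunk}_c} (\cdot),$$

so only one chunk needs to sit in memory while its partial sum is added to a running total.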
For learning purposes, I am just presenting the idea of the implementation here; the details will differ from scenario to scenario. Assume that we have a dataset stored in a database with the following variables for building the predictive model:
y : Response variable
x1 : Predictor-1
x2 : Predictor-2
We want to fit a linear regression model to this data. There are more than 10,000,000,000 rows in the database, and we can process only 10,000,000 at a time with the given memory.
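In this setting the hypothesis is simply

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2,$$

so $\theta = (\theta_0, \theta_1, \theta_2)$ holds exactly the three coefficients the code below will estimate.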
Now let us look at the R code for connecting to the database:
#Connecting to the database

#Load library (install the RJDBC package if it is not installed)
library(RJDBC)

#Set driver
#Make sure you have the jar for the MySQL/Java connector placed in the working directory
drv <- JDBC("com.mysql.jdbc.Driver","mysql-connector-java-5.1.18.jar",identifier.quote="`")

#Establish connection (database name is test)
conn <- dbConnect(drv, "jdbc:mysql://localhost/test", "root", "root")
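If the connection succeeded, a quick sanity check (optional) is to list the tables visible through it:

#Optional sanity check: the table "data" should appear in the list
dbListTables(conn)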
Obtain the total number of rows from the table “data”, where the data is actually stored:
#Total rows
totalRows <- dbGetQuery(conn, "select count(*) as total from data")
Set the chunk size and calculate the total number of loops required to process the whole dataset:
#Set chunk size
library(gmp)
chunkSize <- as.bigz(10000000)

#Total number of loops
totalLoops <- floor(totalRows$total/as.numeric(chunkSize))
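To check the arithmetic: if the table held exactly 10,000,000,000 rows, totalLoops would be floor(1e10 / 1e7) = 1,000. The chunk loop below runs from i = 0 to totalLoops, so the final pass picks up any remainder rows (and simply fetches an empty result when the division is exact).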
Now set the parameters for the gradient descent algorithm:
m <- totalRows$total

#Column names of the table (returned in order: y, x1, x2)
vars <- dbGetQuery(conn, "SELECT COLUMN_NAME FROM information_schema.columns WHERE table_name = 'data' ORDER BY ordinal_position")

#One coefficient per parameter; nrow(vars) is 3 here (y, x1, x2),
#which matches the intercept plus the two predictors
theta <- numeric(nrow(vars))

#Learning rate
alpha <- 0.01
Now it’s time to implement the loop structure defined in part two:
for(j in 1:50) {
  #Reset the gradient accumulator for this pass over the data
  grad_n <- numeric(length(theta))
  for(i in 0:totalLoops) {
    #MySQL's LIMIT offset is zero-based, so chunk i starts at i*chunkSize
    offset <- i*chunkSize
    query <- paste("select * from data limit ",offset,",",chunkSize,sep="")
    data_compute <- dbGetQuery(conn,query)
    #Design matrix: intercept column plus the two predictors
    x_mat <- cbind(rep(1,nrow(data_compute)),data_compute$x1,data_compute$x2)
    y_mat <- data_compute$y
    #Add this chunk's contribution to the gradient
    grad_n <- grad_n + t(x_mat) %*% ((x_mat%*%theta)-y_mat)
  }
  #Batch gradient descent update
  theta <- theta - alpha*(1/m)*grad_n
}
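One optional refinement, not part of the original series: rather than hard-coding 50 outer iterations, you can track the cost J(θ) on a manageable sample after each iteration and stop once it levels off. A minimal sketch (the 100,000-row sample size is an arbitrary choice):

#Optional: monitor convergence on a fixed sample instead of trusting 50 iterations
sample_data <- dbGetQuery(conn, "select * from data limit 100000")
x_s <- cbind(rep(1,nrow(sample_data)), sample_data$x1, sample_data$x2)
y_s <- sample_data$y

#Least-squares cost on the sample
cost <- function(theta) sum((x_s %*% theta - y_s)^2)/(2*nrow(x_s))
cost(theta) #call after each outer iteration and watch the value level off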
Once this whole process finishes, we will have the best-fitting theta parameters, which are the outcome of our predictive-model building with linear regression. The process takes some time, but it works through the entire dataset and performs all the required calculations.
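As a usage sketch (the new predictor values below are made up for illustration), predictions then come straight from the fitted coefficients:

#Predict the response for new predictor values (hypothetical numbers)
x1_new <- 2.5
x2_new <- 7.1
y_hat <- theta[1] + theta[2]*x1_new + theta[3]*x2_new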
Using this implementation structure, we are able to process a large dataset and fit a regression model to it within limited memory. Here I want you to think about how we could make this structure run in parallel if we had multiple computational units. Share your thoughts in the comments section.
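To seed that discussion, here is one possible direction (a sketch only, assuming a Unix-alike machine and the base parallel package; a JDBC connection cannot be shared across forked workers, so each worker opens its own):

library(parallel)

#Gradient contribution of chunk i; each worker opens its own connection
chunkGradient <- function(i, theta, chunkSize) {
  drv <- JDBC("com.mysql.jdbc.Driver","mysql-connector-java-5.1.18.jar",identifier.quote="`")
  conn <- dbConnect(drv, "jdbc:mysql://localhost/test", "root", "root")
  on.exit(dbDisconnect(conn))
  query <- paste("select * from data limit ",i*chunkSize,",",chunkSize,sep="")
  d <- dbGetQuery(conn, query)
  x_mat <- cbind(rep(1,nrow(d)), d$x1, d$x2)
  t(x_mat) %*% ((x_mat %*% theta) - d$y)
}

#One gradient descent step with the chunks spread over 4 cores
partials <- mclapply(0:totalLoops, chunkGradient, theta=theta, chunkSize=chunkSize, mc.cores=4)
theta <- theta - alpha*(1/m)*Reduce(`+`, partials)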
I hope you liked reading the post and understood the whole concept.
Finally, thanks for reading the post. If you liked it, don’t forget to share it (the social sharing panel is on the left side).
Till then, stay tuned and enjoy reading the other posts on R!