Recently I’ve been doing a lot of work with predictive models using logistic regression. Logistic regression is great for estimating the probability of a binary target (dependent) variable, and R is a great tool for the task. Oftentimes I will use the base function glm to develop a model. Yet there are times, due to hardware or software memory restrictions, when the usual glm function is not enough to get the job done.
A great alternative for logistic regression on big data is the biglm package. biglm performs the same regression optimization but processes the data in “chunks” at a time, which allows R to perform calculations on smaller subsets without requiring large memory allocations. biglm also has an interesting option: it can perform calculations not only on imported data frames and text files but also on database connections. This is where the helpful RODBC package comes to the aid.
I have been looking all over the R support lists and blogs in hopes of finding a good tutorial that uses biglm with RODBC. I was not successful, but I was able to work out how to do it myself.
The first step is to establish an ODBC source for the database. In this example I am using a Windows environment and connecting to MS SQL Server. An ODBC data source must first be set up on the computer, usually through the Windows Control Panel. Once that is done, RODBC can be used to establish a connection. My example uses an ODBC data source name (DSN) called “sqlserver”.
library(RODBC)

# Open a connection to the ODBC data source; note the DSN is a quoted string
myconn <- odbcConnect("sqlserver")
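If the connection fails, odbcConnect returns -1 with a warning rather than a connection object, so it is worth checking before going further (a small defensive sketch, not part of the original workflow):

# A successful connection is an object of class "RODBC"
if (!inherits(myconn, "RODBC")) stop("Could not connect to DSN 'sqlserver'")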
Now an ODBC connection object is established. Queries can be submitted to SQL Server via the sqlQuery function, which is what we will use as the data source. The SQL script can be a typical select statement.
sqlqry <- "select myvars, targetvar from mytable"
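Before fitting, it can help to sanity-check the query on a handful of rows. This is just a quick sketch; the table and column names are the placeholders used above, and top is T-SQL syntax specific to SQL Server:

# Pull a few rows to verify the query runs and the column types look right
head(sqlQuery(myconn, "select top 10 myvars, targetvar from mytable"))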
Next, use the bigglm function to perform the logistic regression.
library(biglm)

# Fit the logistic regression on the query result, in chunks of 100 rows,
# with at most 10 Fisher scoring iterations
fit <- bigglm(targetvar ~ myvars, data = sqlQuery(myconn, sqlqry),
              family = binomial(), chunksize = 100, maxit = 10)
summary(fit)
The data is pulled from SQL Server via the sqlQuery function from the RODBC package, and bigglm recognizes the result as a data frame. The chunksize argument specifies the number of rows to process at a time, and maxit sets the maximum number of Fisher scoring iterations.
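One caveat: calling sqlQuery this way pulls the entire result set into memory before bigglm starts chunking it. If the data truly will not fit, biglm also provides an RODBC method for bigglm that streams rows from a database table chunk by chunk through the connection itself. A minimal sketch, assuming the data sit in a table named mytable:

# Pass the RODBC connection as the data argument; bigglm fetches each chunk
# from the named table with its own queries, so the full table never loads
fit2 <- bigglm(targetvar ~ myvars, data = myconn, family = binomial(),
               tablename = "mytable", chunksize = 100)
summary(fit2)

# Close the connection when finished
odbcClose(myconn)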