Articles by statcompute

R and MongoDB

June 7, 2013 | statcompute

MongoDB is a document-based NoSQL database. Unlike a relational database, which stores data in tables with rigid schemas, MongoDB stores data in documents with dynamic schemas. In the demonstration below, I am going to show how to extract data from MongoDB with R. Before starting the R session, we ... [Read more...]

Grid Search for Free Parameters with Parallel Computing

June 1, 2013 | statcompute

In my previous post (http://statcompute.wordpress.com/2013/05/25/test-drive-of-parallel-computing-with-r) on 05/25/2013, I demonstrated the power of parallel computing with various R packages. However, in the real world, it is not straightforward to use these powerful tools in our day-to-day computing tasks without carefully formulating the problem. In the example below, ... [Read more...]

Rmagic, A Handy Interface Bridging Python and R

May 31, 2013 | statcompute

Rmagic (http://ipython.org/ipython-doc/dev/config/extensions/rmagic.html) is the IPython extension that uses rpy2 in the back end and provides a convenient interface for accessing R from IPython. Compared with the generic use of rpy2, the rmagic extension allows users to exchange objects between IPython and R in a ... [Read more...]

Import All Text Files in A Folder with Parallel Execution

May 26, 2013 | statcompute

Sometimes, we might need to import all files, e.g. *.txt, with the same data layout in a folder without knowing each file name and then combine all pieces together. With the old method, we can use lapply() and do.call() functions to accomplish the task. However, when there are ... [Read more...]
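The lapply() and do.call() approach mentioned in this excerpt can be sketched roughly as follows. This is a minimal illustration and not the post's own code: the temporary folder, file names, and column names are all made up for the example, and the parallel variant the post goes on to discuss is not shown.

```r
# Create a temporary folder with three small files sharing one layout
# (hypothetical data, for illustration only).
dir_path <- file.path(tempdir(), "txt_import_demo")
unlink(dir_path, recursive = TRUE)
dir.create(dir_path)
for (i in 1:3) {
  part <- data.frame(id = (1:2) + 2 * (i - 1), value = c(i, i * 10))
  write.table(part, file.path(dir_path, paste0("part", i, ".txt")),
              sep = "\t", row.names = FALSE, quote = FALSE)
}

# List every *.txt file without knowing the file names in advance.
files <- list.files(dir_path, pattern = "\\.txt$", full.names = TRUE)

# Read each file into a data frame, then stack all pieces together.
pieces <- lapply(files, read.table, header = TRUE, sep = "\t")
combined <- do.call(rbind, pieces)
print(combined)
```

The do.call(rbind, ...) step assumes every file has identical column names and types, which matches the "same data layout" condition stated above.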

Test Drive of Parallel Computing with R

May 25, 2013 | statcompute

Today, I did a test run of parallel computing with the snow and multicore packages in R and compared the parallelism with the single-threaded lapply() function. In the test code below, a data.frame with 20M rows is simulated in an Ubuntu VM with an 8-core CPU and 10 GB of memory. As the ... [Read more...]

Conversion between Factor and Dummies in R

May 18, 2013 | statcompute

[This article was first published on Yet Another Blog in Statistical Computing » S+/R, and kindly contributed to R-bloggers].

data(iris)
str(iris)
# OUTPUT:
# 'data.frame': 150 obs. of 5 variables:
# $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
# $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
# $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
# $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
# $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

### CONVERT THE FACTOR TO DUMMIES ###
library(caret)
dummies <- predict(dummyVars(~ Species, data = iris), newdata = iris)
head(dummies, n = 3)
# OUTPUT:
#   Species.setosa Species.versicolor Species.virginica
# 1              1                  0                 0
# 2              1                  0                 0
# 3              1                  0                 0

### CONVERT DUMMIES TO THE FACTOR ###
header <- unlist(strsplit(colnames(dummies), '[.]'))[2 * (1:ncol(dummies))]
species <- [...]

[Read more...]
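The excerpt above stops before the reverse conversion is shown. As a rough base-R sketch of the same round trip, and not the post's own code, the snippet below swaps in model.matrix() for caret's dummyVars() and uses max.col() to map each row of indicators back to a level. The variable name levels_found is hypothetical.

```r
data(iris)

# One indicator column per level; "- 1" drops the intercept so that
# no reference level is omitted.
dummies <- model.matrix(~ Species - 1, data = iris)

# Column names look like "Speciessetosa"; strip the variable prefix.
levels_found <- sub("^Species", "", colnames(dummies))

# max.col() returns, for each row, the index of the column holding the 1.
species <- factor(levels_found[max.col(dummies)], levels = levels_found)

print(table(species))
```

This relies on each row containing exactly one 1, which holds for indicators generated from a single factor.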

A Prototype of Monotonic Binning Algorithm with R

May 4, 2013 | statcompute

I’ve been asked many times if I have a piece of R code implementing the monotonic binning algorithm, similar to the one that I developed with SAS (http://statcompute.wordpress.com/2012/06/10/a-sas-macro-implementing-monotonic-woe-transformation-in-scorecard-development) and with Python (http://statcompute.wordpress.com/2012/12/08/monotonic-binning-with-python). Today, I finally had time to draft a quick ... [Read more...]

Disaggregating Annual Losses into Each Quarter

April 23, 2013 | statcompute

In loss forecasting, it is often necessary to disaggregate annual losses into each quarter. The simplest method to convert a low-frequency time series to a high-frequency one is interpolation, such as the one implemented in the EXPAND procedure of SAS/ETS. In the example below, there is a series of annual ... [Read more...]

A Grid Search for The Optimal Setting in Feed-Forward Neural Networks

February 3, 2013 | statcompute

The feed-forward neural network is a very powerful classification model in the machine learning context. Since the goodness-of-fit of a neural network is largely determined by the model complexity, it is very tempting for a modeler to over-parameterize the neural network by using too many hidden layers and/or hidden ... [Read more...]

Another Benchmark for Joining Two Data Frames

January 29, 2013 | statcompute

In my post yesterday comparing efficiency in joining two data frames, I overlooked the computing cost of converting data.frames to data.tables / ff data objects. Today, I ran the test again, taking library loading and data conversion into account. After 10 replications with the rbenchmark package, ... [Read more...]

Efficiency in Joining Two Data Frames

January 28, 2013 | statcompute

In R, there are multiple ways to merge two data frames. However, there could be a huge disparity in terms of efficiency. Therefore, it is worthwhile to test the performance of different methods and choose the right approach for real-world work. For smaller data frames with 1,000 rows, all six methods ... [Read more...]
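For readers unfamiliar with the operation being benchmarked, here is a minimal sketch of joining two data frames with base R's merge(). The excerpt does not list its six methods, so this is only one plausible baseline, and the toy data frames customers and orders are invented for the example.

```r
# Two toy data frames sharing the key column "id" (hypothetical data).
customers <- data.frame(id = c(1, 2, 3), name = c("A", "B", "C"))
orders <- data.frame(id = c(2, 3, 3, 4), amount = c(20, 30, 35, 40))

# Inner join: keep only ids present in both data frames.
inner <- merge(customers, orders, by = "id")

# Left join: keep every customer, filling unmatched amounts with NA.
left <- merge(customers, orders, by = "id", all.x = TRUE)

print(inner)
print(left)
```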

PART – A Rule-Learning Algorithm

January 11, 2013 | statcompute

> require('RWeka')
> require('pROC')
>
> # SEPARATE DATA INTO TRAINING AND TESTING SETS
> df1 <- read.csv('credit_count.csv')
> df2 <- df1[df1$CARDHLDR == 1, 2:12]
> set.seed(2013)
> rows <- sample(1:nrow(df2), nrow(df2) - 1000)
> set1 <- df2[rows, ]
> set2 <- df2[-rows, ]
>
> # BUILD A PART RULE MODEL
> mdl1 <- PART(factor(BAD) ~., data = set1)
> print(mdl1)
PART decision list
------------------

EXP_INC > 0.000774 AND
AGE > 21.833334 AND
INCOME > 2100 AND
MAJORDRG <= 0 AND
OWNRENT > 0 AND
MINORDRG <= 1: 0 (2564.0/103.0)

AGE > 21.25 AND
EXP_INC > 0.000774 AND
INCPER > 17010 AND
INCOME > 1774.583333 AND
MINORDRG <= 0: 0 (2278.0/129.0)

AGE > 20.75 AND
EXP_INC > 0.016071 AND
OWNRENT > 0 AND
SELFEMPL > 0 AND
EXP_INC <= 0.233759 AND
MINORDRG <= 1: 0 (56.0)

AGE > 20.75 AND
EXP_INC > 0.016071 AND
SELFEMPL <= 0 AND
OWNRENT > [...]

[Read more...]

Efficiency of Extracting Rows from A Data Frame in R

January 1, 2013 | statcompute

In the example below, 552 rows are extracted from a data frame with 10 million rows using six different methods. Results show a significant disparity between the least and the most efficient methods in terms of CPU time. Similar to the finding in my previous post, the method with the data.table package ... [Read more...]

Modeling in R with Log Likelihood Function

December 30, 2012 | statcompute

Similar to the NLMIXED procedure in SAS, optim() in R can estimate a model from an explicitly specified log likelihood function. Below is a demo showing how to estimate a Poisson model with optim() and how the result compares with that of glm(). [Read more...]
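The post's demo is not included in this excerpt, so the following is only a sketch of the idea under stated assumptions: the data are simulated with made-up coefficients (0.5 and 0.8), the model has one predictor and a log link, and the BFGS method is my choice rather than something the post specifies.

```r
# Simulate a small Poisson data set (hypothetical coefficients).
set.seed(1)
n <- 500
x <- rnorm(n)
y <- rpois(n, lambda = exp(0.5 + 0.8 * x))

# Negative log likelihood of a Poisson model with a log link.
# optim() minimizes by default, hence the sign flip.
neg_loglik <- function(beta) {
  mu <- exp(beta[1] + beta[2] * x)
  -sum(dpois(y, lambda = mu, log = TRUE))
}

# Estimate by direct optimization of the likelihood.
fit_optim <- optim(par = c(0, 0), fn = neg_loglik, method = "BFGS")

# Estimate the same model with glm() for comparison.
fit_glm <- glm(y ~ x, family = poisson)

comparison <- cbind(optim = fit_optim$par, glm = unname(coef(fit_glm)))
rownames(comparison) <- c("intercept", "slope")
print(comparison)
```

Both approaches maximize the same likelihood, so the two columns should agree to several decimal places; small differences come from the numerical gradient and stopping tolerance in optim().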

Surprising Performance of data.table in Data Aggregation

December 28, 2012 | statcompute

data.table (http://datatable.r-forge.r-project.org/) inherits from data.frame and provides fast subsetting, fast grouping, and fast joins. Previous posts showed that the shortest CPU time to aggregate a data.frame with 13,444 rows and 14 columns 10 times is 0.236 seconds with summarize() in Hmisc ... [Read more...]

More about Aggregation by Group in R

December 24, 2012 | statcompute

Motivated by my young friend, HongMing Song, I managed to find more handy ways to calculate aggregated statistics by group in R. They require loading additional packages, plyr, doBy, Hmisc, and gdata, and are extremely user-friendly. In terms of CPU time, while the method with summarize() is as efficient as ... [Read more...]
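As a point of reference for the package-based methods named above, grouped statistics can also be computed in base R. The sketch below uses aggregate() and tapply(), which are not among the packages this excerpt lists, on an invented toy data frame.

```r
# Toy data: one grouping column and one numeric column (hypothetical).
df <- data.frame(group = c("a", "b", "a", "b", "a"),
                 x = c(1, 2, 3, 4, 8))

# Formula interface: mean of x within each group, returned as a data frame.
agg <- aggregate(x ~ group, data = df, FUN = mean)

# tapply() gives the same means as a named vector.
by_tapply <- tapply(df$x, df$group, mean)

print(agg)
print(by_tapply)
```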

Aggregation by Group in R

December 23, 2012 | statcompute

Efficiency Comparison among 4 Methods above [Read more...]

Data Import Efficiency – A Case in R

December 23, 2012 | statcompute

Below is an R snippet comparing data import efficiency among CSV, SQLite, and HDF5. Similar to the case in Python posted yesterday, HDF5 shows the highest efficiency. [Read more...]

Removing Records by Duplicate Values in R – An Efficiency Comparison

December 20, 2012 | statcompute

After posting “Removing Records by Duplicate Values” yesterday, I had an interesting conversation with my friend Jeffrey Allard tonight about how to code this in R, either with a combination of order() and duplicated() or with sqldf(). Afterward, I did a simple efficiency comparison between the two methods, as shown below. The comparison result ... [Read more...]
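The order() plus duplicated() combination mentioned here might look like the sketch below. It is not the post's benchmark code: the data frame, the column names id and score, and the rule "keep the highest score per id" are assumptions made for illustration.

```r
# Toy data with repeated ids (hypothetical).
df <- data.frame(id = c(2, 1, 2, 3, 1),
                 score = c(10, 5, 30, 7, 8))

# Sort by id, then by descending score, so that the record to keep
# comes first within each id.
sorted <- df[order(df$id, -df$score), ]

# duplicated() flags every occurrence of an id after the first one;
# negating it keeps exactly one record per id.
deduped <- sorted[!duplicated(sorted$id), ]

print(deduped)
```

The sort order decides which record survives, so changing the second argument of order() changes the retention rule without touching the duplicated() step.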

Removing Records by Duplicate Values

December 20, 2012 | statcompute

Removing records from a data table based on duplicate values in one or more columns is a commonly used and important data cleaning technique. Below are examples of how to accomplish this task in SAS, R, and Python, respectively. SAS Example R Example Python Example [Read more...]

