
What you need to know about data augmentation for machine learning


Plentiful high-quality data is the key to great machine learning models. But good data doesn’t grow on trees, and that scarcity can impede the development of a model. One way to get around a lack of data is to augment your dataset. Smart approaches to programmatic data augmentation can increase the size of your training set 10-fold or more. Even better, your model will often be more robust (and less prone to overfitting) and can even be simpler, thanks to the better training set.

There are many approaches to augmenting data. The simplest approaches include adding noise and applying transformations on existing data. Imputation and dimensional reduction can be used to add samples in sparse areas of the dataset. More advanced approaches include simulation of data based on dynamic systems or evolutionary systems. In this post we’ll focus on the two simplest approaches: adding noise and applying transformations.

Regression Problems

In many datasets we expect some unavoidable statistical noise due to sampling and other factors. For regression problems, we can explicitly add noise to our explanatory variables. Doing so can make the model more robust, although we need to take care when constructing the noise term: it should add no bias, and it should be independent of the explanatory variables.

In my function approximation example, I demonstrated creating a simple neural network in Torch to approximate a function. Let’s use the same dataset and function but this time add noise as a preprocessing step. Before, I generated the training data directly in Torch. However, in my simple workflow for deep learning, I said I prefer using R for everything but training the model. Hence, I generate and add noise in R and then write out a headerless CSV file. This process expands the training set from 40k samples to ~160k samples.

The following function accomplishes this. It takes an arbitrary noise function as an argument; the default adds uniform noise.

# Create `mult` noisy copies of the data and stack them together. `fr` is the
# noise function applied to the explanatory columns; the output column z is left alone.
perturb_unif <- function(df, mult=4, fr=function(a) a + runif(length(a)) - .5) {
  fn <- function(i) data.frame(x=fr(df[,1]), y=fr(df[,2]), z=df[,3])
  o <- lapply(1:mult, fn)
  do.call(rbind, o)
}

Exercise: What is the purpose of the -0.5 term?
Exercise: What are some other valid noise functions? Why would you choose one over another?
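To make the workflow concrete, here is a minimal sketch of the pre-processing step, assuming df holds the original 40k (x, y, z) samples from the earlier post (the data frame and file names are illustrative):

train <- perturb_unif(df, mult=4)  # four noisy copies, roughly 160k rows
# Write a headerless CSV for the Torch side of the workflow
write.table(train, "train.csv", sep=",", row.names=FALSE, col.names=FALSE)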

I then read this file into Lua using a custom function loadTrainingSet, which is part of my deep_learning_ex guide. This function reads the CSV and creates a Torch-compatible table comprising the input and output tensors. The function simply assumes that the last column of the CSV is the output.

Using this approach, it’s possible to create a network with 20 hidden nodes that performs as well as the 40-node network in the earlier post. Think about this: by adding noise (and increasing the size of the training set), we’ve halved the complexity of the network.

Adding noise simplifies the model and makes it more robust

Kool-Aid Alert: Depending on the domain and hyperparameters chosen, this approach may not produce desirable results. As with most deep learning exercises, tuning of hyperparameters is mandatory.

Classification Problems

Noise can be used in classification problems as well. One particularly useful application is balancing a dataset. In a binary classification problem, suppose the data is split 80-20 between the two classes. It is well known that such an imbalanced set is problematic for machine learning algorithms. Some will simply default to predicting the majority class, since the naïve accuracy is already reasonable. In small datasets, rebalancing by trimming the larger class can be counterproductive, since it throws away scarce data. The alternative is to increase the number of samples in the smaller class, and the same noise augmentation approach works well here.
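As a rough illustration, here is one way to oversample the smaller class with a little jitter; the function, the class column name, and the noise scale are all hypothetical and would need tuning for your data:

# Sketch: rebalance a two-class data frame by resampling minority-class rows
# and jittering their numeric features.
balance_with_noise <- function(df, label="class", sd=0.05) {
  counts   <- table(df[[label]])
  minority <- names(which.min(counts))
  deficit  <- max(counts) - min(counts)
  pool     <- df[df[[label]] == minority, , drop=FALSE]
  extra    <- pool[sample(nrow(pool), deficit, replace=TRUE), , drop=FALSE]
  num_cols <- sapply(extra, is.numeric) & names(extra) != label
  extra[num_cols] <- lapply(extra[num_cols], function(a) a + rnorm(length(a), sd=sd))
  rbind(df, extra)
}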

Another family of classification problems is image classification. Following a similar approach, noise can be added to images, which can make the model more robust. Another popular technique is transforming the data. This makes sense for images, since changes in perspective change the apparent shape of an object. Transparent or reflective surfaces can also distort an object, yet despite this distortion we know the object to be the same. Affine transformations provide simple linear transforms that can expand a dataset: shifting, scaling, rotating, flipping, and so on. They are a good starting point, though some problems may benefit from more complex transformations.

Still a cat

Transformed images can be generated in a pre-processing step as above. The Torch image toolkit can be used this way. For example, here is how to flip an image horizontally.

require 'image'
img = image.load("cat.jpg", 3, 'byte')  -- load as a 3-channel byte tensor
img1 = image.hflip(img)                 -- mirror the image left-right

Original cat

Flipped cat
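If you prefer to stay in R for this pre-processing step, the magick package offers similar transformations. A small sketch, assuming magick is installed and cat.jpg is on disk (the file names are illustrative):

library(magick)

cat_img <- image_read("cat.jpg")
flipped <- image_flop(cat_img)         # mirror left-right
rotated <- image_rotate(cat_img, 15)   # rotate 15 degrees
scaled  <- image_scale(cat_img, "300") # resize to 300 pixels wide
image_write(flipped, "cat_flipped.jpg")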

Alternatively, these variations can be generated inline, during the training process. Keras takes this approach with its ImageDataGenerator class, which transforms images on the fly as batches are fed to the model.
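The same idea is accessible from R through the keras interface. Here is a minimal sketch, assuming the keras package and a backend are installed and that x_train and y_train are placeholder image arrays and labels:

library(keras)

# Sketch: declare on-the-fly augmentation; x_train and y_train are placeholders.
datagen <- image_data_generator(
  rotation_range     = 20,
  width_shift_range  = 0.1,
  height_shift_range = 0.1,
  horizontal_flip    = TRUE
)
batches <- flow_images_from_data(x_train, y_train, datagen, batch_size=32)
# batches can then be fed to the model during training, e.g. via fit_generator()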

Exercise: Which approach will produce better results? Why?

Conclusion

Just because you don’t have as much data as Google or Facebook doesn’t mean you should give up on machine learning. By augmenting your dataset, you can get excellent results with small data.

Do you use approaches not mentioned above to augment your data? Share them in the comments below.

