Machine learning strategy is about how to approach machine learning tasks strategically. In three blog posts I will try to give an introduction to this topic, and I hope for comments and opinions on it.
Task Definition
The first step of a machine learning task is usually to define the task itself. What is the purpose, what do we want to achieve? Sometimes the target is not so clear at the beginning.
After that, the next step is to think about how to achieve the task in the best way. Questions that arise are:
- Which and how much data should be collected or are available?
- How should the data be structured, transformed and divided?
- Which algorithm with which hyperparameters should be used?
How can these questions be answered?
Target definition
The first step is to clearly define the target:
- What metric(s) do we want to optimize? (optimizing metrics)
- Under which constraints should they be optimized? (satisficing metrics)
- Are there observations that should be given a stronger weight?
Example: We want to minimize the mean squared error under the constraint that the runtime is less than 5 minutes.
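The example can be sketched in code. This is only a minimal illustration, assuming a handful of hypothetical candidate models whose mean squared error and runtime have already been measured; the names `Candidate`, `select_model` and `weighted_mse`, as well as all numbers, are made up for this sketch. The satisficing metric (runtime) acts as a filter, and the optimizing metric (MSE) decides among the candidates that pass it. A weighted MSE is included as one possible way to give some observations a stronger weight.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str                # hypothetical model label
    mse: float               # optimizing metric: lower is better
    runtime_seconds: float   # satisficing metric: must stay below a limit


def select_model(candidates, max_runtime_seconds=300.0):
    """Return the lowest-MSE candidate among those below the runtime limit."""
    feasible = [c for c in candidates if c.runtime_seconds < max_runtime_seconds]
    if not feasible:
        return None  # no candidate satisfies the constraint
    return min(feasible, key=lambda c: c.mse)


def weighted_mse(y_true, y_pred, weights):
    """Mean squared error in which each observation carries its own weight."""
    weighted_sum = sum(w * (t - p) ** 2 for t, p, w in zip(y_true, y_pred, weights))
    return weighted_sum / sum(weights)


# Made-up measurements, purely for illustration.
candidates = [
    Candidate("model_a", mse=0.80, runtime_seconds=120.0),
    Candidate("model_b", mse=0.50, runtime_seconds=900.0),  # best MSE, but too slow
    Candidate("model_c", mse=0.65, runtime_seconds=240.0),
]
best = select_model(candidates)  # 5 minutes = 300 seconds
print(best.name)
```

Treating the constraint as a hard filter keeps the comparison one-dimensional: among all models that are fast enough, only the MSE matters.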
Data split
The next step is to divide the data into different parts. Usually the data is divided into three parts:
- Training data: Is used for training the algorithm
- Development data: Is used for evaluating the trained algorithm iteratively
- Test data: Is used for the final evaluation of the trained algorithm
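A three-way split like this might be sketched as follows. The function name `split_data`, the default fractions and the fixed seed are assumptions made for this illustration; the data is shuffled first so that all three parts come from the same distribution.

```python
import random


def split_data(observations, train_frac=0.6, dev_frac=0.2, seed=42):
    """Shuffle the observations and cut them into training, development and test parts."""
    if train_frac <= 0 or dev_frac <= 0 or train_frac + dev_frac >= 1:
        raise ValueError("fractions must be positive and leave room for test data")
    shuffled = list(observations)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split reproducible
    n_train = round(len(shuffled) * train_frac)
    n_dev = round(len(shuffled) * dev_frac)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]      # whatever remains becomes test data
    return train, dev, test


train, dev, test = split_data(range(1000))
print(len(train), len(dev), len(test))
```

The test part should then be set aside and only touched for the final evaluation.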
By using (repeated) cross-validation, training and development data can be interchanged. For example, in 5-fold cross-validation the data is divided into 5 parts, and each part is used once as development data for evaluating the metric while the other parts are used for training. At the end one can, for example, take the mean of the results on the development data.
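The procedure can be sketched in plain Python. The helper names (`k_fold_indices`, `cross_validate`) and the toy "model" that always predicts the training mean are assumptions for illustration only; in practice one would plug in a real learning algorithm and probably use an existing cross-validation implementation.

```python
import random
from statistics import mean


def k_fold_indices(n_observations, k=5, seed=0):
    """Yield (train_indices, dev_indices) pairs, one pair per fold."""
    indices = list(range(n_observations))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal parts
    for i, dev_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, dev_idx


def cross_validate(x, y, fit, predict, metric, k=5):
    """Train on k-1 parts, evaluate on the remaining part, and average the metric."""
    scores = []
    for train_idx, dev_idx in k_fold_indices(len(x), k):
        model = fit([x[i] for i in train_idx], [y[i] for i in train_idx])
        predictions = [predict(model, x[i]) for i in dev_idx]
        scores.append(metric([y[i] for i in dev_idx], predictions))
    return mean(scores)


# Toy stand-ins: the "model" is just the mean of the training targets.
def fit_mean(x_train, y_train):
    return mean(y_train)


def predict_mean(model, x_value):
    return model


def mse(y_true, y_pred):
    return mean([(t - p) ** 2 for t, p in zip(y_true, y_pred)])


x = list(range(20))
y = [2.0 * value for value in x]
cv_mse = cross_validate(x, y, fit_mean, predict_mean, mse, k=5)
print(cv_mse)
```

Repeated cross-validation would run this several times with different seeds and average the results again.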
How to divide the data?
- Development and test data should represent the final data for which we train the algorithm.
- Ideally, development and test data should have the same probability distribution.
Data sizes:
- Classical division: 60% training, 20% development, 20% test → This makes sense for small datasets (100 to 100 000 observations) with enough data for development and test data.
- For larger datasets (e.g. 1 000 000 observations) it might be enough to have smaller development and test data sets (e.g. 98%/1%/1%) or smaller training data sets (depending on the runtime and learning curve of the algorithm).
- The statistical field of sample planning can be used to estimate how much data is needed for properly evaluating metrics on the development and test data sets → Guideline: Use enough data that the performance can be estimated well enough on development and test data; the result should not be good or bad merely by chance.
- If only training and development data are used, there is a danger of overfitting to the development data, and the results may not generalize.
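To make the sample-planning idea a bit more concrete, here is a rough sketch under strong assumptions: the metric is a proportion such as classification accuracy (not the mean squared error from the earlier example), observations are independent, and the normal approximation to the binomial distribution holds. The function names are made up for this illustration. Under these assumptions, the half-width of a confidence interval for an accuracy p estimated from n observations is approximately z * sqrt(p * (1 - p) / n).

```python
import math
from statistics import NormalDist


def margin_of_error(n, expected_accuracy=0.5, confidence=0.95):
    """Approximate half-width of the confidence interval for an accuracy estimate."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return z * math.sqrt(expected_accuracy * (1 - expected_accuracy) / n)


def required_size(margin, expected_accuracy=0.5, confidence=0.95):
    """Smallest n for which the approximate margin of error is at most `margin`."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil(z ** 2 * expected_accuracy * (1 - expected_accuracy) / margin ** 2)


# Worst case (accuracy 0.5): how many observations for +/- 1 percentage point?
n_needed = required_size(margin=0.01)
print(n_needed)

# With 1% of 1 000 000 observations as development data:
print(margin_of_error(10_000))
```

Under these assumptions, 1% of a data set with 1 000 000 observations already pins the accuracy down to roughly one percentage point, which is in line with the 98%/1%/1% split mentioned above.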
In the next blog post, I will write more about the possibilities for improving an algorithm once it has been trained on the training data, and how this can be done in an iterative process.
This blog post is partly based on information from a course about deep learning on coursera.org that I took recently. Hence, a lot of credit for this post goes to Andrew Ng, who taught this course.
Feel free to leave a comment below and share your experiences and opinions about this topic. How do you tackle machine learning problems strategically?