Data Mining Standard Process across Organizations
[This article was first published on Data Perspective, and kindly contributed to R-bloggers].
I recently came across the term CRISP-DM, a data mining standard. Although the process is not new, I feel every analyst should know about this commonly used, industry-wide process. In this post I explain the phases involved in creating a data mining solution.
CRISP-DM, an acronym for Cross Industry Standard Process for Data Mining, is a data mining process model that describes the approaches data analytics organizations commonly use to tackle business problems. Polls conducted on the same website (KDnuggets) in 2002, 2004, 2007 and 2014 show that it was the leading methodology among the industry data miners who chose to respond to the survey.
The CRISP-DM model is a phased approach to tackling a business problem. The phases involved in the model are defined below:
Use case Identification: This is the initial phase of CRISP-DM, in which a potential business problem is formulated as a data mining use case. Brainstorming sessions are held at various levels among the stakeholders to define the problem statement, its impact on the business, a clear objective for the solution, and its timelines.
In order, the phases of the model are:
- Use case Identification
- Business Understanding
- Data Acquisition and Data Understanding
- Data Preparation
- Exploratory Analysis
- Data Modeling
- Data Evaluation
- Deployment
Audience:
- higher management
- IT teams – Application team, DBA team
- Analytics team – Data Scientist
Audience:
- domain experts – for domain knowledge, business rules understanding
- IT teams – for identifying data sources and key features of the system
- Analytics teams – Data Scientist
Data Preparation: This phase of CRISP-DM involves preparing the data to be fed into the data mining algorithms by processing and cleaning the raw data. It is one of the most crucial steps in data mining, because the accuracy of the solution depends on the quality of the data. All the preparation activities needed to create the final dataset are done here: handling missing data with methods such as imputation, converting data into suitable formats (for example, unstructured to structured), identifying outliers, normalizing the data, and so on.
Audience:
- Data Analytics team
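As a rough illustration of the preparation steps listed under Data Preparation, here is a minimal Python sketch that uses only the standard library. The sample values and the helper names (impute_missing, flag_outliers, min_max_scale) are made up for this example, and median imputation, Tukey's IQR rule and min-max scaling are just one possible choice for each step, not something CRISP-DM prescribes.

```python
from statistics import median, quantiles


def impute_missing(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def flag_outliers(values, k=1.5):
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


# Hypothetical raw column with one missing value and one suspicious reading.
raw = [10, None, 11, 13, 12, 11, 100]

filled = impute_missing(raw)            # step 1: fill the gap
outlier_flags = flag_outliers(filled)   # step 2: mark extreme values
kept = [v for v, is_outlier in zip(filled, outlier_flags) if not is_outlier]
scaled = min_max_scale(kept)            # step 3: normalize what remains

print(filled)
print(outlier_flags)
print(scaled)
```

Whether an outlier should be dropped, capped or kept is a decision for the business and domain experts; the sketch drops it only to keep the example short.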
Performing an exploratory analysis helps us to:
- Understand the causes of an observed event.
- Understand the nature of the data we are dealing with.
- Assess the assumptions on which our analysis will be based.
- Identify the key features in the data needed for the analysis.
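A first exploratory pass often starts with summary statistics and a look at how pairs of variables move together. The sketch below assumes two hypothetical numeric columns, ad_spend and sales, with made-up values; summarize and pearson are illustrative helpers written for this post, not functions from any library.

```python
from statistics import mean, median, pstdev


def summarize(values):
    """Basic descriptive statistics for one numeric column."""
    return {
        "n": len(values),
        "mean": mean(values),
        "median": median(values),
        "stdev": pstdev(values),  # population standard deviation
        "min": min(values),
        "max": max(values),
    }


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length columns.

    Assumes neither column is constant (otherwise the denominator is zero).
    """
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


# Hypothetical data: advertising spend and resulting sales.
ad_spend = [1, 2, 3, 4, 5]
sales = [2.1, 3.9, 6.2, 7.8, 10.1]

spend_summary = summarize(ad_spend)
r = pearson(ad_spend, sales)

print(spend_summary)
print(round(r, 3))
```

A correlation close to 1 or -1 suggests a feature worth keeping for the modeling phase, although correlation alone does not establish the cause of an observed event.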
Data Modeling: In this phase, various modeling techniques are selected and applied to the data to extract features, build the model, tune it, and calibrate its parameters to optimal values. Typically this involves applying suitable data mining or machine learning algorithms to the dataset. Some problems can be solved with a single method, whereas others require a combination of techniques.
For example, Netflix's recommendation system uses a combination of Boltzmann machines, gradient boosted decision trees, logistic regression, etc.
Sometimes different methods are also applied separately in order to select the best one for the problem at hand.
For example, logistic regression, decision trees and random forests are each applied to the dataset to see which yields the best model.
During modeling, the dataset is divided into two sets: a training set and a test set. The model is built using the training set, and the test set is used to evaluate it.
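The split-then-compare idea can be sketched in a few lines of Python. To stay dependency-free, this example swaps in two much simpler candidates than the ones named above: a mean baseline and a least-squares line. The data is synthetic, and the helper names, the 25% test fraction and the seeds are all assumptions made for illustration.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split it into (train, test)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_mean(train):
    """Candidate 1: always predict the mean target seen in training."""
    avg = sum(y for _, y in train) / len(train)
    return lambda x: avg


def fit_line(train):
    """Candidate 2: ordinary least-squares line y = slope * x + intercept."""
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    sxx = sum((x - mx) ** 2 for x, _ in train)
    sxy = sum((x - mx) * (y - my) for x, y in train)
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: slope * x + intercept


def mse(model, rows):
    """Mean squared prediction error of a model on the given rows."""
    return sum((model(x) - y) ** 2 for x, y in rows) / len(rows)


# Synthetic dataset: y = 2x + 1 plus Gaussian noise.
noise = random.Random(42)
data = [(x, 2 * x + 1 + noise.gauss(0, 1)) for x in range(40)]

train, test = train_test_split(data)
candidates = {"mean baseline": fit_mean, "least-squares line": fit_line}

# Fit every candidate on the training set only, then score it on the test set.
scores = {name: mse(fit(train), test) for name, fit in candidates.items()}
best = min(scores, key=scores.get)

print(scores)
print("selected:", best)
```

The important point is that the test set is never used for fitting, so the comparison reflects how each candidate behaves on data it has not seen.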
Data Evaluation: This is the follow-up step to the data modeling phase. The model built in the previous step needs to be thoroughly validated before moving to deployment, and it should address all the business objectives mentioned in the problem statement. The test set created in the previous step is used to test the model. The objective of this step is to check the prediction error made on the test set. If the prediction error is low, the model is good to go. Sometimes the error is large, which indicates underfitting or overfitting. Based on the results we might have to go back to previous phases and tune the model.
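One common way to read the evaluation result is to compare the error on the training set with the error on the test set. The sketch below is only a rule of thumb: the function name diagnose_fit and the acceptable_error threshold are assumptions for this example, and in practice the threshold would come from the business objective.

```python
def diagnose_fit(train_error, test_error, acceptable_error):
    """Classify a model from its training and test errors (heuristic)."""
    if train_error > acceptable_error:
        # The model cannot even fit the data it was trained on.
        return "underfitting"
    if test_error > acceptable_error:
        # Fits the training data well but fails on unseen data.
        return "overfitting"
    return "acceptable"


# Hypothetical error values, with 2.0 as the largest error we can tolerate.
print(diagnose_fit(9.0, 9.5, acceptable_error=2.0))
print(diagnose_fit(0.1, 8.0, acceptable_error=2.0))
print(diagnose_fit(0.9, 1.1, acceptable_error=2.0))
```

An "underfitting" or "overfitting" verdict is the signal to go back to an earlier phase, for example to add features or simplify the model.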
Deployment: Once model building and evaluation are complete and we are satisfied with the results, the next step is to present the results to the business users. The published results should be in a form that users can read and understand. Most of the time the results are published as reports or through a UI. For example, if top management needs the results to make key business decisions, visualization reports are the appropriate format. If an end user needs to be recommended a new item on an e-commerce website, the results should be displayed in the web UI.
Most of the time, moving back and forth between phases is required. For example, if while evaluating the model we find that it suffers from overfitting, we can go back to the modeling phase and fine-tune the model. As another example, if in the modeling phase we observe that a sparsely populated feature column is critical to the solution, we go back to the Business Understanding step and consult the domain experts to find out whether we can derive more information about that column and impute it with relevant values.
For more information about CRISP-DM, see the wiki page here.