Employee flight risk modeling behavior

[This article was first published on Stories Data Speak, and kindly contributed to R-bloggers.]
An analytical model for predicting employee flight risk behaviour

“People are the nucleus of any organization. So, how can you find, engage and retain top performers who’ll contribute to your goals, your future?”

There is no dearth of Enterprise Resource Planning (ERP) systems used by human resource companies; however, adding machine learning to such ERP systems can be very useful. This leads to the following question.

A. Question

Can a predictive model be developed to understand the reasons why employees leave the organization?

B. Objectives

This report has two objectives, namely:

i. To conduct an exploratory data analysis to determine any possible relationships between the variables.

ii. To develop a predictive model for identifying the potential reasons for employee attrition.

C. Data Analysis

A systematic data analysis was undertaken to answer the business question and meet the objectives.

i. Exploratory Data Analysis (EDA)

The training set had 13000 observations in 11 columns. The test set had 1999 observations in 10 columns. There were no missing values. My observations are as follows:

Fig-1: Correlation plot

a. I renamed some variables: “sales” was renamed to “role”, and “time_spend_company” was renamed to “exp_in_company”.

b. The employee attrition rate was 21.41%

c. The company had an employee attrition rate of 24%

d. The mean satisfaction of employees was 0.61

e. The correlation plot in Fig-1 shows a positive (+) correlation between projectCount, averageMonthlyHours, and evaluation. This could mean that employees who spent more hours and did more projects were evaluated more highly.

f. Among the negative (-) relationships, employee attrition and satisfaction are strongly correlated. People probably tend to leave a company more often when they are less satisfied.
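The kind of relationship read off a correlation plot can be illustrated with a small sketch. The snippet below computes the Pearson correlation coefficient by hand in Python. The numbers are made-up toy values, not the HR data, and the variable names are assumptions chosen to mirror the features discussed above.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and sums of squared deviations
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Toy values only (assumed for illustration, not taken from the data set)
satisfaction = [0.9, 0.8, 0.7, 0.4, 0.2, 0.1]
left = [0, 0, 0, 1, 1, 1]            # 1 = employee left
project_count = [2, 3, 4, 5, 6, 7]
monthly_hours = [150, 160, 200, 210, 250, 270]

r_sat_left = pearson(satisfaction, left)             # negative in this toy data
r_proj_hours = pearson(project_count, monthly_hours) # positive in this toy data
print(round(r_sat_left, 2), round(r_proj_hours, 2))
```

A coefficient near -1 or +1 indicates a strong linear relationship, while a value near 0 indicates little linear relationship. Correlation alone does not establish that one variable causes the other.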

g. A one-sample t-test was conducted to compare the mean satisfaction level of employees who left with that of the entire employee population.

  1. Hypothesis Testing: Is there a significant difference in the means of satisfaction level between attrition and the entire employee population?

1.1. Null Hypothesis: (H0: pEmployeeLeft = pEmployeePop) The null hypothesis is that there is no difference in satisfaction level between attrition and the entire employee population.

1.2. Alternate Hypothesis: (HA: pEmployeeLeft != pEmployeePop) The alternative hypothesis is that there is a difference in satisfaction level between attrition and the entire employee population.

Findings

I then conducted a t-test at the 95% confidence level to see whether it rejects the null hypothesis that the sample comes from the same distribution as the employee population.
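A one-sample t-test of this kind can be sketched as follows. This is a minimal Python illustration, not the code used for the analysis. The sample of satisfaction scores is made up, the population mean of 0.61 is the mean satisfaction reported earlier, and the critical value 2.262 is the two-sided 95% t value for 9 degrees of freedom, hard-coded here because the toy sample has 10 observations.

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_t(sample, pop_mean):
    """Return the t statistic and degrees of freedom for a one-sample t-test."""
    n = len(sample)
    standard_error = stdev(sample) / sqrt(n)  # stdev uses the n - 1 denominator
    t_stat = (mean(sample) - pop_mean) / standard_error
    return t_stat, n - 1

# Toy satisfaction scores for employees who left (assumed values)
left_satisfaction = [0.10, 0.45, 0.40, 0.11, 0.80, 0.37, 0.09, 0.43, 0.75, 0.41]
POP_MEAN = 0.61          # mean satisfaction of all employees, from the text
T_CRITICAL_DF9 = 2.262   # two-sided 95% critical value for df = 9

t_stat, df = one_sample_t(left_satisfaction, POP_MEAN)
reject_null = abs(t_stat) > T_CRITICAL_DF9
print(f"t = {t_stat:.3f}, df = {df}, reject H0: {reject_null}")
```

The null hypothesis is rejected when the absolute t statistic exceeds the critical value. With a different sample size, the critical value has to be looked up again for the new degrees of freedom.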

Findings

Inference

The above findings do not necessarily mean that the result is of practical significance. Two ways to strengthen the conclusion are to collect more data or to conduct more experiments.

h. Now let’s look at some distribution plots using some of the employee features like “Satisfaction”, “Evaluation” and “Average monthly hours”.

Summary: Let’s examine the distributions of some of the employee features.

Here’s what I found:

i. The relationship between Salary and Attrition

Fig-2: Salary vs Attrition plot

j. The relationship between Department and Attrition

Fig-3: Department vs Attrition plot

k. The relationship between Attrition and ProjectCount

Fig-4: Project count vs Attrition plot

l. The relationship between Attrition and Evaluation

Fig-5: Employee evaluation vs Attrition plot

m. The relationship between Attrition and AverageMonthlyHours

Fig-6: Average monthly hours worked vs Attrition plot

Key Observations: Fig-7 shows the factors that serve as the top reasons for attrition in a company:

Fig-7: Feature importance plot

ii. Data modeling

Base rate model: Recall from Part 4.1: Exploring the Data that 24% of the dataset contained 1s (employees who left the company) and the remaining 76% contained 0s (employees who did not leave the company). The base rate model simply predicts 0 for every employee and ignores all the 1s. Its accuracy on this data set would therefore be 76%, because 76% of the observations are labeled 0 (employees not leaving the company).

The training data was split into a 75% train set and a 25% validation set. An initial logistic regression model based on all 10 independent variables (or features) was built on the train set and tested on the validation set. An initial predictive accuracy of 78% was obtained.
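The base rate calculation and the 75%/25% split can be sketched in a few lines of Python. This is an illustration under stated assumptions: the labels are synthetic (24 ones and 76 zeros to match the proportions above), and the function names and the seed are hypothetical choices, not part of the original analysis.

```python
import random

def base_rate_accuracy(labels):
    """Accuracy of a model that always predicts the most common label."""
    majority = max(set(labels), key=labels.count)
    correct = sum(1 for y in labels if y == majority)
    return majority, correct / len(labels)

def train_validation_split(rows, train_fraction=0.75, seed=42):
    """Shuffle a copy of the rows and cut it into train and validation parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Synthetic labels: 1 = left, 0 = stayed (proportions taken from the text)
labels = [1] * 24 + [0] * 76

majority, accuracy = base_rate_accuracy(labels)
train, valid = train_validation_split(labels)
print(majority, accuracy, len(train), len(valid))
```

Any candidate model has to beat the base rate accuracy to be considered useful, which is why the 78% of the initial logistic regression is only a small improvement over 76%.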

Thereafter, I built four models based on the following classifiers, namely:

a. Classification And Regression Trees (CART),

b. Support Vector Machine (SVM),

c. k-nearest neighbor (knn) and

d. logistic regression

The CART model on the validation set gave an accuracy of 97% while the knn model gave an accuracy of 99.79%. See Fig-8.

Fig-8: Predictive modeling results
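To make the k-nearest neighbor idea concrete, here is a minimal Python sketch. It is not the model used for the results above. The two features (satisfaction and evaluation, both on a 0 to 1 scale), the toy points, and k = 3 are all assumptions for illustration.

```python
from collections import Counter
from math import dist

def knn_predict(train_x, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest points."""
    # Sort training points by Euclidean distance to the query and keep the k closest
    neighbors = sorted(zip(train_x, train_y), key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy training data: (satisfaction, evaluation) -> 1 = left, 0 = stayed
train_x = [(0.90, 0.80), (0.80, 0.70), (0.85, 0.90),
           (0.10, 0.80), (0.20, 0.90), (0.15, 0.85)]
train_y = [0, 0, 0, 1, 1, 1]

pred_unhappy = knn_predict(train_x, train_y, (0.12, 0.82))
pred_happy = knn_predict(train_x, train_y, (0.88, 0.75))
print(pred_unhappy, pred_happy)
```

Because the method relies on distances, features measured on very different scales (for example monthly hours versus a 0 to 1 satisfaction score) are usually rescaled before fitting.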

I chose the knn model as the final model and tested it on the hr_attrition_test data. As shown in Fig-8, the knn model has the highest accuracy and kappa statistic. In conclusion, using the knn modeling technique, employee attrition can be predicted with an accuracy of 99.79%.
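Accuracy and the kappa statistic can both be computed from the actual and predicted labels. The Python sketch below uses made-up labels, and the function name is hypothetical. It shows why kappa is a useful companion to accuracy on imbalanced data: a model that always predicts 0 gets a kappa of 0 even though its accuracy looks respectable.

```python
def accuracy_and_kappa(actual, predicted):
    """Return (accuracy, Cohen's kappa) for two equal-length label lists."""
    n = len(actual)
    observed = sum(a == p for a, p in zip(actual, predicted)) / n
    labels = set(actual) | set(predicted)
    # Agreement expected by chance, from the marginal label frequencies
    expected = sum((actual.count(l) / n) * (predicted.count(l) / n) for l in labels)
    # Note: undefined (division by zero) when expected agreement equals 1
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa

# Toy labels (assumed values): 1 = left, 0 = stayed
actual    = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
predicted = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
always_stay = [0] * len(actual)

acc, kappa = accuracy_and_kappa(actual, predicted)
base_acc, base_kappa = accuracy_and_kappa(actual, always_stay)
print(acc, round(kappa, 3), base_acc, base_kappa)
```

Kappa measures agreement beyond what chance alone would produce, so values close to 1 indicate that a high accuracy is not just a by-product of the class imbalance.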

Summary

Code and Dataset
