Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
A Picture is worth a thousand words
Problem Statement
The human capital department of a large corporation wants to know why there is high employee turnover. They also want to understand which employees are more likely to leave, and why.
Aim and Objectives
- Which department has the highest employee turnover? Which one has the lowest?
- Investigate which variables seem to be better predictors of employee departure.
- Recommendations to help reduce employee turnover
Loading the Data
# Packages assumed by the code in this post
library(tidyverse)
library(tidymodels)
library(vip)

# `df` is the employee dataset, already read in as a tibble
head(df)
## # A tibble: 6 x 10
##   department promoted review projects salary tenure satisfaction bonus
##   <chr>         <dbl>  <dbl>    <dbl> <chr>   <dbl>        <dbl> <dbl>
## 1 operations        0  0.578        3 low         5        0.627     0
## 2 operations        0  0.752        3 medium      6        0.444     0
## 3 support           0  0.723        3 medium      6        0.447     0
## 4 logistics         0  0.675        4 high        8        0.440     0
## 5 sales             0  0.676        3 high        5        0.578     1
## 6 IT                0  0.683        2 medium      5        0.565     1
## # ... with 2 more variables: avg_hrs_month <dbl>, left <chr>
Exploratory Data Analysis
glimpse(df)
## Rows: 9,540
## Columns: 10
## $ department    <chr> "operations", "operations", "support", "logistics", "sal~
## $ promoted      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ review        <dbl> 0.5775687, 0.7518997, 0.7225484, 0.6751583, 0.6762032, 0~
## $ projects      <dbl> 3, 3, 3, 4, 3, 2, 4, 4, 4, 3, 4, 3, 3, 3, 3, 3, 4, 4, 3,~
## $ salary        <chr> "low", "medium", "medium", "high", "high", "medium", "hi~
## $ tenure        <dbl> 5, 6, 6, 8, 5, 5, 5, 7, 6, 6, 5, 5, 6, 5, 6, 6, 6, 5, 6,~
## $ satisfaction  <dbl> 0.6267590, 0.4436790, 0.4468232, 0.4401387, 0.5776074, 0~
## $ bonus         <dbl> 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,~
## $ avg_hrs_month <dbl> 180.8661, 182.7081, 184.4161, 188.7075, 179.8211, 178.84~
## $ left          <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "n~

skimr::skim(df)
Name | df |
Number of rows | 9540 |
Number of columns | 10 |
_______________________ | |
Column type frequency: | |
character | 3 |
numeric | 7 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
department | 0 | 1 | 2 | 11 | 0 | 10 | 0 |
salary | 0 | 1 | 3 | 6 | 0 | 3 | 0 |
left | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
---|---|---|---|---|---|---|---|---|---|
promoted | 0 | 1 | 0.03 | 0.17 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
review | 0 | 1 | 0.65 | 0.09 | 0.31 | 0.59 | 0.65 | 0.71 | 1.00 |
projects | 0 | 1 | 3.27 | 0.58 | 2.00 | 3.00 | 3.00 | 4.00 | 5.00 |
tenure | 0 | 1 | 6.56 | 1.42 | 2.00 | 5.00 | 7.00 | 8.00 | 12.00 |
satisfaction | 0 | 1 | 0.50 | 0.16 | 0.00 | 0.39 | 0.50 | 0.62 | 1.00 |
bonus | 0 | 1 | 0.21 | 0.41 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
avg_hrs_month | 0 | 1 | 184.66 | 4.14 | 171.37 | 181.47 | 184.63 | 187.73 | 200.86 |
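The summary table above shows no missing values, but the extreme values of review, satisfaction and avg_hrs_month suggest some outliers in the data. These three continuous variables look roughly symmetric, with means close to their medians, while promoted and bonus are binary indicators.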
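One way to check for outliers directly from the summary table is Tukey's 1.5 x IQR rule. The sketch below is only illustrative: it works from the rounded p0, p25, p75 and p100 values printed by skim() rather than from the raw data, and the helper name `outside_fences` is made up for this example.

```r
# Quartiles copied from the skim() output above (rounded to two decimals)
quartiles <- data.frame(
  variable = c("review", "satisfaction", "avg_hrs_month"),
  p0   = c(0.31, 0.00, 171.37),
  p25  = c(0.59, 0.39, 181.47),
  p75  = c(0.71, 0.62, 187.73),
  p100 = c(1.00, 1.00, 200.86)
)

# Hypothetical helper: TRUE when the minimum or maximum lies beyond
# the 1.5 * IQR fences (Tukey's rule)
outside_fences <- function(p0, p25, p75, p100) {
  iqr   <- p75 - p25
  lower <- p25 - 1.5 * iqr
  upper <- p75 + 1.5 * iqr
  p0 < lower | p100 > upper
}

quartiles$has_outliers <- with(quartiles, outside_fences(p0, p25, p75, p100))
quartiles[, c("variable", "has_outliers")]
```

Running the same rule on the raw columns of `df` would give the number of flagged rows rather than just a yes/no answer per variable.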
Employee Turnover Rate
Let's calculate the employee turnover rate in each department. The employee turnover rate is calculated by dividing the number of employees who left the company by the average number of employees, (employees at the beginning + employees at the end) / 2. This number is then multiplied by 100 to get a percentage.
# Count leavers ("yes") and stayers ("no") per department
status_count <- as.data.frame.matrix(
  df %>%
    select(department, left) %>%
    table()
)

# Note: `/ 2` is applied after the first division, so this expression
# evaluates to yes / (2 * (total + no)) * 100
status_count <- status_count %>%
  mutate(total = no + yes,
         turnover_rate = (yes / (total + no) / 2) * 100)

status_count %>%
  as.data.frame() %>%
  select(turnover_rate) %>%
  arrange(desc(turnover_rate))
##             turnover_rate
## IT               9.136213
## logistics        9.113300
## retail           9.019533
## marketing        8.927259
## support          8.426073
## engineering      8.420039
## operations       8.358896
## sales            8.315268
## admin            8.184319
## finance          7.758621

mean(status_count$turnover_rate)
## [1] 8.565952

range(status_count$turnover_rate)
## [1] 7.758621 9.136213
The average turnover rate across departments is 8.57%. IT has the highest rate at 9.14%, while finance has the lowest at 7.76%.
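It may help to see how the written definition (leavers divided by average headcount) compares with the expression `(yes/(total + no)/2)*100` used in the code. The sketch below uses made-up counts for a single department, and assumes that the headcount at the beginning is `total` and the headcount at the end is `no`, which is the reading implied by `total + no`.

```r
# Hypothetical counts for one department (not taken from the dataset)
yes   <- 30          # employees who left
no    <- 70          # employees who stayed
total <- yes + no    # assumed headcount at the beginning of the period

# Rate as defined in the text: leavers / average headcount * 100
avg_headcount <- (total + no) / 2
rate_defined  <- yes / avg_headcount * 100

# Rate as the expression in the code evaluates (division runs left to right)
rate_as_coded <- (yes / (total + no) / 2) * 100

c(defined = rate_defined, as_coded = rate_as_coded)
```

Under these assumptions the two versions differ by a constant factor of four, so the ranking of departments is the same either way; only the scale of the percentages changes.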
Relationship between employer review, job satisfaction and average monthly hours
df %>%
  select(review, satisfaction, avg_hrs_month) %>%
  cor() %>%
  corrplot::corrplot(method = "number")
Do working hours affect employee departure?
Long working hours can push an employee to leave a company, since they leave little time for personal life and family. They can also lead some employees to ask for a pay raise to compensate for the time spent. Let's see how this factor relates to employee departure and the salary scale.
df %>%
  ggplot(aes(x = left, y = avg_hrs_month, colour = left)) +
  geom_boxplot(outlier.colour = NA) +
  geom_jitter(alpha = 0.05, width = 0.1) +
  facet_wrap(vars(salary), scales = "free", ncol = 3) +
  xlab("Employee Departure") +
  ylab("Average working hours in a month")
Are employees leaving as a result of bad reviews from employer?
df %>%
  ggplot(aes(x = left, y = review, colour = left)) +
  geom_boxplot(outlier.colour = NA) +
  geom_jitter(alpha = 0.05, width = 0.1) +
  facet_wrap(vars(salary), scales = "free", ncol = 3) +
  xlab("Employee Departure") +
  ylab("Employer Review")
Are employees not satisfied with their job?
df %>%
  ggplot(aes(x = left, y = satisfaction, colour = left)) +
  geom_boxplot(outlier.colour = NA) +
  geom_jitter(alpha = 0.05, width = 0.1) +
  facet_grid(cols = vars(promoted)) +
  xlab("Employee Departure") +
  ylab("Job Satisfaction")
Model Building
After performing an exploratory data analysis and understanding our data, we will fit an XGBoost model to the given dataset to predict employee departure.
XGBoost Model
### Data Split
set.seed(2022)
# Note: prop = 0.2 places 20% of the rows in the training set;
# the remaining 80% form the test set
df_split <- initial_split(
  df,
  prop = 0.2,
  strata = left
)

# Data preprocessing
xgboost_recipe <- recipe(formula = left ~ ., data = training(df_split)) %>%
  step_dummy(all_nominal_predictors()) %>%
  # step_zv removes variables that contain only a single value
  step_zv(all_predictors())

# Model specification
xgboost_spec <- boost_tree(trees = 100) %>%
  set_mode("classification") %>%
  set_engine("xgboost")

# Model workflow
xgboost_workflow <- workflow() %>%
  add_recipe(xgboost_recipe) %>%
  add_model(xgboost_spec)

# Fit model
xgb_fit <- xgboost_workflow %>%
  fit(training(df_split))
## [21:54:45] WARNING: amalgamation/../src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.

# Predicted results on test data
pred_class <- predict(xgb_fit, testing(df_split), type = "class")
pred_results <- testing(df_split) %>%
  select(left) %>%
  bind_cols(pred_class) %>%
  mutate(left = as.factor(left))

# Model accuracy
accuracy(pred_results, truth = left, estimate = .pred_class)
## # A tibble: 1 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.853

# Variable importance
# extract_fit_parsnip() replaces the deprecated pull_workflow_fit()
xgb_fit %>%
  extract_fit_parsnip() %>%
  vip(geom = "col")
The model's accuracy on the test set is about 85.3%, which is reasonably good. Average hours per month, job satisfaction and review were the most important variables in the model.
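An accuracy figure is easier to judge against a baseline. When one class is more common than the other, always predicting the majority class can already score well. The sketch below uses small made-up vectors, not the model's actual predictions, to show how accuracy, a majority-class baseline and a confusion matrix could be computed in base R.

```r
# Hypothetical labels and predictions, for illustration only
truth <- factor(c("no", "no", "no", "no", "no", "no", "no", "yes", "yes", "yes"),
                levels = c("no", "yes"))
pred  <- factor(c("no", "no", "no", "no", "no", "no", "yes", "yes", "yes", "no"),
                levels = c("no", "yes"))

# Accuracy: share of predictions that match the truth
accuracy_model <- mean(pred == truth)

# Baseline: accuracy of always predicting the most common class
baseline <- max(prop.table(table(truth)))

# Confusion matrix: shows which class the errors fall in
conf_mat <- table(truth = truth, predicted = pred)

c(model = accuracy_model, baseline = baseline)
conf_mat
```

The same comparison could be made with `pred_results` from the fitted workflow, using the share of "no" in `left` as the baseline.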
Recommendations
Working hours appear to be the most important factor in employee departure. The organization should try to reduce the number of hours worked by employees, especially in departments with a high turnover rate such as IT. Most of the staff who leave have a good working record with the organization. To encourage staff to stay, management needs to increase their pay and promote them when due as a reward for their hard work. If these conditions are met, employee job satisfaction is likely to increase, which will help curb the high rate of employee turnover.