library(tidyverse)
library(doParallel)
library(glue)
library(vroom)
library(janitor)
library(tidymodels)
library(themis)
library(vip)
library(ggcorrplot)
library(fastshap)
library(shapviz)
library(kernelshap)
library(shapper)
library(SHAPforxgboost)
library(stacks)
library(highcharter)
library(patchwork)
library(ggpmisc)
library(parallel)
library(gridExtra)
library(plotly)
Employee data for classification task
Introduction and summary
In today’s corporate environment, understanding the attributes and experiences of employees is paramount. This project analyzes the Employee dataset from Kaggle (kaggle.com), which contains information about a company’s workforce, including educational backgrounds, demographics, and employment factors. It aims to uncover insights that can inform strategies to optimize workforce management and foster a productive, inclusive, and satisfied work environment.
By delving into the dataset, this project will provide an overview of the educational qualifications, examine the variation in employees’ length of service across cities, explore the correlation between payment tier and domain experience, assess the gender distribution within the workforce, and investigate patterns in leave-taking behavior. These insights can empower the company to make data-driven decisions that enhance its operations and employee well-being.
Objective: To analyze a dataset with employee information to gain insights into workforce management.
Analytical Goals:
- Overview of educational qualifications of employees.
- Analysis of service length variation across cities.
- Exploration of payment tiers in relation to domain experience.
- Assessment of gender distribution in the workforce.
- Investigation of leave-taking patterns.
Methodology:
- Extensive data exploration and cleaning.
- Parallel processing to expedite computations.
- Employed a variety of machine learning models (Random Forest, XGBoost, Support Vector Machine, and Stacking) for the classification task.
Conclusion:
- A comprehensive analysis revealed significant correlations between educational background and job performance, underscoring the importance of aligning educational initiatives with career development programs.
- Developed a predictive model with an accuracy of 84.8%.
The full script and results follow.
# speed up computation with parallel processing
all_cores <- detectCores(logical = FALSE)
registerDoParallel(cores = all_cores)
Explore the data
employee <- read_csv("https://raw.githubusercontent.com/vn-quant/project_employee/R/Employee.csv") %>%
  clean_names()
# a quick look through the data to see if any features are in the wrong format
employee %>% glimpse()
Rows: 4,653
Columns: 9
$ education <chr> "Bachelors", "Bachelors", "Bachelors", "M…
$ joining_year <dbl> 2017, 2013, 2014, 2016, 2017, 2016, 2015,…
$ city <chr> "Bangalore", "Pune", "New Delhi", "Bangal…
$ payment_tier <dbl> 3, 1, 3, 3, 3, 3, 3, 3, 3, 2, 3, 3, 3, 3,…
$ age <dbl> 34, 28, 38, 27, 24, 22, 38, 34, 23, 37, 2…
$ gender <chr> "Male", "Female", "Female", "Male", "Male…
$ ever_benched <chr> "No", "No", "No", "No", "Yes", "No", "No"…
$ experience_in_current_domain <dbl> 0, 3, 2, 5, 2, 0, 0, 2, 1, 2, 5, 3, 5, 2,…
$ leave_or_not <dbl> 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0,…
employee %>% skimr::skim()
Name | Piped data |
Number of rows | 4653 |
Number of columns | 9 |
_______________________ | |
Column type frequency: | |
character | 4 |
numeric | 5 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
education | 0 | 1 | 3 | 9 | 0 | 3 | 0 |
city | 0 | 1 | 4 | 9 | 0 | 3 | 0 |
gender | 0 | 1 | 4 | 6 | 0 | 2 | 0 |
ever_benched | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
joining_year | 0 | 1 | 2015.06 | 1.86 | 2012 | 2013 | 2015 | 2017 | 2018 | ▆▃▅▃▇ |
payment_tier | 0 | 1 | 2.70 | 0.56 | 1 | 3 | 3 | 3 | 3 | ▁▁▂▁▇ |
age | 0 | 1 | 29.39 | 4.83 | 22 | 26 | 28 | 32 | 41 | ▃▇▂▂▂ |
experience_in_current_domain | 0 | 1 | 2.91 | 1.56 | 0 | 2 | 3 | 4 | 7 | ▅▅▇▅▁ |
leave_or_not | 0 | 1 | 0.34 | 0.48 | 0 | 0 | 0 | 1 | 1 | ▇▁▁▁▅ |
employee %>%
  mutate(leave_or_not = factor(leave_or_not,
                               labels = c("stay", "leave")),
         joining_year = factor(joining_year),
         payment_tier = factor(payment_tier)) -> employee_df
As we can see, there are no missing values, but some columns are in the wrong format. For example, the target variable leave_or_not should be a factor instead of an integer.
My goal is to predict the employment decision based on employees' backgrounds, demographics, and other factors.
ggplot(employee_df) +
aes(x = leave_or_not, fill = leave_or_not) +
geom_bar() +
scale_fill_manual(values = c(`stay` = "#FBC89C", `leave` = "#F47378"),
                  name = "leave or not") +
stat_count(geom = "text",
colour = "white",
size = 5,
aes(label = after_stat(count)),
position = position_stack(vjust=0.5)) +
theme_minimal()+
theme(legend.position = "bottom",
axis.text.x = element_blank())+
labs(title ="Frequency Bar Plot for target variable",
x = "")
We can observe an imbalance in the dataset: roughly a third of employees leave.
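To quantify the imbalance, a quick count (a small check, not part of the original script):
# class proportions; the skim output above puts the mean of leave_or_not at 0.34
employee_df %>%
  count(leave_or_not) %>%
  mutate(prop = n / sum(n))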
# Define the list of x-variables to loop through
<- c("education" ,"city","ever_benched" ,"joining_year","payment_tier" ,"gender","experience_in_current_domain")
vars
# Create an empty list to store the plots
<- list()
plots
# Loop through the x-variables and create plots
for (x in vars) {
  # Create the ggplot object with the current x-variable
  # (.data[[x]] maps the column named by the string x)
  plot <- ggplot(employee_df) +
    aes(x = .data[[x]], fill = leave_or_not) +
    geom_bar(position = "dodge") +
    labs(y = "",
         x = "",
         subtitle = gsub("_", " ", x)) +
    scale_fill_manual(values = c(`stay` = "#FBC89C", `leave` = "#F47378")) +
    stat_count(geom = "text",
               colour = "#191339",
               size = 4,
               aes(label = after_stat(count)),
               position = position_dodge(width = 0.9), vjust = -0.2) +
    theme_minimal() +
    theme(legend.position = "none")

  # Append the plot to the plot list
  plots[[x]] <- plot
}
plots[[8]] <-
  ggplot(employee_df) +
  aes(x = age, fill = leave_or_not) +
  geom_density(adjust = 1.5) +
  theme_minimal() +
  labs(y = "",
       x = "",
       subtitle = "age") +
  scale_fill_manual(values = c(`stay` = "#FBC89C", `leave` = "#F47378")) +
  theme(legend.position = "none")
plots[[1]] + plots[[2]] +
  plots[[3]] + plots[[4]] +
  plots[[5]] + plots[[6]] +
  plots[[7]] + plots[[8]] +
  plot_layout(ncol = 2) +
  plot_annotation(
    title = "Relationship between predictors and target variable",
    subtitle = glue('<span style = "color:#FBC89C">**stay**</span>',
                    ' or ',
                    '<span style = "color:#F47378">**leave**</span>'),
    theme = theme(plot.subtitle = ggtext::element_markdown(size = 15)))
Several of these features show clear differences between employees who stay and those who leave.
Modeling
My approach explores three distinct models, Random Forest (RF), XGBoost (XGB), and Support Vector Machine (SVM), plus a stacking ensemble that combines them, to construct a predictive model. The primary objective is to identify the influential features that drive the decisions of these models.
The stepwise process begins with data preparation, where I split the data into training and testing sets and create cross-validation resamples. Following this, I proceed to model training and evaluation, exploring the RF, XGB, SVM, and Stacking models individually. Each model's performance is assessed using ROC AUC and accuracy.
Due to the dataset’s class imbalance, the primary performance metric used for evaluation is the Receiver Operating Characteristic Area Under the Curve (ROC AUC), which offers a robust assessment of model performance. Additionally, accuracy is considered for its ease of interpretation, although it is secondary to ROC AUC in addressing class imbalance.
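To see why accuracy alone can mislead here, consider a trivial baseline (a sketch, not part of the original analysis):
# a model that always predicts the majority class ("stay") already scores
# about 66% accuracy on this data without learning anything
employee_df %>%
  mutate(.pred_class = factor("stay", levels = levels(leave_or_not))) %>%
  accuracy(truth = leave_or_not, estimate = .pred_class)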
By following the process above, I aim to develop a predictive model that not only performs well but also provides insight into the key features driving employees' decisions, and to draw conclusions and suggest some solutions.
Data preparation
set.seed(031023)
employee_split <- initial_split(employee_df, strata = leave_or_not, prop = 0.85)
employee_train <- training(employee_split)
employee_test <- testing(employee_split)

folds <- vfold_cv(employee_train, strata = leave_or_not, v = 5)
employee_rec <- recipe(leave_or_not ~ .,
                       data = employee_train) %>%
step_dummy(all_nominal(), -all_outcomes()) %>%
step_normalize(age) %>%
step_smote(leave_or_not)
After splitting the initial dataset into training and test sets and creating folds for cross-validation, the primary data preprocessing steps are: dummy-encoding the nominal variables, normalizing age, and applying SMOTE to address the class imbalance in the target variable.
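As a quick sanity check (a sketch, not in the original script), prepping the recipe and counting the classes should show that SMOTE has balanced the processed training data:
# step_smote() oversamples the minority class until the classes match
prep(employee_rec) %>%
  bake(new_data = NULL) %>%
  count(leave_or_not)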
Random Forest
# random forest model specification
rf_tune_spec <- rand_forest(
  mtry = tune(),
  trees = 500,
  min_n = tune()
) %>%
  set_mode("classification") %>%
  set_engine("ranger")
# random forest workflow
rf_tune_wf <- workflow() %>%
  add_recipe(employee_rec) %>%
add_model(rf_tune_spec)
For the random forest model, I tune two parameters (a sketch for inspecting their ranges follows this list):
- mtry: the number of variables randomly sampled as candidates at each split. A smaller value makes the model less prone to overfitting, while a larger value can improve accuracy.
- min_n: the minimum number of data points required in a node for it to be split further; it controls the depth and complexity of the individual trees.
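Because the upper bound of mtry depends on how many predictor columns the recipe produces, the tuning ranges can be inspected and finalized against the processed data. A minimal sketch, assuming the objects defined above:
# finalize() pins mtry's range to the number of dummy-encoded predictors
baked <- prep(employee_rec) %>%
  bake(new_data = NULL) %>%
  select(-leave_or_not)

rf_tune_wf %>%
  extract_parameter_set_dials() %>%
  finalize(baked)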
Tuning hyperparameters
set.seed(01031023)
rf_tune_result <- tune_grid(
  rf_tune_wf,
  resamples = folds,
grid = 20,
control = control_stack_grid()
)
rf_tune_result %>%
  collect_metrics() %>%
filter(.metric == "roc_auc") %>%
select(mean, min_n, mtry) %>%
pivot_longer(min_n:mtry,
values_to = "value",
names_to = "parameter"
) %>%
  ggplot(aes(value, mean, color = parameter)) +
geom_point(show.legend = FALSE) +
facet_wrap(~parameter, scales = "free_x") +
labs(x = NULL, y = "AUC")+
theme_bw()
This grid did not try every combination of min_n and mtry, but we can see that lower values of mtry (roughly 2 to 5) perform fairly well, while for min_n the promising region is above 10. I set the parameter ranges again, this time focusing more on min_n.
rf_grid <- grid_regular(
  mtry(range = c(2, 5)),
min_n(range = c(15, 50)),
levels = 5
)
rf_regular_res <- tune_grid(
  rf_tune_wf,
  resamples = folds,
grid = rf_grid
)
rf_regular_res %>%
  collect_metrics() %>%
filter(.metric == "roc_auc") %>%
mutate(min_n = factor(min_n)) %>%
ggplot(aes(mtry, mean, color = min_n)) +
geom_line(alpha = 0.5, linewidth = 1.5) +
geom_point() +
labs(y = "AUC") +
theme_bw()
Choosing the best model and results
final_rf <- finalize_model(
  rf_tune_spec,
  select_best(rf_regular_res, "roc_auc")
)
final_rf %>%
  set_engine("ranger", importance = "permutation") %>%
  fit(leave_or_not ~ .,
      data = juice(prep(employee_rec))
  ) %>%
  vip(geom = "col") +
theme_minimal()
The VIP plot suggests that joining year 2018 is the most important (dummy) feature. As could already be anticipated from the predictor-target plots, employees who joined in 2018 separate the classes quite clearly. This could be due to a one-off political or cultural event that pushed employees who joined in that year to quit, rather than a lasting trend, given the joining rates in other years; by contrast, the leave rate among 2017 joiners was quite low relative to retention. The next most important features are education (Masters) and gender (Male), both of which are plausible and understandable.
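For the exact numbers behind the plot, vip::vi() returns the importance scores as a tibble; a sketch refitting the same model:
# permutation importance as a table rather than a plot
rf_fit <- final_rf %>%
  set_engine("ranger", importance = "permutation") %>%
  fit(leave_or_not ~ ., data = juice(prep(employee_rec)))

vi(rf_fit)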
final_rf_res <- workflow() %>%
  add_recipe(employee_rec) %>%
add_model(final_rf) %>%
last_fit(employee_split)
final_rf_res %>%
  collect_predictions() %>%
conf_mat(truth = leave_or_not,
estimate = .pred_class)
Truth
Prediction stay leave
stay 420 79
leave 38 161
final_rf_res %>%
  collect_predictions() %>%
roc_curve(leave_or_not, .pred_stay) %>%
autoplot()
XGBoost
The same process is applied as for the other models.
xgb_tune_spec <- boost_tree(
  trees = 500,
  tree_depth = tune(),     ## controls the maximum depth of the decision trees
  min_n = tune(),          ## minimum sum of instance weight needed in a child
  loss_reduction = tune(), ## minimum loss reduction required to make a further partition on a leaf node
  sample_size = tune(),    ## fraction of training data randomly sampled for each boosting round
  mtry = tune(),           ## number of features (columns) sampled when building each tree
  learn_rate = tune()      ## step size at each iteration while moving towards a minimum of the loss function
) %>%
  set_engine("xgboost",
             objective = "binary:logistic") %>%
  set_mode("classification")
xgb_tune_wf <- workflow() %>%
  add_recipe(employee_rec) %>%
add_model(xgb_tune_spec)
Tuning hyperparameters
doParallel::registerDoParallel()
set.seed(02031023)
xgb_tune_result <- tune_grid(
  xgb_tune_wf,
  resamples = folds,
grid = 20,
control = control_stack_grid()
)
xgb_tune_result %>%
  collect_metrics() %>%
filter(.metric == "roc_auc") %>%
select(mean, mtry:sample_size) %>%
pivot_longer(mtry:sample_size,
values_to = "value",
names_to = "parameter"
) %>%
  ggplot(aes(value, mean, color = parameter)) +
geom_point(alpha = 0.8, show.legend = FALSE) +
facet_wrap(~parameter, scales = "free_x") +
labs(x = NULL, y = "roc_auc") +
theme_minimal()
Since tuning took a very long time to complete, I only tried a limited number of combinations that performed well.
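One way to cut tuning time (not used in this project) is racing, which drops clearly inferior candidates after a few resamples. A sketch with the finetune package; note that racing results are not drop-in candidates for the stacking step later, which needs full resampling results:
library(finetune)

set.seed(02031023)
xgb_race_result <- tune_race_anova(
  xgb_tune_wf,
  resamples = folds,
  grid = 20,
  control = control_race(save_pred = TRUE)
)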
Choosing the best model and SHAP
final_xgb <- finalize_model(
  xgb_tune_spec,
  select_best(xgb_tune_result, "roc_auc")
)
final_xgb
Boosted Tree Model Specification (classification)
Main Arguments:
mtry = 6
trees = 500
min_n = 12
tree_depth = 8
learn_rate = 0.0372029996176391
loss_reduction = 1.62354669613487
sample_size = 0.982596664283192
Engine-Specific Arguments:
objective = binary:logistic
Computational engine: xgboost
final_xgb_res <- workflow() %>%
  add_recipe(employee_rec) %>%
add_model(final_xgb) %>%
last_fit(employee_split)
final_xgb_res %>%
  collect_predictions() %>%
conf_mat(truth = leave_or_not,
estimate = .pred_class )
Truth
Prediction stay leave
stay 413 76
leave 45 164
final_xgb_res %>%
  collect_predictions() %>%
roc_curve(leave_or_not, .pred_stay) %>%
autoplot()
X <- prep(employee_rec, employee_train) %>%
  juice() %>%
select(-leave_or_not) %>%
as.matrix()
xg_mod <- workflow() %>%
  add_recipe(employee_rec) %>%
add_model(final_xgb) %>%
fit(employee_train) %>%
extract_fit_parsnip()
# Compute Shapley values
shap <- explain(xg_mod$fit, X = X, exact = TRUE)
xgb_tune_wf %>%
  finalize_workflow(select_best(xgb_tune_result, "roc_auc")) %>%
last_fit(employee_split) %>%
extract_fit_parsnip() -> xgb_fit
shap.prep <- shap.prep(
  xgb_model = extract_fit_engine(xgb_fit),
X_train = bake(prep(employee_rec),
has_role("predictor"),
new_data = NULL,
composition = "matrix"
)
)
shap.plot.summary(shap.prep)
- The y-axis lists the variables in order of importance from top to bottom; the number next to each name is its mean absolute SHAP value (the higher, the more important).
- The x-axis shows the SHAP value, i.e., the change in log-odds; from this number we can recover the predicted probability (see the sketch below).
- The gradient color indicates the original value of that variable.
- Each point represents a row from the original dataset.
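To make the log-odds-to-probability step concrete, here is a minimal sketch using xgboost's contribution predictions, assuming the X matrix and xg_mod fit from above:
# predcontrib = TRUE returns per-feature contributions plus a BIAS column;
# their row sum is the model's log-odds prediction for that row
contrib <- predict(xg_mod$fit, newdata = X, predcontrib = TRUE)
row1_logodds <- sum(contrib[1, ])
plogis(row1_logodds)  # inverse logit: probability of the modeled event class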
In general, the SHAP results agree with the VIP plot, but offer more detail.
SVM
svm_linear_tune_spec <- svm_linear(cost = tune(),
                                   margin = tune()) %>%
set_mode("classification") %>%
set_engine("kernlab")
svm_linear_tune_wf <- workflow() %>%
  add_recipe(employee_rec) %>%  # employee_rec already includes step_normalize(age)
  add_model(svm_linear_tune_spec)
svm_linear_result <- tune_grid(
  svm_linear_tune_wf,
  resamples = folds,
grid = 10,
control = control_stack_grid()
)
svm_linear_result %>%
  collect_metrics() %>%
filter(.metric == "roc_auc") %>%
select(mean, cost, margin) %>%
pivot_longer(cost:margin,
values_to = "value",
names_to = "parameter"
) %>%
  ggplot(aes(value, mean, color = parameter)) +
geom_point(show.legend = FALSE) +
facet_wrap(~parameter, scales = "free_x") +
labs(x = NULL, y = "AUC")+
theme_bw()
final_svm <- finalize_model(
  svm_linear_tune_spec,
  select_best(svm_linear_result, "roc_auc")
)
final_svm
Linear Support Vector Machine Model Specification (classification)
Main Arguments:
cost = 9.08581669060231
margin = 0.0203436492756009
Computational engine: kernlab
final_svm_res <- workflow() %>%
  add_recipe(employee_rec) %>%
  add_model(final_svm) %>%
  last_fit(employee_split)
final_svm_res %>%
  collect_predictions() %>%
conf_mat(truth = leave_or_not,
estimate = .pred_class )
Truth
Prediction stay leave
stay 414 75
leave 44 165
final_svm_res %>%
  collect_predictions() %>%
roc_curve(leave_or_not, .pred_stay) %>%
ggplot(aes(x = 1 - specificity, y = sensitivity)) +
geom_line(linewidth = 1.1, color = "midnightblue") +
geom_abline(
lty = 2, alpha = 0.5,
color = "gray50",
linewidth = 1
) +
  theme_bw()
Stacking
Stacking, short for “stacked generalization”, is an ensemble learning technique that combines multiple classification or regression models via a meta-classifier or meta-regressor. The base-level models are trained on the complete training set; the meta-model is then trained on the out-of-sample predictions of the base models as features. Stacking often achieves higher performance than any single model, and can even outperform other ensemble methods like bagging and boosting.
set.seed(1234)
stacking_md <- stacks() %>%
  add_candidates(rf_tune_result) %>%
add_candidates(xgb_tune_result) %>%
add_candidates(svm_linear_result) %>%
blend_predictions()
autoplot(stacking_md, type = "members") +
theme_bw()
autoplot(stacking_md, type = "weight") +
theme_bw()
stacking_fit <- stacking_md %>% fit_members()
stacking_fit
# A tibble: 3 × 3
member type weight
<chr> <chr> <dbl>
1 .pred_leave_rf_tune_result_1_04 rand_forest 2.95
2 .pred_leave_rf_tune_result_1_19 rand_forest 0.873
3 .pred_leave_rf_tune_result_1_10 rand_forest 0.00182
stacking_result <- employee_test %>%
  bind_cols(predict(stacking_fit, ., type = "prob")) %>%
  bind_cols(predict(stacking_fit, .))
stacking_result %>%
  conf_mat(truth = leave_or_not,
estimate = .pred_class )
Truth
Prediction stay leave
stay 443 91
leave 15 149
stacking_result %>%
  roc_curve(leave_or_not, .pred_stay) %>%
ggplot(aes(x = 1 - specificity, y = sensitivity)) +
geom_line(linewidth = 1.1, color = "midnightblue") +
geom_abline(
lty = 2, alpha = 0.5,
color = "gray50",
linewidth = 1
) +
  theme_minimal()
Conclusion
final_result <- bind_rows(
  final_rf_res %>% collect_predictions() %>% mutate(model = "RF"),
  final_xgb_res %>% collect_predictions() %>% mutate(model = "XGB"),
  final_svm_res %>% collect_predictions() %>% mutate(model = "SVM"),
  stacking_result %>% mutate(model = "Stacking")
) %>%
  select(.pred_stay, .pred_leave, .pred_class, leave_or_not, model)
final_result %>%
  group_by(model) %>%
roc_curve(leave_or_not, .pred_stay) %>%
autoplot() %>%
ggplotly() %>%
layout(title = list(
text = "Interactive ROC curve",
font = list(size = 14) # Adjust the font size as needed
))
Based on the ROC AUC plot, the performance of the four models is quite similar. Let's take a closer look at their confusion matrices and some metrics.
# set up metrics
<- c("accuracy","roc_auc","sens","spec","recall","precision")
metrics
conf_mat_plot <- function(data, model = "") {

  data %>%
    conf_mat(truth = leave_or_not,
             estimate = .pred_class) %>%
    autoplot(type = "heatmap") +
    labs(
      title = paste0("Confusion matrix ", model)
    ) -> cm_plot

  data %>%
    conf_mat(truth = leave_or_not,
             estimate = .pred_class) %>%
    summary() %>%
    rbind(roc_auc(data,
                  leave_or_not,
                  .pred_stay)) %>%
    filter(.metric %in% metrics) %>%
    mutate(.estimate = round(.estimate, 3)) %>%
    mutate(.metric = paste0(.metric, ": ", .estimate)) %>%
    select(.metric) %>%
    t() %>%
    tableGrob(rows = NULL) -> tbl

  return(grid.arrange(cm_plot, bottom = tbl))
}
conf_mat_plot(final_rf_res %>% collect_predictions(),"Random forest")
conf_mat_plot(final_xgb_res %>% collect_predictions(),"XGB")
conf_mat_plot(final_svm_res %>% collect_predictions(),"SVM linear")
conf_mat_plot(stacking_result,"Stacking")
Analysis
Random Forest (RF) exhibits a balanced performance with an accuracy of 0.832 and a ROC AUC of 0.855, indicating its ability to maintain a commendable trade-off between sensitivity (0.917) and specificity (0.671).
XGBoost (XGB) offers a performance closely aligned with RF, with a slightly lower sensitivity of 0.902 but marginally better specificity at 0.683. Its ROC AUC of 0.847 showcases its robust discriminative power.
Support Vector Machine (SVM) mirrors a performance akin to XGB with an accuracy of 0.83. Its specificity stands at 0.688, a tad higher than RF, while maintaining a sensitivity of 0.904.
Stacking leads in accuracy with 0.85 and exhibits an outstanding sensitivity of 0.969. However, this seems to come at the cost of specificity, which at 0.621, is the lowest among the models. Its ROC AUC of 0.852 suggests a strong overall performance but hints at its inclination towards predicting positive instances.
Selection
For applications prioritizing the detection of positive instances, the Stacking model emerges as the top choice due to its unparalleled sensitivity. However, its lower specificity indicates a potential for more false positives, which may be acceptable, or even preferable, in some situations.
For scenarios demanding a balanced outcome between true positives and true negatives, the Random Forest model stands out as the most versatile choice.
XGB and SVM serve as robust alternatives to RF, with closely matched performance, but their complexity makes training less efficient.
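To ground the comparison in one place, the per-model metrics can be pulled into a single table, a sketch using the final_result tibble assembled above:
# one row per model and metric; yardstick respects the grouping
bind_rows(
  final_result %>% group_by(model) %>% accuracy(leave_or_not, estimate = .pred_class),
  final_result %>% group_by(model) %>% roc_auc(leave_or_not, .pred_stay),
  final_result %>% group_by(model) %>% sens(leave_or_not, estimate = .pred_class),
  final_result %>% group_by(model) %>% spec(leave_or_not, estimate = .pred_class)
)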