PyCaret
   5 min read    Rohit Pruthi

Author details

Rohit Pruthi is a Decision Scientist at R2DL Rolls Royce India.
This post is based on open data and is not specific to any company's real information.
Some outputs have been removed for compatibility with the markdown format.

References

  1. https://pycaret.readthedocs.io/en/latest/index.html
  2. https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset
  3. https://www.linkedin.com/learning/applied-ai-for-human-resources

Motivation - Why this work?

In late 2019 and early 2020, I started working with a people analytics team and saw first-hand how HR groups currently use analytics.

While most of that work was visualization, I grew increasingly interested in how advanced algorithms might be applied to HR data.

Furthermore, starting in April 2020, my new role exposed me to people management, and in no time matters of curiosity became matters of urgency!

Based on a LinkedIn Learning course and the open dataset released by IBM on Kaggle, I started exploring three key aspects of HR analytics, the first of which I cover and briefly present in this study.

Motivation - from a data professional’s perspective

I have also been exploring some of the low-code tools emerging around data science, and this work gave me an opportunity to look into one of them. I chose PyCaret, a Python library inspired by the caret package in R.

This is how PyCaret's creators describe it: “PyCaret is an open-source, low-code machine learning library in Python that automates machine learning workflows. It is an end-to-end machine learning and model management tool that speeds up the experiment cycle exponentially and makes you more productive.”

Read the documentation for more details (https://pycaret.readthedocs.io/en/latest/index.html).

Attrition prediction

Predicting attrition is a commonly explored problem in HR analytics, so I picked it to try out PyCaret. The work is based on a simulated dataset released by IBM on Kaggle a few years back.

1 Loading Data

Start by loading the libraries and the data:

## Suppress warnings in the output
import warnings
warnings.filterwarnings('ignore')

## Load relevant packages
import pandas as pd
import os

import numpy as np

## Pycaret classification library
from pycaret.classification import *

## Read data
attrition_data = pd.read_csv("archive/WA_Fn-UseC_-HR-Employee-Attrition.csv")

#print("Data Loaded:\n------------------------\n",attrition_data.dtypes)
#attrition_data.head()
attrition_data.columns
Index(['Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department',
       'DistanceFromHome', 'Education', 'EducationField', 'EmployeeCount',
       'EmployeeNumber', 'EnvironmentSatisfaction', 'Gender', 'HourlyRate',
       'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction',
       'MaritalStatus', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked',
       'Over18', 'OverTime', 'PercentSalaryHike', 'PerformanceRating',
       'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel',
       'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
       'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
       'YearsWithCurrManager'],
      dtype='object')
  • This is a rich data set with a wide variety of features.
  • While some parameters are straightforward to collect, organizations may struggle to gather or share data for the others, whether due to compliance constraints or process maturity.
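Before modeling, it is also worth checking the balance of the target variable; this dataset is known to be imbalanced, a point the setup step can address later. A minimal check:

## Share of employees who left vs. stayed
print(attrition_data['Attrition'].value_counts(normalize=True))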

2 Real data simulation

Split the data into seen and unseen portions before the next steps. Note that this is not the train-test split; that happens later, inside the setup command. This simulates the real scenario in which unseen data is scored after model development.

data = attrition_data.sample(frac=0.8, random_state=42)
data_unseen = attrition_data.drop(data.index)
data.reset_index(inplace=True, drop=True)
data_unseen.reset_index(inplace=True, drop=True)
print('Data for Modeling: ' + str(data.shape))
print('Unseen Data For Predictions: ' + str(data_unseen.shape))
Data for Modeling: (1176, 35)
Unseen Data For Predictions: (294, 35)

3 Setup transformation pipelines

PyCaret's setup function initializes a session, runs an inferential algorithm over the column data types, and builds the transformation pipeline.

The user can confirm whether the inferred data types make sense, whether there is missing data, and whether further transformations are needed; these can all be passed as arguments to the function, as we will explore in later steps.

This step removes a lot of manual preprocessing. Among the available options are imputation, transformation, data splitting (including stratification), imbalance handling, and even PCA feature generation; a sketch with a few of these switched on follows the basic call below.

attrition_experiment_01 = setup(data = data, target = 'Attrition', session_id=42)
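Those preprocessing options are enabled through extra setup arguments. A minimal sketch, assuming PyCaret 2.x parameter names (normalize, fix_imbalance, ignore_features); exact defaults depend on the installed version, and the ignored columns here are the constant/ID columns in this dataset:

## Illustrative, more explicit setup call
attrition_experiment_02 = setup(data = data, target = 'Attrition', session_id=42,
                                normalize = True,        ## scale numeric features
                                fix_imbalance = True,    ## oversample the minority class (SMOTE by default)
                                ignore_features = ['EmployeeNumber', 'EmployeeCount',
                                                   'Over18', 'StandardHours'])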

4 Model Iteration

This is where PyCaret's user-friendliness comes into play: fitting models and producing a results grid takes a single line of code. compare_models() provides a simple interface to run multiple models and see which works best for this data set, and the results grid can be sorted by different metrics.

best_model = compare_models(sort='F1')
print(best_model)
LinearDiscriminantAnalysis(n_components=None, priors=None, shrinkage=None,
                           solver='svd', store_covariance=False, tol=0.0001)

NOTE: PyCaret provides the below list of models to run; at the rate it is evolving and receiving contributions, more models should be added over time.

models()
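compare_models can do more than sort: it can return several candidates at once or be restricted to a subset of estimators. A short sketch, assuming the standard PyCaret estimator IDs ('lr', 'lda', 'rf'):

## Keep the top 3 models by F1 instead of only the best one
top3 = compare_models(sort='F1', n_select=3)

## Or compare only a few fast estimators
fast_best = compare_models(include=['lr', 'lda', 'rf'], sort='F1')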

5 Final model run & tune

Recreate the winning model from the previous step and then tune its hyperparameters.

lda = create_model('lda')
lda_tuned = tune_model(lda, optimize='F1')
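By default, tune_model runs a random search over a predefined grid. As a sketch, the search budget can be raised with n_iter, or an explicit grid supplied via custom_grid (the LDA values below are illustrative; shrinkage requires the lsqr or eigen solver):

## Larger random-search budget (default is 10 iterations)
lda_tuned = tune_model(lda, optimize='F1', n_iter=50)

## Or search an explicit grid of LDA settings
lda_tuned = tune_model(lda, optimize='F1',
                       custom_grid={'solver': ['lsqr', 'eigen'],
                                    'shrinkage': [None, 'auto', 0.1, 0.5]})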

6 Model post-processing

This is another area where PyCaret simplifies the process a lot. Without any detailed coding, I could:

  • plot model results for comparison,
  • get feature importance regardless of the model being used,
  • get AUC and PR curves as well as a confusion matrix.
plot_model(lda_tuned, plot = 'auc')

[Figure: AUC curves for the tuned LDA model]

plot_model(lda_tuned, plot = 'pr')

[Figure: precision-recall curve]

plot_model(lda_tuned, plot = 'confusion_matrix')

[Figure: confusion matrix]

plot_model(lda_tuned, plot = 'feature')

[Figure: feature importance plot]
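For interactive exploration, the same plots are also bundled behind a single call, and individual plots can be written to disk; a brief sketch using PyCaret's evaluate_model and the save flag of plot_model:

## Interactive widget with all available plots (notebook environments)
evaluate_model(lda_tuned)

## Write a plot to a png file in the working directory
plot_model(lda_tuned, plot = 'feature', save = True)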

7 Test data results

After tuning, we check the model's results on the hold-out test data.

predict_model(lda_tuned);

Note that this is not the last step: afterwards the model is finalized (that is, retrained on the whole data set for final deployment), saved, and then deployed, as sketched below.
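A sketch of those closing steps on this experiment, using PyCaret's finalize_model, predict_model, and save_model (the file name is illustrative):

## Retrain the tuned pipeline on the full modeling data
final_lda = finalize_model(lda_tuned)

## Score the truly unseen 20% held back in step 2
unseen_predictions = predict_model(final_lda, data=data_unseen)

## Persist the whole pipeline (preprocessing + model) to disk
save_model(final_lda, 'attrition_lda_final')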

8 Some interpretation

  • Work-life balance is one of the highest-contributing parameters.
  • Overtime also plays a large role.
  • Working as a sales representative appears to affect attrition as well,
  • followed by years since last promotion and years with the current manager.
import matplotlib.pyplot as plt

BarPlot_columns = ['OverTime', 'WorkLifeBalance', 'YearsInCurrentRole']

## Plot the share of leavers vs. stayers within each level of a feature
def Bar_plots(var):
    col = pd.crosstab(attrition_data[var], attrition_data.Attrition)
    ## Normalize each row so the bars show proportions rather than counts
    col.div(col.sum(1).astype(float), axis=0).plot(kind="bar", stacked=False, figsize=(8,4))
    plt.xticks(rotation=90)

for col in BarPlot_columns:
    Bar_plots(col)

[Figures: attrition proportions by OverTime, WorkLifeBalance, and YearsInCurrentRole]