Predicting Work Trends Across Industries with Naive Bayes: Remote or On-Site

Probability theory and random variables, Naive Bayes classifier for multivariate Bernoulli models

research
code
analysis
plot
Author

Heba Nusair

Published

November 25, 2023

Introduction:

As we navigate through the evolving landscape of the 2023 job market, we are witnessing a significant shift in work methods. More people are embracing Work From Home (WFH) while others continue with Work On Site (WOS). This post examines how these work methods vary across different jobs in the USA.

My Approach:

For this analysis, I’ve utilized the Naive Bayes classifier for multivariate Bernoulli models. This method is well suited to sorting jobs into two main categories: WFH and WOS. It’s a practical choice for my study because it is designed for binary (0/1) features, such as an indicator for each industry sector, and it handles a two-class target like remote versus on-site work efficiently.
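For reference, the decision rule behind this classifier can be written as below. This is the standard multivariate Bernoulli formulation rather than anything specific to this dataset, and the symbols are introduced here for illustration: x_i in {0, 1} is the i-th binary feature (here, one indicator per industry sector), c is the class (WFH or WOS), and theta_{ic} = P(x_i = 1 | c).

```latex
\hat{c} \;=\; \arg\max_{c \,\in\, \{\mathrm{WFH},\, \mathrm{WOS}\}} \; P(c) \prod_{i=1}^{n} \theta_{ic}^{\,x_i} \, (1 - \theta_{ic})^{\,1 - x_i}
```

The "naive" part is the assumption that the features are conditionally independent given the class. Note that features equal to 0 also contribute, through the (1 - theta_{ic}) factor.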

About the Data:

The data comes from a 2023 survey of 850 participants from various occupations. Participants were asked about their industry sector and working method. To respect data confidentiality, we’ve modified and anonymized the dataset. This ensures compliance with data privacy guidelines without affecting the overall analysis and insights.

Distribution of work arrangements in 2023: a survey of 850 participants across various industries; 205 Remote workers versus 645 On-site Workers.
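The class split above already tells us what a trivial model can achieve. The short sketch below uses only the two counts from the caption (the variable names are mine) to compute the class priors and the accuracy of a baseline that always predicts the majority class.

```python
# Class counts reported in the survey summary above.
n_wfh = 205   # remote workers
n_wos = 645   # on-site workers
n_total = n_wfh + n_wos

prior_wfh = n_wfh / n_total
prior_wos = n_wos / n_total

# A majority-class baseline always predicts the most common label.
baseline_accuracy = max(prior_wfh, prior_wos)

print(f"P(WFH) = {prior_wfh:.3f}")
print(f"P(WOS) = {prior_wos:.3f}")
print(f"Majority-class baseline accuracy = {baseline_accuracy:.3f}")
```

Any classifier trained on this data should be judged against that baseline of roughly 76%, not against 50%.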

Industry Breakdown:

The study groups jobs into 20 main categories according to the North American Industry Classification System (NAICS), aligned with the 6-digit Standard Occupational Classification (SOC) system from the U.S. Bureau of Labor Statistics (BLS). Here’s a table of these industry sectors:

The occupation categories are based on the 6-digit SOC system from the U.S. BLS.

#    Industry Sector: Occupational Classification
1    Number of jobs in NAICS sector 11 (Agriculture, Forestry, Fishing and Hunting)
2    cns02 Number of jobs in NAICS sector 21 (Mining, Quarrying, and Oil and Gas Extraction)
3    cns03 Number of jobs in NAICS sector 22 (Utilities)
4    cns04 Number of jobs in NAICS sector 23 (Construction)
5    cns05 Number of jobs in NAICS sector 31‐33 (Manufacturing)
6    cns06 Number of jobs in NAICS sector 42 (Wholesale Trade)
7    cns07 Number of jobs in NAICS sector 44‐45 (Retail Trade)
8    cns08 Number of jobs in NAICS sector 48‐49 (Transportation and Warehousing)
9    cns09 Number of jobs in NAICS sector 51 (Information)
10   cns10 Number of jobs in NAICS sector 52 (Finance and Insurance)
11   cns11 Number of jobs in NAICS sector 53 (Real Estate and Rental and Leasing)
12   cns12 Number of jobs in NAICS sector 54 (Professional, Scientific, and Technical Services)
13   cns13 Number of jobs in NAICS sector 55 (Management of Companies and Enterprises)
14   cns14 Number of jobs in NAICS sector 56 (Administrative and Support and Waste Management and Remediation Services)
15   cns15 Number of jobs in NAICS sector 61 (Educational Services)
16   cns16 Number of jobs in NAICS sector 62 (Health Care and Social Assistance)
17   cns17 Number of jobs in NAICS sector 71 (Arts, Entertainment, and Recreation)
18   cns18 Number of jobs in NAICS sector 72 (Accommodation and Food Services)
19   cns19 Number of jobs in NAICS sector 81 (Other Services [except Public Administration])
20   cns20 Number of jobs in NAICS sector 92 (Public Administration)

Goal: My goal is to provide a clear, realistic picture of the changing work styles. I am not just analyzing numbers; I am uncovering the stories they tell about how people are adapting to new work environments, be it remotely or on-site.

Why I Chose the Naive Bayes Classifier for Multivariate Bernoulli Models:

In my exploration of remote and on-site work methods, the Naive Bayes classifier for multivariate Bernoulli models shines with its simplicity and effectiveness. It excels at evaluating probabilities, helping determine whether a job is more likely to be remote or on-site based on its industry sector. This algorithm’s strength lies in its straightforward approach, providing clear baseline assessments that guide our analysis across different industry sectors.

Dive Into the Code Behind the Analysis:

x: The industry sector (job categories 1 to 20)
y: The work method: 1 for WFH (Work From Home), 2 for WOS (Work On Site)
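To make the x and y encoding concrete, here is a minimal pure-Python sketch of how a one-hot-encoded sector feeds a Bernoulli Naive Bayes model. It is an illustration under stated assumptions, not the code used for the analysis: the helper names (`one_hot`, `fit_bernoulli_nb`, `predict_proba`) and the nine-row toy sample with 3 sectors are hypothetical, and the smoothing uses alpha = 1.

```python
import math

def one_hot(sector, n_sectors):
    """Return a 0/1 indicator list with a single 1 at position sector - 1."""
    return [1 if i == sector - 1 else 0 for i in range(n_sectors)]

def fit_bernoulli_nb(X, y, alpha=1.0):
    """Estimate class priors and smoothed P(x_i = 1 | class) from binary rows."""
    params = {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        n_c = len(rows)
        theta = [(sum(col) + alpha) / (n_c + 2 * alpha) for col in zip(*rows)]
        params[c] = (n_c / len(y), theta)
    return params

def predict_proba(params, x):
    """Posterior P(class | x) under the multivariate Bernoulli model."""
    log_joint = {}
    for c, (prior, theta) in params.items():
        log_joint[c] = math.log(prior) + sum(
            math.log(t) if xi else math.log(1.0 - t) for xi, t in zip(x, theta)
        )
    norm = math.log(sum(math.exp(v) for v in log_joint.values()))
    return {c: math.exp(v - norm) for c, v in log_joint.items()}

# Hypothetical toy sample of (sector, work method) pairs: 1 = WFH, 2 = WOS.
toy = [(1, 1), (1, 1), (1, 2), (2, 2), (2, 2), (2, 2), (3, 2), (3, 2), (3, 1)]
X = [one_hot(s, 3) for s, _ in toy]
y = [label for _, label in toy]

params = fit_bernoulli_nb(X, y)
for sector in (1, 2, 3):
    print(sector, predict_proba(params, one_hot(sector, 3)))
```

In this toy sample, sector 1 leans WFH and sectors 2 and 3 lean WOS, which is the kind of per-sector signal the real model has to work with.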

Code
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import classification_report, accuracy_score
from sklearn.preprocessing import OneHotEncoder

# Load the dataset
df = pd.read_csv('C:/Users/NUSAI/Desktop/Machine learning/HebaNu.github.io/HebaNu.github.io/HebaNu/posts/Probability theory and random variables/updated_POST1.csv')

# Drop rows where any cell is NaN in the 'Work Method' column
df = df.dropna(subset=['Work Method'])

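# One-hot encoding (the `sparse` argument was renamed to `sparse_output` in scikit-learn 1.2)
encoder = OneHotEncoder(sparse_output=False)
X = encoder.fit_transform(df[[' industry sector']])
# One binary column per sector: 20 categories from NAICS, aligned with the 6-digit SOC (U.S. BLS)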

# Target variable
y = df['Work Method']
# 1 for WFH [Work From Home], 2 for WOS [Work On Site]

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the Bernoulli Naive Bayes model
model = BernoulliNB()
model.fit(X_train, y_train)

# Make predictions and evaluate the model
y_pred = model.predict(X_test)



# Output the classification report and accuracy
print(classification_report(y_test, y_pred, zero_division=0))
print(f'Accuracy: {accuracy_score(y_test, y_pred)}')
              precision    recall  f1-score   support

           1       0.00      0.00      0.00        40
           2       0.76      1.00      0.87       130

    accuracy                           0.76       170
   macro avg       0.38      0.50      0.43       170
weighted avg       0.58      0.76      0.66       170

Accuracy: 0.7647058823529411
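The figures in the report can be reproduced by hand, which helps in reading them. The sketch below assumes, based on the recall of 0.00 for class 1 and 1.00 for class 2, that all 170 test rows were predicted as WOS; the dictionary and function names are mine.

```python
# Test-set counts read off the classification report:
# 40 WFH rows and 130 WOS rows, with every row predicted as WOS (assumed).
support = {"WFH": 40, "WOS": 130}
predicted_as = {"WFH": 0, "WOS": 170}   # how many rows received each label
correct = {"WFH": 0, "WOS": 130}        # true positives per class

def safe_div(num, den):
    """Return 0.0 on division by zero, mirroring zero_division=0."""
    return num / den if den else 0.0

metrics = {}
for label in support:
    precision = safe_div(correct[label], predicted_as[label])
    recall = safe_div(correct[label], support[label])
    f1 = safe_div(2 * precision * recall, precision + recall)
    metrics[label] = (precision, recall, f1)
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

n = sum(support.values())
accuracy = sum(correct.values()) / n
macro_recall = sum(m[1] for m in metrics.values()) / len(metrics)
print(f"accuracy={accuracy:.4f} macro recall={macro_recall:.2f}")
```

The accuracy of about 0.76 equals the share of WOS rows in the test set (130 of 170), while the macro-averaged recall of 0.50 shows that only one of the two classes is ever recognized.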

Confusion Matrix: Visualizes the correct and incorrect predictions compared to the actual values.

Code
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Assuming y_test and y_pred are already defined from BernoulliNB model
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt="d")
plt.title('Confusion Matrix for BernoulliNB Model')
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.show()

This confusion matrix from our BernoulliNB model reveals a sharp contrast between the two classes. The model labels all 130 Work On Site (WOS) test cases correctly, but it misclassifies all 40 Work From Home (WFH) cases as WOS. In other words, it predicts WOS for every test row, so the 76% accuracy simply reflects the share of WOS workers in the test set. This pattern indicates a strong bias toward the majority class and suggests that the model needs rebalancing, along with a closer look at the feature set and model parameters, before it can fairly represent both work types.
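One parameter worth trying is the class prior. The sketch below compares `BernoulliNB(fit_prior=True)`, which learns the priors from class frequencies, with `fit_prior=False`, which weights both classes equally. It runs on hypothetical, deliberately imbalanced toy counts (three sectors, not the survey data), so it only illustrates the mechanism; whether it helps on the real dataset would need to be checked.

```python
import pandas as pd
from sklearn.naive_bayes import BernoulliNB
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data (not the survey): sector -> (WFH count, WOS count).
counts = {"Information": (20, 40), "Construction": (10, 80), "Retail Trade": (10, 80)}
rows = []
for sector, (n_wfh, n_wos) in counts.items():
    rows += [(sector, 1)] * n_wfh + [(sector, 2)] * n_wos   # 1 = WFH, 2 = WOS
toy = pd.DataFrame(rows, columns=["sector", "work_method"])

encoder = OneHotEncoder()   # default sparse output is accepted by BernoulliNB
X = encoder.fit_transform(toy[["sector"]])
y = toy["work_method"]

# One row per sector, to see what each model predicts for it.
sectors = pd.DataFrame({"sector": list(counts)})
X_sectors = encoder.transform(sectors)

learned_prior = BernoulliNB(fit_prior=True).fit(X, y)    # priors from class frequencies
uniform_prior = BernoulliNB(fit_prior=False).fit(X, y)   # both classes weighted equally

pred_learned = learned_prior.predict(X_sectors)
pred_uniform = uniform_prior.predict(X_sectors)
for sector, a, b in zip(counts, pred_learned, pred_uniform):
    print(f"{sector:13s} learned prior -> {a}, uniform prior -> {b}")
```

On this toy data the learned-prior model labels every sector as WOS, while the uniform-prior model labels Information as WFH. The trade-off is that a uniform prior usually lowers overall accuracy in exchange for better recall on the minority class.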

ROC Curve: Plots the true positive rate against the false positive rate.

Code
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import RocCurveDisplay
import matplotlib.pyplot as plt

# Adjust y_test to have binary labels 0 and 1
y_test_binary = y_test.replace({1: 0, 2: 1})

# Get predicted probabilities for the positive class (e.g., class 2)
y_pred_prob = model.predict_proba(X_test)[:, 1]  # Index 1 for the probability of class 2

# Calculate the ROC curve
fpr, tpr, thresholds = roc_curve(y_test_binary, y_pred_prob)

# Calculate the AUC (Area Under Curve)
roc_auc = auc(fpr, tpr)

# Plot the ROC curve
disp = RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=roc_auc)
disp.plot()
plt.title('ROC Curve for BernoulliNB Model')
plt.show()

The ROC Curve visualizes the performance of the BernoulliNB model by plotting the true positive rate against the false positive rate at various threshold settings. The AUC, or Area Under the Curve, is 0.60. AUC summarizes how well the model’s predicted probabilities rank the two classes: an AUC of 1 represents a perfect model, while an AUC of 0.5 means the ranking is no better than random chance. With an AUC of 0.60, our model shows only a weak ability to differentiate between WFH and WOS, so there’s considerable room for improvement.
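One way to build intuition for these AUC values is the pairwise reading: AUC equals the fraction of (positive, negative) pairs in which the positive case gets the higher score. The sketch below uses a hypothetical helper, `pairwise_auc`, and made-up scores (not the model's output) to show the perfect, uninformative, and in-between cases. In the code above the positive class is WOS (label 2, recoded to 1).

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical scores, not taken from the survey model.
labels = [0, 0, 0, 1, 1, 1]
perfect = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]   # every positive outranks every negative
constant = [0.5] * 6                       # no ranking information at all
mixed = [0.2, 0.6, 0.4, 0.5, 0.3, 0.9]     # partial separation

auc_perfect = pairwise_auc(labels, perfect)
auc_constant = pairwise_auc(labels, constant)
auc_mixed = pairwise_auc(labels, mixed)
print(auc_perfect, auc_constant, round(auc_mixed, 3))
```

A model that relies on a single categorical feature can only produce as many distinct scores as there are categories, so many pairs end up tied, which tends to keep the AUC modest.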

From Data to Insights

We’re not just crunching numbers; we’re seeking stories that reflect real-life work experiences. Our goal is to provide a snapshot of how people are adjusting to new work realities, and we’ll keep refining our approach to get the full picture.

What’s Next?

We’ll continue to monitor these trends and bring you fresh insights. Stay tuned for more updates as we delve deeper into the world of work and what it means for all of us in these changing times.