Exploratory Data Analysis

The Future of Work and Cityscapes: An Exploratory Data Analysis

research
code
analysis
plot
Author

Heba Nusair

Published

November 3, 2023

Datasets Used in Our Analysis

Survey Overview:

Our analysis is based on datasets primarily sourced from a survey conducted in 2023. This survey, designed and executed by our research team, targeted participants from a wide range of occupations across the USA. Its focus was on examining the significant impact of remote work on urban development and residential choices. The participants provided their insights through a detailed questionnaire, which covered aspects such as working methods (remote or in-person) and their effects on housing location preferences. This rich data offers insights into behavioral patterns related to working modes, thereby illuminating the interplay between human activity and urban spatial dynamics.

Data Modification for Privacy and Educational Use:

It’s important to note that the datasets featured in these blog posts have been modified and synthesized for educational purposes. They demonstrate various data analysis techniques but are not suitable for real-world applications due to these modifications. In compliance with data privacy guidelines, we have anonymized the dataset to maintain the confidentiality of the survey participants. However, these modifications do not detract from the overall integrity and analytical value of the data.

Primary Objective of the Dataset:

The main aim of this dataset is to explore the dynamics of the job market and their influence on urban landscapes. One of the key areas of inquiry in the survey was the industry sector and the preferred working method of the participants. By analyzing these elements, we gain a better understanding of current trends in the job market and how they shape urban environments.

Key Elements of the Dataset:

In recent years, we’ve witnessed significant shifts in work methodologies. A growing number of professionals are embracing the Work From Home (WFH) approach, while a substantial segment continues to follow traditional Work On Site (WOS) practices. This study comprehensively covers 20 distinct industry sectors, focusing on four distinct categories of working methods. The details of these methods are outlined below:

# Working Method
1 Fully remote (working from home or another location)
2 Fully in-person (working at a physical office or location)
3 Hybrid, dominated by in-person work (spending the majority of your work time at a physical office or location, with some remote work)
4 Hybrid, dominated by remote work (spending the majority of your work time working from home or another location, with some in-person work)

Twenty “industry sectors”or jobs are addressed based on the NAICS (North American Industry Classification System) to the 2-digit industry level. these jobs are as the following:

The number of industry sectors in NAICS Description
1 sector 11 Agriculture, Forestry, Fishing and Hunting
2 sector 21 Mining, Quarrying, and Oil and Gas Extraction
3 sector 22 Utilities
4 sector 23 Construction
5 sector 31-33 Manufacturing
6 sector 42 Wholesale Trade
7 sector 44-45 Retail Trade
8 sector 48-49 Transportation and Warehousing
9 sector 51 Information
10 sector 52 Finance and Insurance
11 sector 53 Real Estate and Rental and Leasing
12 sector 54 Professional, Scientific, and Technical Services
13 sector 55 Management of Companies and Enterprises
14 sector 56 Administrative and Support and Waste Management and Remediation Services)
15 sector 61 Educational Services
16 sector 62 Health Care and Social Assistance
17 sector 71 Arts, Entertainment, and Recreation
18 sector 72 Accommodation and Food Services
19 sector 81 Other Services (except Public Administration)
20 sector 92 Public Administration (not covered in economic census)

Essentially, we want to see if the way people work is related to the industry they work in., Chi-Square Test of Independence is used to determine whether there is an association (relationship) between two categorical variables.

In our analysis, we’re testing two ideas:

  • The null hypothesis (H0): There’s no link between an employee’s industry sector and their working method. They’re independent of each other.

  • The alternative hypothesis (Ha): There is a significant link between the industry sector and the working method of employees.

Code
# Load necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import chi2_contingency

# Load dataset without specifying encoding
data = pd.read_excel('Industry sector and working method.xlsx')

# Clean the column names
data.columns = data.columns.str.strip()


# Summary statistics for Industry sector
industry_stats = data['Industry sector'].value_counts()

# Summary statistics for Working Method
method_stats = data['Working Method'].value_counts()

# Custom labels for Working Method, with line breaks
working_method_labels = [
    'Fully remote\n(working from home or another location)',
    'Fully in-person\n(working at a physical office or location)',
    'Hybrid, dominated by\nin-person work',
    'Hybrid, dominated by\nremote work'
]

# Creating a contingency table
contingency_table = pd.crosstab(data['Industry sector'], data['Working Method'])

# Performing the Chi-Square Test of Independence
chi2, p, dof, expected = chi2_contingency(contingency_table)

chi2, p, dof, expected
(163.68777281130448,
 3.007204248307361e-12,
 57,
 array([[ 5.84958872, 18.44183314, 10.40423032,  3.30434783],
        [ 4.00235018, 12.61809636,  7.1186839 ,  2.26086957],
        [ 6.00352526, 18.92714454, 10.67802585,  3.39130435],
        [ 8.1586369 , 25.72150411, 14.51116334,  4.60869565],
        [12.31492362, 38.82491187, 21.90364277,  6.95652174],
        [ 6.00352526, 18.92714454, 10.67802585,  3.39130435],
        [11.2373678 , 35.42773208, 19.98707403,  6.34782609],
        [ 7.54289072, 23.78025852, 13.4159812 ,  4.26086957],
        [ 9.8519389 , 31.05992949, 17.52291422,  5.56521739],
        [ 7.08108108, 22.32432432, 12.59459459,  4.        ],
        [ 3.84841363, 12.13278496,  6.84488837,  2.17391304],
        [ 8.00470035, 25.23619271, 14.2373678 ,  4.52173913],
        [ 7.38895417, 23.29494712, 13.14218566,  4.17391304],
        [ 4.00235018, 12.61809636,  7.1186839 ,  2.26086957],
        [ 8.46650999, 26.69212691, 15.05875441,  4.7826087 ],
        [ 4.92596945, 15.52996475,  8.76145711,  2.7826087 ],
        [ 3.54054054, 11.16216216,  6.2972973 ,  2.        ],
        [ 4.31022327, 13.58871915,  7.66627497,  2.43478261],
        [ 5.07990599, 16.01527615,  9.03525264,  2.86956522],
        [ 3.386604  , 10.67685076,  6.02350176,  1.91304348]]))

The output of the Chi-Square Test of Independence gives us a chi-square statistic of approximately 163.69 and a p-value of approximately 3.00e-12. The degree of freedom (dof) for this test is 57.

  • Chi-Square Statistic (χ²): The score of 163.69 is much higher than what we’d expect by chance, suggesting a strong relationship between work method and industry sector.

  • P-Value: At a value of approximately 3.00e-12, this is well below the standard threshold of 0.05, we would reject the null hypothesis (H0) and conclude that there is a statistically significant association between the ‘Industry sector’ and ’Working Method.’ This means the way people work is related to the industry they work in.

  • Degrees of Freedom (dof): With 57 degrees of freedom, our test has a wide variety of potential outcomes, reinforcing the robustness of our chi-square statistic.

  • Expected Frequencies: These numbers show us the distribution we’d expect if there were no relationship at all between the industry sector and the working method. But our actual data deviates from this, which backs up our finding of a significant association.

Also, the plot is designed to visually explore the relationship between the industry sectors of employees and their working methods. It illustrates how the distribution of working methods varies across different industry sectors, allowing us to identify notable patterns or differences. Such visualizations are crucial for understanding the dynamics of the job market, particularly in the context of evolving work methodologies.

Code
# Plotting with adjusted figure size and subplot parameters
plt.figure(figsize=(14, 8))  # Further increase the figure size

# Industry Sector plot
plt.subplot(1, 2, 1)
sns.countplot(data=data, x='Industry sector')
plt.title('Distribution of Industry Sectors')


# Plotting with horizontal bar chart
plt.figure(figsize=(6, 5))  # Adjust figure size for horizontal plot

# Working Method plot (horizontal)
sns.countplot(data=data, y='Working Method')
plt.title('Distribution of Working Methods')
plt.yticks(range(len(working_method_labels)), working_method_labels)  # Set y-axis labels

plt.tight_layout()
plt.show()

crosstab = pd.crosstab(data['Industry sector'], data['Working Method'])
crosstab.plot(kind='bar', stacked=True)

<Axes: xlabel='Industry sector'>

Additionally, the dataset contains several key elements that will be crucial for our upcoming statistical analyses. The most important elements include:

ELement # Description Data Type
1 Unique identifier for each response Numerical
2 Abbreviation for country names Categorical
3 State or region within the country Categorical
4 Gender identity of the respondent Categorical
5 Age range or specific age Ordinal
6 Marital status of the respondent Categorical
7 Race identification of the respondent Categorical
8 Highest level of education achieved Ordinal
9 Number of days working in person Numerical
10 Specific city or county of residence Categorical
11 Specific city or county of work Categorical
12 Age of the company Numerical
13 Number of employees or workers Numerical
14 Distance to work Numerical
15 Housing status Categorical
16 Number of employed household members Numerical
17 Household size Numerical
18 Desired family size Numerical
19 Income amount Numerical
20 Sector of employment Categorical
21 Position held at work Textual
22 Weekly working hours Numerical
23 Work mode (e.g., remote, in-person Categorical
24 Importance of physical interaction Ordinal
25 Mode of transportation Categorical