Exploring the Relationship Between Income and Living Location Preference

A Comparative Analysis of Linear and Non-Linear Regression Models

research
code
analysis
plot
Author

Heba Nusair

Published

November 10, 2023

Introduction

Living in Roanoke: Does Your Income Decide Your Address? We’ve dived into the world of Roanoke’s remote workers to see how much their earnings influence where they prefer to live. Using linear regression, and then multinomial logistic regression, we model the link between income and living location preference in this vibrant metropolitan area. The map below shows the study area.

Mapping Roanoke’s Living Spaces: This map outlines the urban, suburban, and rural zones within the Roanoke Metropolitan Area, highlighting the diverse residential options. Source: Author’s own creation.

Our dataset contains responses from individuals detailing their income levels and their preferred living areas, from the bustling city center to tranquil rural landscapes. The relevant columns are City_center, Urban_area, Suburban_area, Rural_area, and Monthly_income. Our “relocating index” turns the location preferences into measurable data. The variables are processed as follows:

  • Income is converted into a continuous scale, assigning monetary values to income brackets.

  • Living location preference is converted into a single ordinal dependent variable, the relocating index, which runs from urban (1, city center) to rural (4, rural area).
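The two conversions can be illustrated on a toy table. This is a minimal sketch that assumes each area column holds a preference rank where 1 means most preferred (an assumption; the analysis code only shows that the lowest value in each row is selected). The three rows are made up for illustration, while the two mappings are the ones used in the analysis code.

```python
import pandas as pd

# Hypothetical responses: each area column holds a rank (assumed: 1 = most preferred)
toy = pd.DataFrame({
    "City_center":    [1, 4, 3],
    "Urban_area":     [2, 3, 4],
    "Suburban_area":  [3, 2, 1],
    "Rural_area":     [4, 1, 2],
    "Monthly_income": [1.0, 3.0, 4.0],  # income bracket codes
})

# Mappings taken from the analysis code
area_to_number = {"City_center": 1, "Urban_area": 2, "Suburban_area": 3, "Rural_area": 4}
income_mapping = {1.0: 625, 2.0: 2292, 3.0: 5000, 4.0: 6666}

# Relocating index: the column with the lowest value in each row, mapped to 1..4
area_cols = list(area_to_number)
toy["relocating_index"] = toy[area_cols].idxmin(axis=1).map(area_to_number)

# Continuous income: one monetary value per bracket code
toy["continuous_income"] = toy["Monthly_income"].map(income_mapping)

print(toy[["relocating_index", "continuous_income"]])
```

If the area columns were 0/1 indicators instead of ranks, `idxmin` would pick the first column that was not chosen, so the coding of these columns is worth checking before reusing this step.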

Insights from Linear Regression Analysis

We use linear regression to model the relationship between the two variables: income and the relocating index.

Code
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

# Load the dataset
data = pd.read_excel('Income_LivingLocationPrefrences.xlsx')

# Handle NaN values
data.dropna(subset=['City_center', 'Urban_area', 'Suburban_area', 'Rural_area', 'Monthly_income'], inplace=True)

# Convert the Location preferences into a single ordinal dependent variable
area_to_number = {'City_center': 1, 'Urban_area': 2, 'Suburban_area': 3, 'Rural_area': 4}
data['living_Location_preference'] = data[['City_center', 'Urban_area', 'Suburban_area', 'Rural_area']].idxmin(axis=1).map(area_to_number)

# Convert income to a continuous scale based on the provided income brackets
income_mapping = {1.0: 625, 2.0: 2292, 3.0: 5000, 4.0: 6666}
data['continuous_income'] = data['Monthly_income'].map(income_mapping)

# Scale the income feature
scaler = StandardScaler()
data['scaled_income'] = scaler.fit_transform(data[['continuous_income']])

# Prepare the features and target variable for modeling
X = data[['scaled_income']]
y = data['living_Location_preference']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on the test data
y_pred = model.predict(X_test)

# Evaluate the model's performance
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Output the performance metrics
print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')

# Debugging the sizes of arrays
print("Sizes of arrays for plotting:")
print("X_test['scaled_income']: ", len(X_test['scaled_income']))
print("y_test: ", len(y_test))
print("y_pred: ", len(y_pred))

# Plotting
plt.scatter(X_test['scaled_income'].values, y_test.values, color='black', label='Actual Data')
plt.scatter(X_test['scaled_income'].values, y_pred, color='red', label='Predicted Data', alpha=0.5)

# Optionally, create a more continuous line for predictions
sorted_order = np.argsort(X_test['scaled_income'].values)
plt.plot(X_test['scaled_income'].values[sorted_order], y_pred[sorted_order], color='blue', linewidth=2, label='Regression Line')

plt.xlabel('Monthly income (standardized)')
plt.ylabel('Relocating index (1 = city center, 4 = rural)')
plt.title('Income vs Living Location Preference Linear Regression')
plt.legend()
plt.show()
Mean Squared Error: 1.2658068009077692
R-squared: 0.006851215790502296
Sizes of arrays for plotting:
X_test['scaled_income']:  162
y_test:  162
y_pred:  162

In this plot, the actual data points (black dots) deviate substantially from the blue regression line (the model’s predictions). The red dots, the model’s predicted values, lie on that line and are far from many of the actual data points, indicating a mismatch between the model’s predictions and the real data.

Unveiling the Linear Regression Insights for WFH Workers

Our linear regression model yielded the following insights:

  • Mean Squared Error (MSE): The model showed an MSE of about 1.27, which corresponds to a root mean squared error of about 1.13. On an index that only runs from 1 to 4, that is a typical prediction error of more than one full category.

  • R-squared Value: We obtained an R-squared value of about 0.007, meaning income explains less than 1% of the variance in the relocating index on the test set. A linear model on income alone does not capture the complete picture for WFH employees.
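To put these two numbers in context, the following sketch converts the MSE into a root mean squared error and recovers the error of a baseline that always predicts the test-set mean. It relies only on the printed metrics and on the definition R² = 1 − MSE_model / MSE_baseline used by scikit-learn's `r2_score`; the variable names are illustrative.

```python
import math

# Metrics printed by the linear regression run above
mse = 1.2658068009077692
r2 = 0.006851215790502296

# Root mean squared error, in the same units as the 1-4 relocating index
rmse = math.sqrt(mse)

# MSE of a baseline that always predicts the mean of y_test
baseline_mse = mse / (1 - r2)

print(f"RMSE: {rmse:.3f}")
print(f"Baseline MSE (predict the mean): {baseline_mse:.3f}")
print(f"Error reduction over baseline: {100 * r2:.2f}%")
```

The model's error is almost identical to the baseline's, which is what an R-squared this close to zero means in practice.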

Transitioning to Non-Linear Models

The visual and numerical analysis led us to consider that the relationship between income and living area preferences might not be linear. This prompted a shift towards exploring non-linear patterns. We chose multinomial logistic regression for these key reasons:

  1. Appropriate for Categorical Outcomes: Our living location preference takes four discrete values (from city center to rural area). Multinomial logistic regression models the probability of each category directly instead of treating the codes 1 to 4 as equally spaced numbers, as linear regression does. It does not use the ordering of the categories; ordinal regression is the variant that does.

  2. Captures Non-Linear Patterns: The model estimates a separate income coefficient for each category, so the effect of income does not have to rise or fall steadily along the urban-to-rural index, which our initial analysis suggested.

  3. Enhanced Predictive Accuracy: A categorical model aligns more closely with our real-world data, potentially offering improved accuracy in predictions.

Implementing Multinomial Logistic Regression Analysis

In this analysis, combining the location preferences into a single variable produces data that is ordinal in nature. Here’s why:

  • The values assigned (1 for ‘City_center’, 2 for ‘Urban_area’, 3 for ‘Suburban_area’, and 4 for ‘Rural_area’) imply an order or ranking.

  • The order seems to represent a gradient from the most urbanized area (‘City_center’) to the least (‘Rural_area’).

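Given this structure, the variable living_area_preference would also suit ordinal regression, since it reflects a clear order in the data. The model fitted in this section is a multinomial logit (statsmodels MNLogit), which treats the four categories as unordered and uses the first one (1, City_center) as the reference category.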

Code
import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

# Load the dataset
data = pd.read_excel('Income_LivingLocation_GradientBoostingRegressor.xlsx')  
# Update the path to your file

# Combine the area preferences into a single ordinal variable
def get_preference_rank(row):
    if row['City_center'] == 1:
        return 1
    elif row['Urban_area'] == 1:
        return 2
    elif row['Suburban_area'] == 1:
        return 3
    elif row['Rural_area'] == 1:
        return 4

data['living_area_preference'] = data.apply(get_preference_rank, axis=1)

# Prepare the features and the target variable
X = data[['Monthly_income', 'Education_Level', 'House_owner_or_renter', 'number of employees in household']]
y = data['living_area_preference']

# Add a constant to the model (intercept)
X = sm.add_constant(X)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the multinomial logistic regression model (MNLogit treats the categories as unordered)
model = sm.MNLogit(y_train, X_train)
result = model.fit()

# Output the model summary
print(result.summary())
Optimization terminated successfully.
         Current function value: 1.336686
         Iterations 5
                            MNLogit Regression Results                            
==================================================================================
Dep. Variable:     living_area_preference   No. Observations:                  680
Model:                            MNLogit   Df Residuals:                      665
Method:                               MLE   Df Model:                           12
Date:                    Tue, 05 Dec 2023   Pseudo R-squ.:                 0.03553
Time:                            02:00:32   Log-Likelihood:                -908.95
converged:                           True   LL-Null:                       -942.43
Covariance Type:                nonrobust   LLR p-value:                 1.174e-09
====================================================================================================
        living_area_preference=2       coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------------------------
const                                2.5322      0.630      4.020      0.000       1.298       3.767
Monthly_income                      -0.6509      0.124     -5.230      0.000      -0.895      -0.407
Education_Level                      0.0909      0.055      1.642      0.101      -0.018       0.199
House_owner_or_renter               -0.6254      0.262     -2.388      0.017      -1.139      -0.112
number of employees in household    -0.2076      0.160     -1.296      0.195      -0.522       0.106
----------------------------------------------------------------------------------------------------
        living_area_preference=3       coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------------------------
const                                2.8972      0.633      4.575      0.000       1.656       4.138
Monthly_income                      -0.7957      0.126     -6.294      0.000      -1.043      -0.548
Education_Level                      0.0914      0.056      1.626      0.104      -0.019       0.202
House_owner_or_renter               -0.6095      0.264     -2.307      0.021      -1.127      -0.092
number of employees in household    -0.2320      0.163     -1.426      0.154      -0.551       0.087
----------------------------------------------------------------------------------------------------
        living_area_preference=4       coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------------------------
const                                0.9399      0.628      1.496      0.135      -0.292       2.171
Monthly_income                      -0.5330      0.124     -4.314      0.000      -0.775      -0.291
Education_Level                      0.1038      0.055      1.893      0.058      -0.004       0.211
House_owner_or_renter               -0.0429      0.241     -0.178      0.859      -0.516       0.430
number of employees in household     0.0543      0.160      0.341      0.733      -0.258       0.367
====================================================================================================
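The fit statistics in the summary above can be reproduced from the two log-likelihoods it reports. The sketch below uses only the printed values; the pseudo R-squared is McFadden's, 1 − LL / LL_null, and the likelihood-ratio statistic is 2 × (LL − LL_null), compared against a chi-squared distribution with Df Model = 12 degrees of freedom.

```python
# Values copied from the MNLogit summary
log_likelihood = -908.95  # fitted model
ll_null = -942.43         # intercept-only model
df_model = 12

# McFadden's pseudo R-squared
pseudo_r2 = 1 - log_likelihood / ll_null

# Likelihood-ratio statistic for the test behind the "LLR p-value"
llr_stat = 2 * (log_likelihood - ll_null)

print(f"Pseudo R-squared: {pseudo_r2:.5f}")
print(f"LLR statistic: {llr_stat:.2f} on {df_model} degrees of freedom")
```

The small difference from the printed 0.03553 comes from the log-likelihoods being rounded to two decimals in the summary.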

The Multinomial Logistic Regression analysis revealed some interesting insights about the relationship between various factors and living area preferences:

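  1. Model Convergence and Fit: The optimizer converged after 5 iterations, so the estimates are stable, although convergence alone says nothing about how well the model fits. The likelihood-ratio test (LLR p-value of 1.174e-09) shows that the predictors jointly improve on an intercept-only model. The Pseudo R-squared value of 0.03553 is modest: the model has some explanatory power, but other unaccounted factors likely play a significant role.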

  2. Significant Predictors:

    • Monthly Income: This was a significant predictor for all three reported categories (2, 3, and 4). Each set of coefficients compares one category with the reference category, city center (1). The negative coefficients (-0.6509, -0.7957, and -0.5330) indicate that as monthly income increases, the odds of preferring urban (2), suburban (3), or rural (4) areas over the city center decrease.

    • House Ownership Status: This variable was significant for the urban (p = 0.017) and suburban (p = 0.021) categories but not for the rural one (p = 0.859). The negative coefficients mean that a higher value of House_owner_or_renter lowers the odds of preferring urban or suburban areas over the city center; which group that corresponds to depends on how owners and renters are coded in the dataset.

  3. Other Factors:

    • Education Level: The coefficients were positive for all three categories, suggesting a higher likelihood of preferring urban, suburban, or rural areas over the city center with increased education level, but none was significant at the 5% level (p = 0.101, 0.104, and 0.058).

    • Number of Employees in Household: This factor did not show a strong influence on living area preference, as indicated by the high p-values (0.195, 0.154, and 0.733).

  4. Coefficient Interpretation: The coefficients for each predictor vary across living area preferences, reflecting the complex nature of these relationships. For instance, relative to the city center, the effect of monthly income is strongest for suburban areas (living_area_preference=3, coefficient -0.7957), followed by urban (-0.6509) and rural (-0.5330) areas.
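Logit coefficients are easier to read as odds ratios. The sketch below exponentiates the three Monthly_income coefficients from the summary table; each ratio compares one category with the reference category, city center (1), which MNLogit omits from the output. The dictionary labels are illustrative names, not identifiers from the dataset.

```python
import math

# Monthly_income coefficients copied from the MNLogit summary
income_coefs = {
    "urban (2)": -0.6509,
    "suburban (3)": -0.7957,
    "rural (4)": -0.5330,
}

# exp(coef) = multiplicative change in the odds of that category versus
# city center for a one-unit increase in Monthly_income
odds_ratios = {area: math.exp(coef) for area, coef in income_coefs.items()}

for area, ratio in odds_ratios.items():
    print(f"{area}: odds ratio vs city center = {ratio:.3f}")
```

The ratios come out at roughly 0.52, 0.45, and 0.59. Each is below 1, so a one-unit increase in Monthly_income lowers the odds of choosing that area over the city center, holding the other predictors fixed.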

In conclusion, our Multinomial Logistic Regression model sheds light on how factors like income and house ownership status significantly influence living area preferences among WFH workers. The nuanced differences in coefficients across different living areas underscore the complexity of these relationships.