Exploring Roanoke Workers’ Residential Choices: A Data-Driven Analysis into Roanoke’s Metropolitan Living Choices
Random Forest Classifier
research
code
analysis
plot
Author
Heba Nusair
Published
November 20, 2023
Deciding where to live for Workers on-site (WOS) is a big decision for everyone, and for the people of Roanoke, it’s no different. We all have our own checklist when it comes to choosing our neighborhoods – some of us want to be close to work, others are looking for good schools for the kids, and some might prioritize a big backyard over everything else.
Factors that might influence the residential decision-making process for Workers on_Site
To get to the heart of what matters most to Roanoke’s residents, we turned to data. Using a detailed survey filled with personal insights from locals. the dataset has been modified and synthesized for educational purposes. They demonstrate various data analysis techniques. We analyzed a range of factors from work life to family plans, all to answer one question: What drives people's choices about where they live?
Armed with Python, a popular programming language, and a machine learning tool called Random Forest, we dug into the data. Think of Random Forest as a detective that examines all the evidence (or data) and identifies the usual suspects (or factors) that play a role in workers’ home-place choices.
The goal of using a Random Forest Classifier is to identify the factors that influence the residential preferences of Workers on-Site (WOS) in the Roanoke Metropolitan Area. Here are the specific objectives Random Forest helps to achieve:
Feature Importance: Random Forest can determine the relative importance of each factor (like distance to work, income level, family size) in predicting residential preferences. This insight can inform urban planning and real estate development.
Predictive Modeling: It can predict an individual’s preferred living area based on the features in the dataset, which could be useful for personalized recommendations or targeted marketing for real estate.
Handling Complexity: Random Forest is robust to complex interactions between features and can handle non-linear relationships without the need for transformation, making it well-suited for diverse and complex datasets.
Reducing Overfitting: Due to its ensemble nature (combining multiple decision trees), Random Forest is less prone to overfitting compared to individual decision trees, leading to more reliable predictions.
Versatility: It can handle both categorical and numerical data, making it versatile for datasets that contain a mix of different types of variables as is often the case in survey data.
Understanding Population Segments: By examining which features are most influential, stakeholders can understand different segments of the population better and tailor their services or policies accordingly.
In essence, Random Forest acts as a powerful analytical tool that turns a complex array of data points into actionable insights about what drives people’s choices on where to live.
After some thorough data cleaning to make sure we were working with accurate information, we let the Random Forest algorithm get to work. It looked at various details about people’s lives, including how far they live from work, their family size, and their income.
Code
import pandas as pdimport numpy as npfrom sklearn.model_selection import train_test_splitfrom sklearn.preprocessing import StandardScaler, OneHotEncoderfrom sklearn.compose import ColumnTransformerfrom sklearn.ensemble import RandomForestClassifier# Load the datasetdata = pd.read_excel('Classification (In_person_Workers).xlsx')# Replace non-breaking spaces in column namesdata.columns = data.columns.str.replace('\xa0', ' ', regex=True)# Correct any potential typos in column names# Ensure these column names exactly match those in DataFramedata.rename(columns={'living costs ': 'living costs'}, inplace=True)# Convert the first four columns into a single ordinal dependent variable# Ensure these column names exactly match those in the DataFramearea_columns = ['City Center (Central Business District)', 'Urban area', 'Suburban area', 'Rural Area']# For each row, it finds the column among the specified area_columns that has the minimum value # (which would be equivalent to the highest preference rank, assuming 1 is the most preferred and # larger numbers indicate lower preferences) and stores the name of this column (i.e., the living area # with the highest preference for that row) as the value in the new Preferred_Living_Location column.data['Preferred_Living_Location'] = data[area_columns].idxmin(axis=1)data.drop(columns=area_columns, inplace=True)# Handle missing values (NaNs) for both features and target variabledata.dropna(inplace=True)# Define the list of categorical and continuous features# Replace these with the actual column names from the DataFramecategorical_features = ['marital status', 'race', 'education level', 'firm age', 'firm size', 'owner or renter','number of workers in household', 'people in household', 'desired family size in future','monthly earning', 'industry sector'] # categorical features continuous_features = ['distance from work to homeplace'] # continuous features here# Preprocessing pipelinepreprocessor = ColumnTransformer( transformers=[ ('num', StandardScaler(), continuous_features), ('cat', OneHotEncoder(), categorical_features) ])# Preprocess the dataX = preprocessor.fit_transform(data.drop('Preferred_Living_Location', axis=1))y = data['Preferred_Living_Location'].astype(str) # Convert to string if categorical# Split the dataX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)# Train the Random Forest Classifierrf = RandomForestClassifier()rf.fit(X_train, y_train)# Get feature importancesfeature_names = continuous_features +list(preprocessor.named_transformers_['cat'].get_feature_names_out(categorical_features))importances = rf.feature_importances_sorted_indices = np.argsort(importances)[::-1]# Display feature importancesprint("Feature Importances:")for i in sorted_indices:print(f'{feature_names[i]}: {importances[i]}')
Feature Importances:
distance from work to homeplace: 0.09453283019109973
monthly earning_2.0: 0.022378250819534228
monthly earning_3.0: 0.021684316728467937
firm age_3.0: 0.021395730438756977
firm size_3.0: 0.02081121497280845
education level_5.0: 0.02018522077135225
number of workers in household_1.0: 0.020156467063806846
firm age_2.0: 0.01979195624999849
race_1.0: 0.019516626694370515
desired family size in future_5.0: 0.018957499645068498
firm size_2.0: 0.018897518943371695
people in household_4.0: 0.01817934687877516
marital status_2.0: 0.018016911846259354
owner or renter_1.0: 0.01768402911160392
desired family size in future_4.0: 0.017582239249672234
desired family size in future_3.0: 0.01742578171407909
firm size_4.0: 0.017088420714682175
people in household_5.0: 0.01699151201683179
education level_6.0: 0.01678791473759889
number of workers in household_2.0: 0.01673988262142541
monthly earning_1.0: 0.01670785972132809
people in household_3.0: 0.016507304013161896
education level_2.0: 0.016453235846924957
owner or renter_2.0: 0.016116383287129214
desired family size in future_2.0: 0.01607623306364935
firm age_4.0: 0.016015638840330568
marital status_1.0: 0.015992055116930633
industry sector_Manufacturing: 0.01595581212010698
firm age_5.0: 0.015090440662749047
number of workers in household_3.0: 0.01507204508923355
people in household_2.0: 0.014710947479305757
industry sector_Wholesale Trade: 0.014487374339946847
desired family size in future_1.0: 0.014404205227122506
firm size_5.0: 0.01430121103839377
firm size_1.0: 0.013427339389809563
industry sector_Retail Trade: 0.013285402498357048
industry sector_Professional, Scientific, and Technical Services: 0.012951961501941722
race_3.0: 0.012099863755579672
industry sector_Mining, Quarrying, and Oil and Gas Extraction: 0.011779577830475653
education level_3.0: 0.01130672983201177
industry sector_Management of Companies and Enterprises: 0.010906230885122057
race_2.0: 0.010800046526841703
industry sector_Information: 0.009890281401284413
industry sector_Transportation and Warehousing: 0.00982613446653056
education level_7.0: 0.009752643511913836
race_5.0: 0.009479340181712604
monthly earning_4.0: 0.009436346556728238
people in household_1.0: 0.009432082859209625
race_4.0: 0.009343855824641523
industry sector_Accommodation and Food Services: 0.00929571435873622
education level_8.0: 0.009207696672829283
industry sector_Educational Services: 0.009127878194233732
firm age_1.0: 0.008836934299482166
industry sector_Construction: 0.008319649079866083
marital status_3.0: 0.008166023420532957
education level_4.0: 0.008149798808340422
industry sector_Health Care and Social Assistance: 0.008097776236381415
race_8.0: 0.008097360971043424
marital status_5.0: 0.008052470677102627
industry sector_Finance and Insurance: 0.007940636076324123
marital status_6.0: 0.007929238289531331
marital status_4.0: 0.007094990496840496
race_6.0: 0.006779323327993779
industry sector_Other Services [except Public Administration: 0.006756798222843938
race_7.0: 0.00663881550980142
industry sector_Utilities: 0.006237721923848308
industry sector_Arts, Entertainment, and Recreation: 0.006119160629582487
industry sector_Agriculture, Forestry, Fishing and Hunting: 0.006069659287106633
education level_1.0: 0.005877755783132467
industry sector_Real Estate and Rental and Leasing: 0.004485166609639422
industry sector_Public Administration: 0.003717982726120862
industry sector_Administrative and Support and Waste Management and Remediation Services: 0.002589194120649625
General Interpretation:
The features with higher importance scores are more influential in determining the preferred living area. This doesn’t necessarily mean causation but indicates a stronger association.
The model suggests that practical considerations like distance to work, firm age, earnings, and household characteristics play a significant role in determining living area preferences.
Important Considerations:
The Random Forest model treats the problem as a classification task and does not inherently account for the ordinal nature of the dependent variable.
The results should be interpreted as indicative rather than definitive. For more precise modeling of ordinal outcomes, specific ordinal regression models would be more appropriate.
In summary, the model’s output provides a useful guide to understanding which factors are most influential in determining living area preferences, with distance from work appearing as the most significant factor according to this model.]
The results were clear: For workers who commute to work; the distance from workplace to homeplace stood out as the number one factor. This wasn’t too surprising – after all, who wants to spend hours in traffic when they could be home relaxing? Income was also a key factor. It seems the more we earn, the more choices we have about where we live.
But it’s not all about money and travel time. The data also showed us that personal factors, like whether someone is married or planning for a big family, have a big say in the decision too.
This isn’t just interesting information. It’s valuable knowledge for city planners and developers. By understanding what’s important to residents, they can plan and build communities that better meet people’s needs and make Roanoke an even better place to live.
So next time you’re driving through Roanoke’s neighborhoods, remember that each home you see represents a decision made based on a unique mix of practical needs and personal dreams.
A Visual Journey Through Roanoke’s Residential Decision Drivers
In our previous discussion, we delved into the factors that influence where WOS in Roanoke choose to live. We discovered that the daily commute and income levels are among the top considerations. But how do these factors stack up against each other? Let’s take a visual dive into the data.
Code
import matplotlib.pyplot as pltimport seaborn as sns# Assuming 'importances', 'feature_names' are already defined as in the previous code# Sorting the feature importancessorted_indices = np.argsort(importances)[::-1]sorted_importances = importances[sorted_indices]sorted_features = [feature_names[i] for i in sorted_indices]# Selecting a number of top features to display (you can adjust this number)n_top_features =20# for example, top 20 features# Creating the plotplt.figure(figsize=(6.5, 6.5))sns.barplot(x=sorted_importances[:n_top_features], y=sorted_features[:n_top_features])plt.title('Significance of factors in residential decision-making process')plt.xlabel('Level of importance')plt.ylabel('Factors')plt.show()
The accompanying chart lays out the findings from our Random Forest analysis in a colorful display of bars, each representing a factor’s weight in the decision-making process. At a glance, the lengthier bars catch the eye—these are the heavy hitters of home location preference.
Dominating the scene is ‘distance from work to homeplace,’ a beacon of red at the top, which resonates with anyone who’s ever been caught in the snarl of rush hour. This is followed closely by a cascade of oranges and yellows highlighting the importance of ‘monthly earnings’ across various brackets and the age and size of firms.
As we move down the chart, the colors shift to cooler greens and blues, representing factors like household size and marital status. These may not be the titans of the plot, but they still play a pivotal role in the collective narrative of where we live.
What’s truly intriguing is the story this visual tells us—beyond numbers and models, it’s about real-life choices and priorities. For instance, the significance of ‘people in household_4.0’ and ‘education level_5.0’ (a bachelor’s degree) speaks to the aspirations and day-to-day realities that shape our sense of place.
City planners and housing developers, take note: this chart is more than just a pretty picture. It’s a roadmap to the hearts and minds of Roanoke’s residents, guiding the way to communities that not only house but also support and enrich their lives.
As we continue to shape Roanoke’s urban landscape, let’s keep these visual cues in mind, ensuring that every development, from high-rises to townhomes, aligns with the genuine preferences of those who will call them home. After all, understanding these preferences is the key to building not just houses but homes where life’s stories can unfold.