DataraFlow Week 14: Regression Analysis
In data science, moving from raw data to a working machine learning model is rarely a straight line. It’s more like a careful assembly process that requires cleaning, reshaping, and rigorous testing before any real predictions can happen. This guide breaks down that process, using a Python-based analysis to show how missing data, categorical variables, and feature scaling come together to build solid regression models.
1. Data Preprocessing
Before we can even think about predicting outcomes, we have to make sure our data is actually usable. The analysis here focuses on three main pillars: filling in the gaps, translating categories for machines, and leveling the playing field for our features.
Handling Missing Data
Real-world data is messy and full of holes. In this analysis, we tackled a customer dataset that had missing entries using SimpleImputer from the sklearn library.
- For the Numbers: We used mean imputation for continuous variables like Age and Income. This just means filling in the blanks with the column average. It’s quick and keeps each column's mean unchanged, though it does shrink the variance a little, because every filled-in value sits exactly at the center of the distribution.
import numpy as np
from sklearn.impute import SimpleImputer
# Define the numeric columns to impute
X = ['Age', 'Income', 'Product_Rating']
# Initialize mean imputer
meanImputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# Apply transformation (each NaN becomes its column's mean)
df[X] = meanImputer.fit_transform(df[X])
- For the Categories: When it came to columns like City, we used mode imputation, filling in missing spots with the most common value. This ensures we’re using the most statistically likely category to plug the gaps.
Y = ['City']
# Initialize mode imputer
modeImputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
df[Y] = modeImputer.fit_transform(df[Y])
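To see what mean imputation does to a column, here is a minimal sketch on a made-up Age column (the values are illustrative, not taken from the customer dataset). It compares the mean and standard deviation of the observed values with those of the imputed column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy Age column with two missing entries (illustrative values only)
ages = np.array([[20.0], [30.0], [np.nan], [50.0], [np.nan], [60.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
filled = imputer.fit_transform(ages)

# Statistics of the values that were actually observed
observed = ages[~np.isnan(ages)]

print("mean before:", observed.mean(), "after:", filled.mean())
print("std  before:", observed.std(), "after:", filled.std())
```

The mean stays where it was, while the standard deviation drops, which is worth keeping in mind when a column has a lot of gaps.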
Categorical Encoding
Machine learning models don’t speak "text"—they only understand numbers. To bridge that gap, we used two key techniques:
- One-Hot Encoding: We applied this to nominal variables like City and Product_Type. Think of it as creating a checklist: instead of one column saying "Paris" or "London," you get separate columns for each city with a simple 1 or 0. This stops the model from assuming there’s some kind of order or ranking between cities that doesn't exist.
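import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
# One-hot encode the nominal columns by name; every other column passes through unchanged
categorical_cols = ['City', 'Product_Type']
ct = ColumnTransformer(transformers=[
    ('one_hot_encoder', OneHotEncoder(categories='auto', sparse_output=False), categorical_cols)
], remainder='passthrough')
# Drop the target first so the matrix holds features only
# (assumes Purchased is the target and the remaining columns are numeric)
X = np.array(ct.fit_transform(df.drop(columns=['Purchased'])), dtype=float)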
- Label Encoding: For simpler, binary choices like Purchased (Yes/No), we used LabelEncoder to turn them into straightforward 0s and 1s.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# Encode the Purchased column itself; classes are sorted, so No -> 0 and Yes -> 1
Y = le.fit_transform(df['Purchased'])
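As a quick illustration of what the two encoders produce, here is a small sketch on made-up values (the city names and Yes/No answers are placeholders, not rows from the dataset). It uses the encoder's default sparse output and converts it with `.toarray()` so the result is easy to print.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# Made-up nominal column: one row per customer
cities = np.array([['Paris'], ['London'], ['Paris'], ['Berlin']])

encoder = OneHotEncoder()
one_hot = encoder.fit_transform(cities).toarray()

# Categories are sorted alphabetically, so the columns are Berlin, London, Paris
print(encoder.categories_[0])
print(one_hot)

# Made-up binary column
purchased = ['Yes', 'No', 'No', 'Yes']
labels = LabelEncoder().fit_transform(purchased)
print(labels)
```

Each city gets its own 0/1 column, while the binary column collapses to a single 0/1 vector.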
Feature Scaling
Imagine trying to compare Annual_Salary (which might be in the tens of thousands) with Age (which is usually under 100). The huge difference in scale can distort any algorithm that relies on distances or gradients, making it favor the larger numbers just because they’re bigger. To fix this, we used StandardScaler, which rescales each feature to zero mean and unit variance. Plain least-squares regression makes the same predictions either way, but scaling helps optimization algorithms run smoother and faster and puts the coefficients on a comparable footing.
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
numericalFeatures = ['Age', 'Annual_Salary', 'Years_Experience', 'Performance_Score']
# Fit on training data and transform both train and test sets
df_train[numericalFeatures] = sc_X.fit_transform(df_train[numericalFeatures])
df_test[numericalFeatures] = sc_X.transform(df_test[numericalFeatures])
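Under the hood, standardization is just z = (x - mean) / std, computed per column from the training data. The sketch below, on made-up Age and Annual_Salary values, checks that StandardScaler matches that formula and that the test row is scaled with the training statistics rather than its own.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up rows: columns are Age and Annual_Salary
train = np.array([[25.0, 40000.0], [35.0, 60000.0], [45.0, 80000.0]])
test = np.array([[30.0, 50000.0]])

# Fit on the training rows only
scaler = StandardScaler().fit(train)

# Manual z-scores; StandardScaler uses the population std (ddof=0)
train_mean = train.mean(axis=0)
train_std = train.std(axis=0)
manual_train = (train - train_mean) / train_std
manual_test = (test - train_mean) / train_std

print(scaler.transform(train))
print(scaler.transform(test))
```

Fitting on the training set alone keeps information from the test set out of the model, which is why the original snippet calls `fit_transform` on one and `transform` on the other.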
2. Predictive Modeling: Linear Regression
Once the data was clean, we moved on to the core task: regression. We looked at both simple cases and more complex, multi-variable problems.
Simple Linear Regression: Advertising vs. Sales
We started with a straightforward question: Can we predict Sales_Revenue based on Advertising_Spend? The data showed a very strong linear relationship.
from sklearn.linear_model import LinearRegression
# Train the model
regressor = LinearRegression()
regressor.fit(X_train, Y_train)
# Make predictions on the test set
Y_pred = regressor.predict(X_test)
- How Good Was It? The model scored an $R^2$ of 0.9984 on the test set, which means it explains about 99.8% of the variance in Sales_Revenue. In plain English, advertising spend is an almost perfect predictor of revenue in this dataset.
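from sklearn.metrics import r2_score, mean_squared_error
# Calculating metrics on the held-out test set
r2 = r2_score(Y_test, Y_pred)
mse = mean_squared_error(Y_test, Y_pred)
print(f"R-squared: {r2:.4f}")
# Output: R-squared: 0.9984
print(f"MSE: {mse:.2f}")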
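For readers who want to see where these numbers come from, here is a small sketch that computes $R^2$ and MSE by hand and compares them with the sklearn helpers. The four values are made up for the example; $R^2$ is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

# Made-up actual and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

# Residual sum of squares: error the model leaves unexplained
ss_res = np.sum((y_true - y_pred) ** 2)
# Total sum of squares: error of always predicting the mean
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
mse_manual = ss_res / len(y_true)

print(f"R-squared: {r2_manual:.4f}")
print(f"MSE: {mse_manual:.4f}")
```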
- In Practice: Using the regression equation, the model can forecast revenue for a given budget, for instance estimating what a $50,000 investment would yield. The forecast is most trustworthy for budgets inside the range of spend the model was trained on.
ad = np.array([50000]).reshape(1, -1)
sales = regressor.predict(ad)
# ravel() works whether predict returns a 1-D or a 2-D array
print('For an investment of $50000, the revenue is projected to be ${:.2f}'.format(sales.ravel()[0]))
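The "regression equation" here is simply revenue = intercept + slope × spend. As a rough sketch, the snippet below fits a model on synthetic, noise-free data (the 1000 and 3 are invented for the example) and shows that plugging a budget into the equation by hand gives the same answer as `predict`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data: revenue = 1000 + 3 * spend (made-up relationship)
spend = np.array([[10000.0], [20000.0], [30000.0], [40000.0]])
revenue = 1000.0 + 3.0 * spend.ravel()

model = LinearRegression().fit(spend, revenue)

# Fitted intercept (b0) and slope (b1)
b0, b1 = model.intercept_, model.coef_[0]

budget = 50000.0
by_hand = b0 + b1 * budget
by_model = model.predict(np.array([[budget]]))[0]

print(f"revenue = {b0:.2f} + {b1:.2f} * spend")
print(f"by hand: {by_hand:.2f}, by model: {by_model:.2f}")
```

Note that 50,000 lies outside the 10,000 to 40,000 range of this toy training data, so the forecast is an extrapolation and only as good as the assumption that the line keeps going.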
Multiple Linear Regression: Startup Profitability
Things got more interesting when we tried to predict startup profits. This time, we had multiple factors to consider: R&D Spend, Administration, Marketing Spend, and State (the startup's location).
Optimizing with Backward Elimination
Instead of just throwing every variable into the model, we used a technique called backward elimination to trim it down. We started with every predictor, then repeatedly dropped the one with the highest p-value above 0.05 and refit the model. Those tests showed that variables like Administration and State weren't significant predictors of profit; they were mostly adding noise.
Using statsmodels, we removed these weak links one by one. The result? An optimized model that zeroed in on the factor that really mattered: R&D Spend. For this specific dataset, R&D investment was the single strongest predictor of profit.
import numpy as np
import statsmodels.api as sm
# Add a column of ones for the constant (intercept), sized to the data instead of hard-coding 58 rows
X = np.append(arr=np.ones((X.shape[0], 1)), values=X, axis=1)
# Fit OLS model with all potential predictors
X_opt = X[:, [0,1,2,3,4,5,6]]
regressor_OLS = sm.OLS(endog=Y, exog=X_opt).fit()
# Check summary to identify high P-values
print(regressor_OLS.summary())
# Iteratively remove features with P > 0.05
# Example: Removing indices 1, 2, 5, 6 sequentially...
X_opt = X[:, [0,3,4]] # Final selection: intercept plus the surviving columns
regressor_OLS = sm.OLS(endog=Y, exog=X_opt).fit()
3. Real-World Application: Housing Price Prediction
Finally, we applied everything we learned to a housing market dataset.
The Pipeline: We built a full workflow that one-hot encoded neighborhoods, converted binary features like Pools and Garages, and scaled the numerical data.
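A workflow like this could be wired together with sklearn's Pipeline and ColumnTransformer. The sketch below is one possible arrangement under stated assumptions, not the original notebook: the six rows are invented, and the column names (Neighborhood, Pool, Garage, Property_Tax, Price) are placeholders based on the features mentioned in this section.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented rows; column names are assumptions for illustration
homes = pd.DataFrame({
    'Neighborhood': ['North', 'South', 'North', 'East', 'South', 'East'],
    'Pool': ['Yes', 'No', 'No', 'Yes', 'No', 'Yes'],
    'Garage': ['Yes', 'Yes', 'No', 'Yes', 'No', 'No'],
    'Property_Tax': [3000.0, 4200.0, 2400.0, 5000.0, 3600.0, 3200.0],
    'Price': [300000.0, 380000.0, 240000.0, 470000.0, 310000.0, 330000.0],
})

# Convert binary Yes/No features to 1/0
for col in ['Pool', 'Garage']:
    homes[col] = (homes[col] == 'Yes').astype(int)

# One-hot encode the neighborhood, scale the numeric column, pass the 0/1 flags through
preprocess = ColumnTransformer(transformers=[
    ('neighborhood', OneHotEncoder(handle_unknown='ignore'), ['Neighborhood']),
    ('numeric', StandardScaler(), ['Property_Tax']),
], remainder='passthrough')

pipeline = Pipeline(steps=[
    ('preprocess', preprocess),
    ('regressor', LinearRegression()),
])

features = homes.drop(columns=['Price'])
pipeline.fit(features, homes['Price'])
predictions = pipeline.predict(features)
print(predictions.round(0))
```

Bundling the steps this way means the encoder and scaler are fitted only on whatever data is passed to `fit`, so the same object can be applied to a test set without leaking its statistics.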
The Comparison: We tested two models against each other.
Model 1 (Full): Threw in every available feature.
Model 2 (Optimized): Only used the features that passed our statistical checks.
The Verdict: Both models performed exceptionally well, with $R^2$ scores of around 0.99. However, the optimized model was simpler and cleaner, relying on key drivers like Neighborhood, Pool, Garage, and Property Tax to make its predictions.
import matplotlib.pyplot as plt
# Comparing Full vs Optimized predictions
plt.figure(figsize=(14, 5))
# Scatter plot of Actual vs Predicted
plt.subplot(1, 2, 1)
plt.scatter(Y_test, Y_pred, color='blue', alpha=0.5, label='Model 1')
plt.scatter(Y_test, Y_opt_pred, color='red', alpha=0.5, label='Model 2')
# Dashed diagonal marks perfect predictions (predicted == actual)
plt.plot([Y_test.min(), Y_test.max()], [Y_test.min(), Y_test.max()], 'k--', lw=2)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Actual vs Predicted')
plt.legend()
plt.show()
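When two models land on nearly the same $R^2$, one common tie-breaker is adjusted $R^2$, which penalizes each extra feature. This was not part of the original comparison; the sketch below only illustrates the idea, and the sample size and feature counts are invented.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Hypothetical setup: both models reach R^2 = 0.99 on 100 samples,
# but the full model uses 12 features and the optimized one uses 4
full = adjusted_r2(0.99, n_samples=100, n_features=12)
slim = adjusted_r2(0.99, n_samples=100, n_features=4)

print(f"full model:      {full:.4f}")
print(f"optimized model: {slim:.4f}")
```

With the raw score tied, the model with fewer features comes out ahead, which puts a number on the "simpler and cleaner" argument.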