
Regression and Classification with Housing Data

12 Jan 2019

Reading time ~10 minutes

Business Case Overview

You work for a real estate company interested in using data science to determine the best properties to buy and re-sell. Specifically, your company would like to identify the characteristics of residential houses that estimate the sale price and the cost-effectiveness of doing renovations.

There are three components to the project:

  1. Estimate the sale price of properties based on their “fixed” characteristics, such as neighborhood, lot size, number of stories, etc.
  2. Estimate the value of possible changes and renovations to properties from the variation in sale price not explained by the fixed characteristics. Your goal is to estimate the potential return on investment (and how much you should be willing to pay contractors) when making specific improvements to properties.
  3. Determine the features in the housing data that best predict “abnormal” sales (foreclosures, etc.).

Approach

See code here.

Other Notes

General structure for an ML project:

  1. Business Executive Summary / README file
  • Clearly establish the problem statement / project objective. Project objectives should be clear and verifiable (this will help to shortlist the predictor variables)
  • Define what your target variable will be (if it is a binary outcome, it is a classification problem; otherwise, regression. This determines which model you will use)
  • Data dictionary: a list of all the variables used, their data types and what they represent
  • Recommendations/Conclusions: clearly link the results back to the problem statement and provide clear recommendations on actions the business can take moving forward. Limitations of the model should also be clearly stated, alongside areas for further improvement
  2. Data preparation

2.1 Read in the required libraries:

  • Pandas for data manipulation / basic plotting (you will use this the most: concatenating data, doing your EDA, renaming columns, dropping columns, etc.)
  • Numpy for numerical/statistical calculations (e.g. generating random values for hyperparameter optimisation, or computing square roots, standard deviations, etc.)
  • Seaborn and Matplotlib pyplot for more customised plotting
  • sklearn.model_selection (train_test_split, KFold, cross_val_score)
  • sklearn.linear_model (LinearRegression, LogisticRegression, LassoCV, RidgeCV)
  • sklearn.preprocessing (StandardScaler)
  • sklearn.metrics (r2_score)
  • statsmodels.api (for OLS and summary stats)
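
A typical import cell for this kind of project might look like the following sketch (pick whichever of these you actually use):

```python
# Core data manipulation and plotting
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Modelling
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.linear_model import LinearRegression, LogisticRegression, LassoCV, RidgeCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score
import statsmodels.api as sm
```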

2.2 Loading the data:

  • store the path to the CSV file as a variable
  • read the CSV file into a dataframe using pandas
  • data = pd.read_csv(import_path, keep_default_na=False, na_values=['']) (this treats only empty cells as null values; a literal 'NA' entry is kept as the string 'NA' rather than converted to NaN. The distinction is that 'NA' means the feature is absent, whereas a truly empty cell could be any value that simply was not recorded. Note that this does not drop any rows at the reading-in stage; it only controls which entries are parsed as null, so dropping or imputing nulls is a separate, optional step)
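
A minimal sketch of the loading step, assuming a placeholder file path:

```python
# Placeholder path to the housing dataset
import_path = './datasets/housing.csv'

# Treat only empty cells as NaN; keep literal 'NA' strings as the value 'NA'
data = pd.read_csv(import_path, keep_default_na=False, na_values=[''])
```
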
  3. Exploratory Data Analysis

3.1 Pandas functions for exploring data

  • df.shape (shows you the number of rows and columns)
  • df.isnull().sum() (gives you the number of null values per column; divide by the total number of rows to get the percentage of nulls per variable)
  • df.describe() (gives you the summary stats - use it to scout out values that look suspiciously high or low in the min / max values)
  • df.info() (gives you more info on the column data types - cross-reference this with the data dictionary to check that all data is cast as the correct type)

Other functions:

  • df[col].unique() (gives you a list of all unique entries in a column. This is useful for spotting errant values like strings in float columns, misspellings, etc.)
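
A quick sketch of these checks, assuming the dataframe is called data and using a placeholder column name:

```python
# Quick structural checks on the dataframe
print(data.shape)                              # number of rows and columns
print(data.isnull().sum() / len(data) * 100)   # percentage of nulls per column
print(data.describe())                         # summary stats for numerical columns
data.info()                                    # dtypes and non-null counts per column

# Spot errant values in a specific column ('neighborhood' is a placeholder name)
print(data['neighborhood'].unique())
```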

3.2 Renaming columns: df.columns = [x.lower().replace(' ', '_') for x in df.columns] (this replaces spaces with underscores and standardises all column names to lower case). If possible, rename your columns so that they are intuitive and easy to call on later

3.3 Understand the data distribution and inter-data relationships better:

  • Plot a pairplot (to understand the relationship between the target and predictor variables - whether linear or not)
  • Plot histograms of all predictor variables to gauge normality and whether further transformations are required
  • Plot box plots of predictor variables to spot outliers and see the general distribution of the data
  • Plot scatter plots (not strictly necessary if you already have the pair plot and box plots - but can also be used to identify outliers)
  • Important to drop outlier values before running the correlations so that the correlations are not skewed
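
A minimal plotting sketch, assuming the dataframe is called data; the column names (saleprice, lotarea, grlivarea) are placeholders for the target and two candidate predictors:

```python
# Pairplot of the target against a handful of candidate predictors
sns.pairplot(data[['saleprice', 'lotarea', 'grlivarea']])

# Histograms to check the normality of the predictors
data[['lotarea', 'grlivarea']].hist(bins=30, figsize=(10, 4))

# Box plot to spot outliers in a single predictor
sns.boxplot(x=data['grlivarea'])
plt.show()
```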

3.4 Data Cleaning

  • Split the dataframe into numerical and categorical variables if necessary
  • Deal with the null values (you could drop them if they are few in number, or impute them with the median value)
  • You could drop the entire column if you deem the variable non-essential or if there are too many null values
  • Deal with errant values (drop inconsistent data types, correct spelling errors, etc.)
  • Remove outliers (manually, or if the data is normally distributed, based on a chosen number of standard deviations from the mean)
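
A sketch of a few of these cleaning steps; the column names are placeholders, and the 3-standard-deviation cut-off assumes a roughly normal distribution:

```python
# Impute nulls in a numerical column with the median
data['lotfrontage'] = data['lotfrontage'].fillna(data['lotfrontage'].median())

# Drop a column with too many nulls or little predictive value
data = data.drop(columns=['poolqc'])

# Drop outliers more than 3 standard deviations from the mean
# (only sensible for roughly normally distributed variables)
mean, std = data['grlivarea'].mean(), data['grlivarea'].std()
data = data[(data['grlivarea'] - mean).abs() <= 3 * std]
```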

3.5 Understand inter-data relationships better:

  • Run correlations on the numerical variables (shortlist the top regressors against the target using .sort_values())
  • Plot a heat map showing the strength of the correlations (initial shortlist of variables highly correlated with the target; eyeball predictor variables with high interdependence)
  • Shortlist the top correlated numerical variables
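
A sketch of the correlation step, with saleprice as a placeholder target name:

```python
# Correlations among the numerical variables only
num = data.select_dtypes(include=np.number)

# Rank predictors by their correlation with the target
print(num.corr()['saleprice'].sort_values(ascending=False).head(10))

# Heatmap of the correlation matrix to eyeball interdependent predictors
plt.figure(figsize=(10, 8))
sns.heatmap(num.corr(), cmap='coolwarm', center=0)
plt.show()
```
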
  4. Feature engineering / Data transformation

4.1 Categorical data

  • Split into ordinal, nominal and binary variables
  • For ordinal variables, write a data dictionary assigning values and map it to each ordinal variable (make sure the ordinal scale is encoded in a way that makes interpretation intuitive)
  • For nominal variables, consider mean encoding, otherwise just use one-hot encoding
  • For binary variables, use one-hot encoding, but be careful to understand which level is used as the baseline, as this affects the interpretation of the coefficient
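
A sketch of both encodings; the column names and the quality scale are placeholders:

```python
# Ordinal variable: map quality ratings onto an increasing numerical scale
quality_map = {'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}
data['kitchenqual'] = data['kitchenqual'].map(quality_map)

# Nominal/binary variables: one-hot encode, dropping one level as the baseline
data = pd.get_dummies(data, columns=['neighborhood', 'centralair'], drop_first=True)
```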

Route back to EDA here (you could run chi-square / ANOVA tests to shortlist the categorical variables most strongly associated with the target; if all variables are numerical you can just use .corr())

4.2 Explore polynomial transformations
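
One way to explore this is sklearn's PolynomialFeatures; the column names below are placeholders:

```python
from sklearn.preprocessing import PolynomialFeatures

# Generate squared and interaction terms for a couple of shortlisted predictors
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(data[['grlivarea', 'overallqual']])

# Columns: grlivarea, overallqual, grlivarea^2, grlivarea*overallqual, overallqual^2
poly_df = pd.DataFrame(poly_features, index=data.index)
```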

4.3 Other interactions

  • Aggregation (for variables that measure very similar things, e.g. basement square feet and total floor square feet, you could just sum them into one column)
  • Multiplication (interaction terms between related dummies, e.g. presence of a garage and presence of a car park)

4.4 Other transformations

  • Normalise skewed data, e.g. by taking a log transform (but remember to reverse the transformation when interpreting the coefficients - for a log transform, by exponentiating)
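
A sketch of the log transform and its reversal, with placeholder column names:

```python
# Log-transform a right-skewed target (saleprice is a placeholder column name)
data['saleprice_log'] = np.log1p(data['saleprice'])

# When interpreting predictions made on the log scale, reverse the transformation:
# predictions = np.expm1(predictions_log)
```
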
  5. Feature Selection

5.1 Top correlated numerical variables

5.2 Top correlated transformed categorical variables

  6. Model Evaluation

6.1 Train-test split

  • use sklearn's train_test_split to split the data into training and testing sets

6.2 Scaling

  • Scale the predictor variables; the target variable does not need to be scaled
  • Fit the StandardScaler on the training set and use it to transform the training data
  • Transform the test set using the scaler fitted on the training set (do not re-fit on the test data)
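
A sketch of the split-then-scale step, where X and y stand for the selected features (as a DataFrame) and the target:

```python
# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training set only, then apply the same transformation to the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```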

6.3 K-fold cross-validation

  • Report the individual k-fold scores and the overall mean R^2 score
  • Run the cross-validation on both the original unscaled dataset and the scaled training set (e.g. the 80% split)
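
A sketch of the cross-validation step, reusing the scaled training set from above:

```python
# 5-fold cross-validation of a baseline linear regression on the scaled training data
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X_train_scaled, y_train, cv=kf)
print(scores)          # individual fold R^2 scores
print(scores.mean())   # overall mean R^2
```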

6.4 Linear regression model fit (establish as the baseline model)

  • report training and test scores
  • test score should approximate the k-fold mean score
  • calculate RMSE
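
A sketch of the baseline fit and RMSE calculation:

```python
# Baseline linear regression
lr = LinearRegression()
lr.fit(X_train_scaled, y_train)
print(lr.score(X_train_scaled, y_train))   # training R^2
print(lr.score(X_test_scaled, y_test))     # test R^2, should approximate the k-fold mean

# RMSE on the test set
rmse = np.sqrt(np.mean((y_test - lr.predict(X_test_scaled)) ** 2))
print(rmse)
```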

6.5 OLS model fit (statsmodels)

  • report the adjusted R^2 score
  • F-statistic p-value (< 0.05 is good)
  • individual variables' t-test p-values
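
A sketch of the statsmodels fit that produces these summary statistics:

```python
# statsmodels OLS for the summary table: adjusted R^2, F-statistic and t-test p-values
X_train_const = sm.add_constant(X_train_scaled)
ols_model = sm.OLS(y_train, X_train_const).fit()
print(ols_model.summary())
```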

  7. Hyperparameter tuning

7.1 Lasso and ridge regression using grid search (to select the optimal alpha value)

  • Check the lasso and ridge coefficients and drop features whose coefficients are close to or exactly zero
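
A sketch using LassoCV and RidgeCV to search over alpha (the alpha grid is an arbitrary choice, and X is assumed to be the feature DataFrame):

```python
# Cross-validated search over alpha for lasso and ridge
alphas = np.logspace(-3, 2, 100)
lasso = LassoCV(alphas=alphas, cv=5).fit(X_train_scaled, y_train)
ridge = RidgeCV(alphas=alphas, cv=5).fit(X_train_scaled, y_train)
print(lasso.alpha_, ridge.alpha_)   # selected alphas

# Inspect the coefficients; lasso shrinks weak features to exactly zero,
# which is useful for deciding what to drop
coefs = pd.DataFrame({'feature': X.columns, 'lasso': lasso.coef_, 'ridge': ridge.coef_})
print(coefs)
```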

7.2 Use VIF to check for multicollinearity and recursively drop variables with a VIF above 5.0
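
A sketch of the VIF check using statsmodels, assuming the candidate features are in X_train_scaled and X is the corresponding DataFrame:

```python
from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF per predictor; recursively drop the worst offender until all values are below ~5
vif = pd.DataFrame({
    'feature': X.columns,
    'vif': [variance_inflation_factor(X_train_scaled, i)
            for i in range(X_train_scaled.shape[1])]
})
print(vif.sort_values('vif', ascending=False))
```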

7.3 Drop variables with high t-test p-values

7.4 Explore sklearn's recursive feature elimination (RFE)
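
A sketch of RFE with linear regression as the estimator (the number of features to keep is a placeholder):

```python
from sklearn.feature_selection import RFE

# Recursively eliminate features, ranking them by the linear regression coefficients
rfe = RFE(estimator=LinearRegression(), n_features_to_select=10)  # 10 is a placeholder
rfe.fit(X_train_scaled, y_train)
print(X.columns[rfe.support_])   # features retained
```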

  8. Final Model Selection

8.1 Model score comparison

  • Compare the results of all the various models and select the one with the highest test R^2 score. In the event of a tie, choose the simpler model (it is less computationally expensive)
  • Compare RMSE values as well

8.2 Validate model assumptions

  • The linear regression model rests on several assumptions that must be validated for the results to hold
  • Plot the residuals to check for homoscedasticity (i.e. the errors show no pattern against the predicted values) and confirm that the residuals are normally distributed
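
A sketch of the residual checks, reusing the fitted baseline model from above:

```python
# Residuals vs. predicted values: a patternless scatter around zero supports homoscedasticity
predicted = lr.predict(X_test_scaled)
residuals = y_test - predicted
plt.scatter(predicted, residuals, alpha=0.5)
plt.axhline(0, color='red')
plt.xlabel('Predicted sale price')
plt.ylabel('Residual')
plt.show()

# Histogram of the residuals to check that they are roughly normally distributed
plt.hist(residuals, bins=30)
plt.show()
```
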
  9. Generate predictions / results, recommendations and conclusions

9.1 Presenting the results

  • Clearly present the final selected features and the magnitude of their coefficients
  • Explain what the different metrics mean (e.g. the R^2 value and RMSE - what counts as big or small; you can also compare predicted to actual values as a percentage)
  • Explain the intuition behind why each variable was selected
  • Clearly state the limitations of the model
  • Clearly state recommendations moving forward based on the results of your predictions

Project 2 learnings:

  • Consider removing categorical variables with low variation (e.g. where the mode makes up more than 80% of the dataset), since these tend to add little information and can exhibit high multicollinearity
  • Use a log transform to 'normalise' the distribution of variables that exhibit skewness
  • Validate the assumptions the linear regression model requires (e.g. normality of errors, distribution of the predictor variables); always check that the residuals are normally distributed
  • Use diagrams (scatter plots for numerical variables, box plots for categorical variables, etc.) to get a sense of the data distribution
  • Aggregate similar columns (e.g. summing the total square footage for basement + first floor)
  • Explore other possible interactions (e.g. subsetting the top 10 correlated regressors and taking the square or square root of these columns)
  • List the top 10 qualities (along with their coefficients) > i.e. order the shortlisted features based on the magnitude of their coefficients
  • Doing a sense check by cross checking with the mean sale price of houses with specific attributes stated in your model (to see if house prices are indeed positively or negatively impacted by the features that you’ve selected)
  • Consider writing a simple function that calculates the % of null values per column instead of just summing the null values (see the sketch after this list)
  • Write a (utility) function that helps to identify problematic entries (by specifying all valid data types per column)
  • Using feature engineering / data transformation (to amalgamate similar categorical variables)
  • Important to differentiate between 'null' and 'NA' values. 'Null' values are empty values (meaning they could be any value), whereas 'NA' specifically means that there is nothing
  • For numerical variables (run VIF to weed out multicollinearity)
  • Important to understand how you are transforming your variables (e.g. your ordinal encoding should make intuitive sense, mapping poor to excellent onto increasing numerical values)
  • Understanding how your dummy variables are constructed is also important, since dummies are measured relative to the baseline (so if your baseline is the best category, everything else will have a negative coefficient relative to it, which may be counter-intuitive)
  • When reporting scores, remember to note what you are scoring against - the scaled or unscaled variables, and the training or test set
  • When dropping outliers based on standard deviation, remember that the 68-95-99.7 rule only applies to normally distributed variables. If you find that you are dropping a whole chunk of data at 3 standard deviations, your dataset is probably not normally distributed. It is important to define what an outlier means to you, and why, based on what assumptions
  • Ensure that the impact of your model is clear (e.g. rank your features and clearly state the coefficients). It is also good to clearly state the limitations of your model (what it can or cannot be used for)
  • Good to emphasise how much your model has improved over time (e.g. even if it’s just comparing it to the naive mean model)
  • Provide context for your metrics: how big or small is the RMSE value that you are showing? Either explain what RMSE means, or compare it against the RMSE of the baseline model
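
As a sketch, the null-percentage helper mentioned above might look like this (the function name is hypothetical):

```python
# Hypothetical helper: percentage of null values per column, sorted descending
def null_percentage(df):
    pct = df.isnull().sum() / len(df) * 100
    return pct[pct > 0].sort_values(ascending=False)

print(null_percentage(data))
```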


Tags: eda, regression, classification