Yoong Kang Lim

Simple random forests example

Note: I'm experimenting with a format for my data science blog posts. This post was imported from a Jupyter notebook I created recently, which explains the conversational style of the writing. Some tables were too wide to display here, so I've removed them to keep everything readable.

If you would like to download the Jupyter notebook, please click here for the GitHub repository.


This is a challenge from Kaggle that I wanted to experiment with. The goal is to apply Random Forests to a fairly structured dataset.

This is from a competition called “Bluebook for Bulldozers”. The description of the challenge is available at: https://www.kaggle.com/c/bluebook-for-bulldozers

I’m going to start by importing the data and just preparing it for training. I won’t look at it too much, just enough to see what kind of data I’m dealing with.

There is a folder called data which contains the Train.csv dataset.

Let’s start by importing all the libraries we need in Python, and set up some configuration.

%load_ext autoreload
%autoreload 2

%matplotlib inline

import pandas as pd
import numpy as np

Some data manipulation

With the pandas library imported, we can now read the CSV data to see what we’re dealing with.

data_raw = pd.read_csv('./data/Train.csv', low_memory=False)
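
A quick look at the shape tells us how many rows and columns we're dealing with:

data_raw.shape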

There are 53 columns, and I can't see all of them in pandas' default view. Let me transpose the first few rows to get a better look.

data_raw.head().transpose()
0 1 2 3 4
SalesID 1139246 1139248 1139249 1139251 1139253
SalePrice 66000 57000 10000 38500 11000
MachineID 999089 117657 434808 1026470 1057373
ModelID 3157 77 7009 332 17311
datasource 121 121 121 121 121
auctioneerID 3 3 3 3 3
YearMade 2004 1996 2001 2001 2007
MachineHoursCurrentMeter 68 4640 2838 3486 722
UsageBand Low Low High High Medium
saledate 11/16/2006 0:00 3/26/2004 0:00 2/26/2004 0:00 5/19/2011 0:00 7/23/2009 0:00
fiModelDesc 521D 950FII 226 PC120-6E S175
fiBaseModel 521 950 226 PC120 S175
fiSecondaryDesc D F NaN NaN NaN
fiModelSeries NaN II NaN -6E NaN
fiModelDescriptor NaN NaN NaN NaN NaN
ProductSize NaN Medium NaN Small NaN
fiProductClassDesc Wheel Loader - 110.0 to 120.0 Horsepower Wheel Loader - 150.0 to 175.0 Horsepower Skid Steer Loader - 1351.0 to 1601.0 Lb Operat... Hydraulic Excavator, Track - 12.0 to 14.0 Metr... Skid Steer Loader - 1601.0 to 1751.0 Lb Operat...
state Alabama North Carolina New York Texas New York
ProductGroup WL WL SSL TEX SSL
ProductGroupDesc Wheel Loader Wheel Loader Skid Steer Loaders Track Excavators Skid Steer Loaders
Drive_System NaN NaN NaN NaN NaN
Enclosure EROPS w AC EROPS w AC OROPS EROPS w AC EROPS
Forks None or Unspecified None or Unspecified None or Unspecified NaN None or Unspecified
Pad_Type NaN NaN NaN NaN NaN
Ride_Control None or Unspecified None or Unspecified NaN NaN NaN
Stick NaN NaN NaN NaN NaN
Transmission NaN NaN NaN NaN NaN
Turbocharged NaN NaN NaN NaN NaN
Blade_Extension NaN NaN NaN NaN NaN
Blade_Width NaN NaN NaN NaN NaN
Enclosure_Type NaN NaN NaN NaN NaN
Engine_Horsepower NaN NaN NaN NaN NaN
Hydraulics 2 Valve 2 Valve Auxiliary 2 Valve Auxiliary
Pushblock NaN NaN NaN NaN NaN
Ripper NaN NaN NaN NaN NaN
Scarifier NaN NaN NaN NaN NaN
Tip_Control NaN NaN NaN NaN NaN
Tire_Size None or Unspecified 23.5 NaN NaN NaN
Coupler None or Unspecified None or Unspecified None or Unspecified None or Unspecified None or Unspecified
Coupler_System NaN NaN None or Unspecified NaN None or Unspecified
Grouser_Tracks NaN NaN None or Unspecified NaN None or Unspecified
Hydraulics_Flow NaN NaN Standard NaN Standard
Track_Type NaN NaN NaN NaN NaN
Undercarriage_Pad_Width NaN NaN NaN NaN NaN
Stick_Length NaN NaN NaN NaN NaN
Thumb NaN NaN NaN NaN NaN
Pattern_Changer NaN NaN NaN NaN NaN
Grouser_Type NaN NaN NaN NaN NaN
Backhoe_Mounting NaN NaN NaN NaN NaN
Blade_Type NaN NaN NaN NaN NaN
Travel_Controls NaN NaN NaN NaN NaN
Differential_Type Standard Standard NaN NaN NaN
Steering_Controls Conventional Conventional NaN NaN NaN

There's a mixture of categorical and numerical data, plus a date column that I'll want to convert to a numerical format. There's also a large number of NaN values, which we'll need to fill.

Let's address a few things first. The Kaggle challenge specifies that the metric of interest is the Root Mean Squared Log Error (RMSLE). If we take the log of SalePrice up front, minimising ordinary RMSE on the transformed target is the same as minimising RMSLE on the original prices, so let's apply that transformation now.
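
For reference, here's what RMSLE looks like as code; a minimal sketch (the function name rmsle is my own):

def rmsle(y_true, y_pred):
    # RMSE of the log-transformed values; with prices this large, the +1
    # in Kaggle's usual log(1 + x) formulation barely matters
    return np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2))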

data_raw['SalePrice'] = np.log(data_raw['SalePrice'])
data_raw['SalePrice'].head()
0    11.097410
1    10.950807
2     9.210340
3    10.558414
4     9.305651
Name: SalePrice, dtype: float64

Next, we need to deal with the date column, saledate. This is what it looks like:

data_raw['saledate'].head()
0    11/16/2006 0:00
1     3/26/2004 0:00
2     2/26/2004 0:00
3     5/19/2011 0:00
4     7/23/2009 0:00
Name: saledate, dtype: object

It's probably a lot more useful to transform this into separate features: we could extract the year, month, day of month, day of week, an is_weekend flag, and so on. To keep things simple, I'll just pull out the year, month, and day of month:

saledates = pd.to_datetime(data_raw['saledate'])
data_raw['salesYear'] = saledates.dt.year
data_raw['salesMonth'] = saledates.dt.month
data_raw['salesDay'] = saledates.dt.day
data_raw = data_raw.drop('saledate', axis=1)
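
If we wanted the richer set of date features mentioned above, the .dt accessor covers those too. A sketch (these extra columns are hypothetical and aren't used in the rest of this post):

data_raw['salesDayOfWeek'] = saledates.dt.dayofweek  # Monday=0 .. Sunday=6
data_raw['salesIsWeekend'] = (saledates.dt.dayofweek >= 5).astype(int)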

Now, the remaining string columns are "categorical" data. Pandas actually has a Categorical dtype, but read_csv doesn't use it by default (probably for performance reasons).

We don't actually need to convert these columns to Categorical, though. Instead, we'll use pandas' get_dummies() function to one-hot encode them.
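
To see what get_dummies() does, here's a toy example with made-up data:

toy = pd.DataFrame({'state': ['Alabama', 'Texas', 'Alabama']})
pd.get_dummies(toy)
# produces one 0/1 indicator column per category:
# state_Alabama = [1, 0, 1], state_Texas = [0, 1, 0]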

data = pd.get_dummies(data_raw)
len(data.columns)
7725

Okay, that created 7725 columns. That's a lot of columns, and usually at this point I'd do some analysis to remove things that aren't relevant. But it doesn't stop us from training a model.

Now all the data is numerical, but there is still one thing left to do.

Remember we had a large number of NaN values. There are different ways to handle missing data, but one common approach is to fill each column's missing values with that column's median.

So let's go through all 7725 columns and fill in the NaNs. Pandas conveniently has a fillna method for precisely this.

# fill each column's NaNs with that column's median
for col in data.columns:
    median = data[col].median()
    data[col] = data[col].fillna(median)
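
As an aside, scikit-learn can do the same median fill with SimpleImputer (a sketch, assuming a reasonably recent scikit-learn; note it returns a plain numpy array, so we wrap it back into a DataFrame):

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='median')
data_filled = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)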

Learning

We are now ready to train a model. First, let's split our data into training and validation sets, holding out 25% of the rows for validation.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    data.drop('SalePrice', axis=1),
    data['SalePrice'],
    test_size=0.25,
    random_state=42,
)

We will be using an algorithm called Random Forests, a supervised learning method that builds an ensemble of decision trees and averages their predictions.

As this is a regression task, we will use RandomForestRegressor (for classification there's an analogous RandomForestClassifier).

Training will take a while…

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_jobs=-1)  # we'll leave everything else as default
model.fit(X_train, y_train)
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=-1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)
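
The repr above shows the defaults we trained with, notably just 10 trees (n_estimators=10) in this version of scikit-learn. For reference, a few knobs we could turn later (a sketch; the values are illustrative, not tuned):

model_tuned = RandomForestRegressor(
    n_estimators=100,     # more trees: slower, but usually more accurate
    max_features='sqrt',  # number of features considered at each split
    min_samples_leaf=3,   # larger leaves regularise individual trees
    n_jobs=-1,            # use all CPU cores
)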

Evaluation

Now that it's been trained, we can see how it does on the training and test sets. (Note that score() for a regressor returns the R² coefficient of determination, not the competition metric.)

model.score(X_train, y_train)
0.981710492948687
model.score(X_test, y_test)
0.8970776394527499

That's actually pretty good. However, there's a noticeable gap between the training and test scores (even though both are high), which suggests some overfitting. We may want to address that later, but I'm going to leave it for now.
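
Since SalePrice is already log-transformed, we can also read off the competition metric directly: the RMSE on these log targets is the RMSLE. A quick sketch:

from sklearn.metrics import mean_squared_error

preds = model.predict(X_test)
print(np.sqrt(mean_squared_error(y_test, preds)))  # RMSLE, since targets are logged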

Let's save the model so we don't have to retrain it later.

# save model
import pickle

pkl_filename = "rf_model.pkl"
with open(pkl_filename, 'wb') as file:
    pickle.dump(model, file)
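
Loading it back later is the mirror image:

with open(pkl_filename, 'rb') as file:
    model = pickle.load(file)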

Solution

Finally, I'm going to import the validation and test sets provided by Kaggle. My predictions on Valid.csv determine my rank on the public leaderboard, while Test.csv determines my rank on the private leaderboard. Let me import both of them and apply the same processing.

EDIT 2018/07/27: I later realised I should have used the same medians as the training set.

We need to ensure that the one-hot encoded columns line up with the training set: categories that appeared in training but not here need to be added as all-zero columns, categories the model never saw need to be dropped, and the columns must end up in the same order the model was trained on.

data_raw = pd.read_csv('./data/Valid.csv')
saledates = pd.to_datetime(data_raw['saledate'])
data_raw['salesYear'] = saledates.dt.year
data_raw['salesMonth'] = saledates.dt.month
data_raw['salesDay'] = saledates.dt.day
data_raw = data_raw.drop('saledate', axis=1)
public_data_processed = pd.get_dummies(data_raw)
# get columns from training set
train_cols = X_test.columns
test_cols = public_data_processed.columns

# add columns that appeared in training but are missing here, as all zeros
missing_cols = set(train_cols) - set(test_cols)
for c in missing_cols:
    public_data_processed[c] = 0

# drop columns the model never saw during training
new_cols = set(test_cols) - set(train_cols)
public_data_processed = public_data_processed.drop(columns=list(new_cols))

# make sure the column order matches the training set
public_data_processed = public_data_processed[train_cols]

# fill NaNs with each column's median
# (as per the edit above, these should ideally be the training-set medians)
for col in public_data_processed.columns:
    median = public_data_processed[col].median()
    public_data_processed[col] = public_data_processed[col].fillna(median)
y_public = model.predict(public_data_processed)
results_public = public_data_processed[['SalesID']]
# np.exp undoes the earlier log transform, giving prices back in dollars
results_public = results_public.assign(SalePrice=lambda x: np.exp(y_public))
results_public.to_csv('public_leaderboard.csv', index=False)
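
As an aside, pandas can do that whole column alignment in one line with reindex, which adds missing columns (filled with 0), drops extras, and fixes the ordering all at once:

public_data_processed = public_data_processed.reindex(columns=train_cols, fill_value=0)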

Now for Test.csv:

data_raw = pd.read_csv('./data/Test.csv')
saledates = pd.to_datetime(data_raw['saledate'])
data_raw['salesYear'] = saledates.dt.year
data_raw['salesMonth'] = saledates.dt.month
data_raw['salesDay'] = saledates.dt.day
data_raw = data_raw.drop('saledate', axis=1)
private_data_processed = pd.get_dummies(data_raw)
# get columns from training set
train_cols = X_test.columns
test_cols = private_data_processed.columns

# add columns that appeared in training but are missing here, as all zeros
missing_cols = set(train_cols) - set(test_cols)
for c in missing_cols:
    private_data_processed[c] = 0

# drop columns the model never saw during training
new_cols = set(test_cols) - set(train_cols)
private_data_processed = private_data_processed.drop(columns=list(new_cols))

# make sure the column order matches the training set
private_data_processed = private_data_processed[train_cols]

# fill NaNs with medians (again, ideally the training-set medians)
for col in private_data_processed.columns:
    median = private_data_processed[col].median()
    private_data_processed[col] = private_data_processed[col].fillna(median)
y_private = model.predict(private_data_processed)
results_private = private_data_processed[['SalesID']]
results_private = results_private.assign(SalePrice=lambda x: np.exp(y_private))
results_private.to_csv('private_leaderboard.csv', index=False)
results_private
SalesID SalePrice
0 1227829 18395.146713
1 1227844 15191.654578
2 1227847 31871.396669
3 1227848 30287.580884
4 1227863 34139.328934
5 1227870 56269.391170
6 1227871 38709.354077
7 1227879 12981.014777
8 1227880 16303.186961
9 1227881 30573.677321
10 1227882 60933.443791
11 1227883 60933.443791
12 1227885 26422.966974
13 1227886 28224.450929
14 1227887 35967.989524
15 1227888 35245.408964
16 1227905 37699.340454
17 1227910 32661.315763
18 1227911 16908.258380
19 1227912 27506.666085
20 1227913 19033.734876
21 1227914 22056.797327
22 1227917 23281.862355
23 1227918 18736.099894
24 1227920 17017.761157
25 1227924 35787.706084
26 1227925 35756.682625
27 1227930 42459.478311
28 1227933 63311.029319
29 1227942 16976.752396
... ... ...
12427 6642319 54807.616745
12428 6642320 52210.226957
12429 6642322 52210.226957
12430 6642323 54807.616745
12431 6642328 37382.953774
12432 6642329 26608.177351
12433 6642330 50848.852496
12434 6642337 43175.704591
12435 6642338 41436.731748
12436 6642356 35259.022695
12437 6642357 37086.450679
12438 6642363 39520.199673
12439 6642391 39340.385196
12440 6642418 12367.845464
12441 6642433 19235.433486
12442 6642434 12875.871941
12443 6642710 32679.243981
12444 6642711 25210.949360
12445 6642946 19465.651981
12446 6643123 31606.186809
12447 6643158 27268.515129
12448 6643164 35750.750539
12449 6643167 37369.816191
12450 6643168 37369.816191
12451 6643170 44584.681651
12452 6643171 57945.646850
12453 6643173 37369.816191
12454 6643184 17323.017177
12455 6643186 44584.681651
12456 6643196 57945.646850

12457 rows × 2 columns