Simple machine learning in R - Caret

In the last two posts I created some simple decision trees and tested their accuracy. Now it’s time to try some other models. As before I’m going to continue predicting the variable FiveHundredPlus with a limited set of factors to keep the processing pressures down. Once I’m a bit more confident I’ll move to the larger dataset and a more powerful machine. I’m going to use the package caret and recreate this post from Analytics Vidhya.

Full code saved on my github page here.

1 – Install the caret package

install.packages("caret", dependencies = c("Depends", "Suggests"))

This takes a very long time which is a bit annoying since I have to create a virtual machine and start from scratch every time I want to use R (my home PC is a bit rubbish). But you can plan around this fairly easily.

2 – Prepare the data

The initial part of this data preparation is the same as the previous post. The differences are around preparing the data into a form that is ready for machine learning.

#Converting every categorical variable to numerical using dummy variables
dmy <- dummyVars(" ~ .", data = Initialdata, fullRank = T)
Initialdata_transformed <- data.frame(predict(dmy, newdata = Initialdata))

#Checking the structure of the transformed file
str(Initialdata_transformed)

3 – Split the data

This method is better than the one I used in my previous post because the split keeps a similar proportion of the response variable in each partition.

#Splitting the data into two parts based on outcome: 75% and 25%
index <- createDataPartition(Initialdata_transformed$FiveHundredPlus, p=0.75, list=FALSE)
trainSet <- Initialdata_transformed[ index,]
testSet <- Initialdata_transformed[-index,]
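A quick way to check that claim about proportions (just a sanity check I'm adding here, not from the original post) is to compare the class balance in each partition:

#Check that the outcome is split in similar proportions in train and test
prop.table(table(trainSet$FiveHundredPlus))
prop.table(table(testSet$FiveHundredPlus))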

4 – Make some models

Now it’s time to make some simple models. In this post I’m going to make a gbm, neural net, glm and a random forest without any tuning. The first time I tried this I left the variables District and Town in, and that was a mistake. Between them these had over 1,200 levels, which meant that even on a month’s worth of data, and on what I thought was a powerful AWS machine, nothing had really happened after 6 hours. So the models below don’t include these variables.

#Factors for the model
predictors<-colnames(trainSet)[!(colnames(trainSet) %in% c("Price", "FiveHundredPlus"))]
#Try applying some models
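The exact calls aren’t shown here, but with caret the four untuned models would look something like this (a sketch only; the model names are mine, and I’m assuming FiveHundredPlus is stored as 0/1 so it needs converting to a factor for classification):

#Treat the outcome as a factor so caret fits classification models
trainSet$FiveHundredPlus <- as.factor(trainSet$FiveHundredPlus)
testSet$FiveHundredPlus <- as.factor(testSet$FiveHundredPlus)

#Four untuned models through caret's common train() interface
model_gbm <- train(trainSet[,predictors], trainSet$FiveHundredPlus, method = "gbm")
model_rf <- train(trainSet[,predictors], trainSet$FiveHundredPlus, method = "rf")
model_nnet <- train(trainSet[,predictors], trainSet$FiveHundredPlus, method = "nnet")
model_glm <- train(trainSet[,predictors], trainSet$FiveHundredPlus, method = "glm")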

5 – Test the models

I tested the models using the same methods as in my previous post where I made decision trees. How do these new models, albeit without any tuning, compare to the decision trees?
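For each model the checks would look roughly like this (a sketch; I’m assuming the pROC package for the AUC, and showing the gbm only – the other three follow the same pattern):

library(pROC)

#Accuracy from a confusion matrix of predicted vs actual classes
predictions <- predict(model_gbm, testSet[,predictors])
confusionMatrix(predictions, testSet$FiveHundredPlus)

#AUC from the predicted class probabilities
probs <- predict(model_gbm, testSet[,predictors], type = "prob")
roc_gbm <- roc(testSet$FiveHundredPlus, probs[,2])
auc(roc_gbm)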

From the last post:

Model Accuracy AUC
Decision Tree 1 0.920 0.722
Decision Tree 2 0.920 0.722
Decision Tree 3 0.920 0.729
Decision Tree 4 0.920 0.850
Decision Tree 5 0.920 0.867

From the new models (built on smaller training and test datasets, so not strictly a like-for-like comparison):

Model Accuracy AUC
GBM 0.919 0.870
Random Forest 0.920 0.613
Neural Net 0.919 0.879
GLM 0.919 0.875

ROC Curves (I can’t work out how to put a legend on):

Something odd is clearly going on with the random forest but the other three are fairly similar.
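On the legend problem above: one possible approach with base graphics, assuming a pROC roc object for each model (roc_gbm as in the sketch earlier, with roc_rf, roc_nnet and roc_glm built the same way):

#Overlay the four curves and label them with legend()
plot(roc_gbm, col = "blue")
plot(roc_rf, col = "red", add = TRUE)
plot(roc_nnet, col = "green", add = TRUE)
plot(roc_glm, col = "purple", add = TRUE)
legend("bottomright", legend = c("GBM", "Random Forest", "Neural Net", "GLM"),
       col = c("blue", "red", "green", "purple"), lwd = 2)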

6 – Improvements

The aim of this post was to get an output, not make a good model. The next task will be to do just that. How can I input better variables? How can I tune the models? What is the best model for this job?
