In the last two posts I created some simple decision trees and tested their accuracy. Now it’s time to try some other models. As before, I’m going to continue predicting the variable FiveHundredPlus with a limited set of factors to keep the processing demands down. Once I’m a bit more confident I’ll move to the larger dataset and a more powerful machine. I’m going to use the caret package and recreate this post from Analytics Vidhya.
The full code is saved on my GitHub page here.
1 – Install the caret package
install.packages("caret", dependencies = c("Depends", "Suggests"))
This takes a very long time, which is a bit annoying since I have to create a virtual machine and start from scratch every time I want to use R (my home PC is a bit rubbish). But you can plan around this fairly easily.
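One way to plan around it (my own workaround, not something from the Analytics Vidhya post) is to guard the install so it only runs when caret is actually missing, and re-running the script is then cheap:

#Only install caret if it isn't already available
if (!requireNamespace("caret", quietly = TRUE)) {
  install.packages("caret", dependencies = c("Depends", "Suggests"))
}
library(caret)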
2 – Prepare the data
The initial part of this data preparation is the same as in the previous post; the difference is getting the data into a form that is ready for machine learning.
#Converting every categorical variable to numerical using dummy variables
dmy <- dummyVars(" ~ .", data = Initialdata, fullRank = T)
Initialdata_transformed <- data.frame(predict(dmy, newdata = Initialdata))

#Checking the structure of the transformed data
str(Initialdata_transformed)
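To make fullRank = T concrete, here’s a toy example (made-up data, not from the property dataset): each factor level becomes a 0/1 column, with the reference level dropped so the columns aren’t linearly dependent.

#Toy illustration of dummyVars: the factor Type becomes a single 0/1 column
#(Type.House), with the reference level Flat dropped because fullRank = T
toy <- data.frame(Type = factor(c("Flat", "House", "Flat")), Price = c(1, 2, 3))
predict(dummyVars(" ~ .", data = toy, fullRank = T), newdata = toy)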
3 – Split the data
This method is better than the one I used in my previous post because the split keeps a similar proportion of the response variable in each partition.
#Splitting the data into two parts based on outcome: 75% and 25%
set.seed(123) #for a reproducible split (my addition)
index <- createDataPartition(Initialdata_transformed$FiveHundredPlus, p=0.75, list=FALSE)
trainSet <- Initialdata_transformed[index,]
testSet <- Initialdata_transformed[-index,]
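As a quick sanity check (my addition), you can compare the class proportions in the two partitions; they should come out roughly equal, which is exactly what createDataPartition is for:

#The outcome proportions should be similar in both partitions
prop.table(table(trainSet$FiveHundredPlus))
prop.table(table(testSet$FiveHundredPlus))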
4 – Make some models
Now it’s time to make some simple models. In this post I’m going to make a GBM, a neural net, a GLM and a random forest, all without any tuning. The first time I tried this I left the variables District and Town in, and that was a mistake: between them they had over 1,200 levels, which meant that even on a month’s worth of data and what I thought was a powerful AWS machine, nothing had really happened after 6 hours. So the models below don’t have these variables included (the sketch below shows how I dropped them).
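For reference, this is roughly how the high-cardinality columns get excluded (my reconstruction; it needs to happen in the data-prep step, before the dummyVars call in section 2):

#Drop the high-cardinality variables before dummy coding
Initialdata <- Initialdata[, !(names(Initialdata) %in% c("District", "Town"))]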
#Factors for the model
outcomeName <- 'FiveHundredPlus'
predictors <- colnames(trainSet)[!(colnames(trainSet) %in% c("Price", "FiveHundredPlus"))]

#Try applying some models
model_gbm <- train(trainSet[,predictors], trainSet[,outcomeName], method='gbm')
model_rf <- train(trainSet[,predictors], trainSet[,outcomeName], method='rf')
model_nnet <- train(trainSet[,predictors], trainSet[,outcomeName], method='nnet')
model_glm <- train(trainSet[,predictors], trainSet[,outcomeName], method='glm')
5 – Test the models
I tested the models using the same methods as in my previous post where I made decision trees (a sketch of the approach is below). How do these new models, albeit completely untuned, compare to the decision trees?
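For completeness, the evaluation looks roughly like this (a sketch assuming the pROC package, as used in the previous post; swap model_gbm for each model in turn):

#Accuracy from the predicted classes, AUC from the predicted probabilities
library(pROC)
predictions <- predict(model_gbm, testSet[,predictors])
probs <- predict(model_gbm, testSet[,predictors], type = "prob")
accuracy <- mean(predictions == testSet[,outcomeName])
rocCurve <- roc(testSet[,outcomeName], probs[,2])
auc(rocCurve)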
From the last post:
| Model | Accuracy | AUC |
| --- | --- | --- |
| Decision Tree 1 | 0.920 | 0.722 |
| Decision Tree 2 | 0.920 | 0.722 |
| Decision Tree 3 | 0.920 | 0.729 |
| Decision Tree 4 | 0.920 | 0.850 |
| Decision Tree 5 | 0.920 | 0.867 |
From the new models (trained and tested on smaller datasets, so not strictly a fair comparison):
| Model | Accuracy | AUC |
| --- | --- | --- |
| GBM | 0.919 | 0.870 |
| Random Forest | 0.920 | 0.613 |
| Neural Net | 0.919 | 0.879 |
| GLM | 0.919 | 0.875 |
ROC curves for the four models (I can’t work out how to put a legend on; one possible approach is sketched below):
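For the record, one way to add a legend with base graphics (a sketch; roc_gbm and friends are hypothetical names for pROC roc objects built as in the testing sketch above):

#Overlay the four ROC curves and label them
plot(roc_gbm, col = "blue")
plot(roc_rf, col = "red", add = TRUE)
plot(roc_nnet, col = "green", add = TRUE)
plot(roc_glm, col = "purple", add = TRUE)
legend("bottomright", legend = c("GBM", "Random Forest", "Neural Net", "GLM"),
       col = c("blue", "red", "green", "purple"), lwd = 2)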
Something odd is clearly going on with the random forest, but the other three are fairly similar.
6 – Improvements
The aim of this post was to get an output, not make a good model. The next task will be to do just that. How can I input better variables? How can I tune the models? What is the best model for this job?
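As a taste of where this is heading, caret’s trainControl and tuneLength arguments let you swap the default bootstrap resampling for cross-validation and search over hyperparameters (a sketch based on the caret documentation, not yet tested on this data):

#Cross-validation plus a small hyperparameter search for the GBM
fitControl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
model_gbm_tuned <- train(trainSet[,predictors], trainSet[,outcomeName],
                         method = 'gbm', trControl = fitControl, tuneLength = 5)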