In the last two posts I created some simple decision trees and tested their accuracy. Now it’s time to try some other models. As before, I’m going to continue predicting the variable FiveHundredPlus with a limited set of factors to keep the processing demands down. Once I’m a bit more confident I’ll move to the larger dataset and a more powerful machine. I’m going to use the caret package and recreate this post from Analytics Vidhya.

Full code saved on my github page here.

### 1 – Install the caret package

install.packages("caret", dependencies = c("Depends", "Suggests"))

This takes a very long time which is a bit annoying since I have to create a virtual machine and start from scratch every time I want to use R (my home PC is a bit rubbish). But you can plan around this fairly easily.

### 2 – Prepare the data

The initial part of this data preparation is the same as the previous post. The differences are around preparing the data into a form that is ready for machine learning.

#Converting every categorical variable to numerical using dummy variables
dmy <- dummyVars(" ~ .", data = Initialdata, fullRank = TRUE)
Initialdata_transformed <- data.frame(predict(dmy, newdata = Initialdata))

#Checking the structure of the transformed train file
str(Initialdata_transformed)

### 3 – Split the data

This method is better than the one used in my previous post because the split keeps a similar proportion of the response variable in each partition.

#Splitting training set into two parts based on outcome: 75% and 25%
index <- createDataPartition(sales2016_transformed$FiveHundredPlus, p = 0.75, list = FALSE)
trainSet <- sales2016_transformed[ index, ]
testSet  <- sales2016_transformed[-index, ]

### 4 – Make some models

Now it’s time to make some simple models. In this post I’m going to make a GBM, a neural net, a GLM and a random forest, all without any tuning. The first time I tried this I left the variables District and Town in, and that was a mistake. Between them these had over 1,200 levels, which meant that even on a month’s worth of data and what I thought was a powerful AWS machine, nothing had really happened after 6 hours. So the models below don’t include these variables.
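For completeness, dropping those columns before dummy-coding can be done with a one-liner; a sketch in base R, assuming Initialdata is the pre-transformation data frame from step 2:

```r
# Drop the high-cardinality factors before creating dummy variables;
# with over 1,200 levels between them, District and Town blow up the feature space
Initialdata <- Initialdata[, !(names(Initialdata) %in% c("District", "Town"))]
```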

#Factors for the model
outcomeName <- 'FiveHundredPlus'
predictors <- colnames(trainSet)[!(colnames(trainSet) %in% c("Price", "FiveHundredPlus"))]

#Try applying some models
model_gbm  <- train(trainSet[, predictors], trainSet[, outcomeName], method = 'gbm')
model_rf   <- train(trainSet[, predictors], trainSet[, outcomeName], method = 'rf')
model_nnet <- train(trainSet[, predictors], trainSet[, outcomeName], method = 'nnet')
model_glm  <- train(trainSet[, predictors], trainSet[, outcomeName], method = 'glm')

### 5 – Test the models

I tested the models using the same methods as in my previous post, where I made decision trees. How do these new models, albeit untuned, compare to the decision trees?
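For reference, a minimal sketch of how accuracy and AUC can be computed for one of these models using caret and pROC (assuming FiveHundredPlus is a two-level factor; this is an illustration, not necessarily the exact code I ran):

```r
library(caret)
library(pROC)

# Accuracy from a confusion matrix of predicted vs actual classes
preds <- predict(model_gbm, testSet[, predictors])
confusionMatrix(preds, testSet[, outcomeName])

# AUC from the predicted class probabilities
probs <- predict(model_gbm, testSet[, predictors], type = "prob")
roc_gbm <- roc(testSet[, outcomeName], probs[, 2])
auc(roc_gbm)
```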

From the last post:

| Model | Accuracy | AUC |
|---|---|---|
| Decision Tree 1 | 0.920 | 0.722 |
| Decision Tree 2 | 0.920 | 0.722 |
| Decision Tree 3 | 0.920 | 0.729 |
| Decision Tree 4 | 0.920 | 0.850 |
| Decision Tree 5 | 0.920 | 0.867 |

From the new models (smaller training and test datasets so not strictly a good comparison):

| Model | Accuracy | AUC |
|---|---|---|
| GBM | 0.919 | 0.870 |
| Random Forest | 0.920 | 0.613 |
| Neural Net | 0.919 | 0.879 |
| GLM | 0.919 | 0.875 |

ROC curves (I can’t work out how to add a legend):

Something odd is clearly going on with the random forest but the other three are fairly similar.
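On the legend problem: base graphics can add one after the curves are plotted with pROC. A sketch, where roc_gbm, roc_rf, roc_nnet and roc_glm are hypothetical roc objects, one per model:

```r
library(pROC)

# Plot the first curve, then overlay the rest on the same axes
plot(roc_gbm, col = "blue")
lines(roc_rf, col = "red")
lines(roc_nnet, col = "green")
lines(roc_glm, col = "purple")

# Base-graphics legend matching the line colours
legend("bottomright",
       legend = c("GBM", "Random Forest", "Neural Net", "GLM"),
       col = c("blue", "red", "green", "purple"), lwd = 2)
```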

### 6 – Improvements

The aim of this post was to get an output, not to make a good model. The next task will be to do just that. How can I feed in better variables? How can I tune the models? Which model is best for this job?