Now that the land registry data has been imported and given some initial exploratory work, let’s have a go at making a price prediction model. I’ll use a small subset of the data and initially only try to predict whether a house is worth more or less than £500k, rather than tackle the more complicated task of predicting the price itself. The code in this post is largely based on the DataCamp course “Introduction to Machine Learning”. Code for this project is on my GitHub page here. This post focuses on decision trees using the rpart package.
The initial part of the code is just setting up the data. We then define the variable “FiveHundredPlus”:
sales2016$FiveHundredPlus <- ifelse(sales2016$Price<=500000, 0, 1)
It’s 0 if the house sold for £500k or less and 1 if it sold for more. It would be nice to jump straight into predicting the final price, but this course starts off by predicting classifier variables rather than continuous ones.
Then we split the data into a training dataset and a test dataset
#Randomly pick 70% of the data
index <- sample(1:nrow(sales2016), size = 0.7*nrow(sales2016))

#Select the 70% training subset and the 30% test subset, keeping only the columns I want
cols <- c("FiveHundredPlus", "Transfer_Month", "Property_Type", "Old_New", "Duration", "County")
training <- sales2016[index, cols]
test <- sales2016[-index, cols]
I have a training dataset (I know it’s a dataframe/tibble, but I’m too used to calling it a dataset) with 70% of the data (c.700k sales) and a holdout/test dataset with the remaining 30%.
Testing the models
Alongside the simple models I’m running simple tests of them. I’ll create a confusion matrix of my predictions and then use it to calculate the accuracy.
# Construct confusion matrices
conf_t <- table(test$FiveHundredPlus, pred_t)

# Calculate the accuracy
acc_t <- sum(diag(conf_t))/sum(conf_t)
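The `pred_t` used above comes from calling `predict()` on the fitted tree with `type = "class"` — that step isn’t shown in the post. Here is a minimal self-contained sketch of the whole test loop on made-up toy data (the toy data frame and its county values are purely illustrative; the real post uses the sales2016 split):

```r
library(rpart)

# Made-up toy data standing in for the land registry sales (illustration only)
set.seed(1)
toy <- data.frame(
  FiveHundredPlus = factor(rep(c(0, 1), each = 50)),
  County          = rep(c("Durham", "Greater London"), each = 50)
)
idx       <- sample(1:nrow(toy), size = 0.7 * nrow(toy))
toy_train <- toy[idx, ]
toy_test  <- toy[-idx, ]

toy_tree <- rpart(FiveHundredPlus ~ ., toy_train, method = "class")

# The step not shown in the post: class predictions on the holdout set
pred_t <- predict(toy_tree, toy_test, type = "class")

# Confusion matrix and accuracy, exactly as in the post
conf_t <- table(toy_test$FiveHundredPlus, pred_t)
acc_t  <- sum(diag(conf_t)) / sum(conf_t)
```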
The best model is the one with the highest accuracy.
Model 1: Decision tree, default settings
The first model is a simple decision tree using the package rpart.
tree <- rpart(FiveHundredPlus ~ ., training, method = "class")
Before testing the accuracy, let’s see what this tree looks like:
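(The tree plot itself isn’t reproduced here. One common way to draw an rpart tree, assuming the rpart.plot package is installed, is a one-liner; the toy data below is made up purely so the sketch runs on its own:)

```r
library(rpart)
library(rpart.plot)  # assumed installed; base plot(tree); text(tree) also works

# Fit a tree on made-up toy data purely so the sketch is self-contained
set.seed(1)
toy <- data.frame(
  FiveHundredPlus = factor(rep(c(0, 1), each = 50)),
  County          = rep(c("Durham", "Greater London"), each = 50)
)
toy_tree <- rpart(FiveHundredPlus ~ ., toy, method = "class")

rpart.plot(toy_tree)  # draws the splits with class labels and node proportions
```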
It’s a pretty simple tree
- If the house is in one of the counties Greater London, Buckinghamshire, Hertfordshire, Surrey, Windsor and Maidenhead or Wokingham, and it is a detached house, then the tree predicts it is worth more than £500k (70% of these houses are). Otherwise it predicts the house is worth less than £500k.
The accuracy of this tree is 0.92 which seems pretty good to me.
Models 2, 3, 4 and 5 - Changing the complexity control
The first tree is pretty simple, with just three branches. We can lower rpart’s complexity parameter (cp) so that the algorithm keeps splits that improve the fit by smaller and smaller amounts.
tree_2 <- rpart(FiveHundredPlus ~ ., training, method = "class", control = rpart.control(cp = 0.00025))
tree_3 <- rpart(FiveHundredPlus ~ ., training, method = "class", control = rpart.control(cp = 0.00015))
tree_4 <- rpart(FiveHundredPlus ~ ., training, method = "class", control = rpart.control(cp = 0.0001))
tree_5 <- rpart(FiveHundredPlus ~ ., training, method = "class", control = rpart.control(cp = 0.00001))
Which have progressively more complicated trees:
I’m not going to try to explain what is going on in these trees in words. The question is, which is the most accurate at predicting on the holdout sample?
| Model | Accuracy |
|---|---|
| Decision Tree 1 | 0.9201290 |
| Decision Tree 2 | 0.9202014 |
| Decision Tree 3 | 0.9202771 |
| Decision Tree 4 | 0.9204318 |
| Decision Tree 5 | 0.9204054 |
Adding the extra complexity barely moved the accuracy, so in my opinion the best model is the first, simplest tree.