Simple machine learning in R - Decision Trees

Now that the land registry data has been imported and had some initial exploratory work done on it, let’s have a go at making a price prediction model. I’ll use a small subset of the data and initially only try to predict whether a house is worth more or less than £500k, rather than take on the more complicated task of predicting the actual price. The code used in this post is largely based upon the DataCamp course “Introduction to Machine Learning”. Code for this project is on my GitHub page here. This post focuses on decision trees using the package rpart.
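As a rough illustration of the idea (not the code from the post or the DataCamp course), here is a minimal sketch of a classification tree with rpart. The data frame `houses` and its columns `property_type`, `new_build`, `london` and `price` are all made up for the example; the real land registry subset will look different.

```r
library(rpart)

set.seed(42)
n <- 400

# Made-up data standing in for a land registry subset (all columns are assumptions)
houses <- data.frame(
  property_type = factor(sample(c("D", "S", "T", "F"), n, replace = TRUE)),
  new_build     = factor(sample(c("Y", "N"), n, replace = TRUE, prob = c(0.1, 0.9))),
  london        = factor(sample(c("Y", "N"), n, replace = TRUE, prob = c(0.3, 0.7)))
)

# Simulated prices: detached houses and London addresses cost more, plus noise
base_price   <- 250000 + 200000 * (houses$property_type == "D") +
  300000 * (houses$london == "Y")
houses$price <- base_price * exp(rnorm(n, sd = 0.3))

# Binary target: is the house worth more than 500k?
houses$over_500k <- factor(ifelse(houses$price > 500000, "yes", "no"))

# Hold back 30% of the rows for testing
train_idx <- sample(n, size = round(0.7 * n))
train <- houses[train_idx, ]
test  <- houses[-train_idx, ]

# Fit a classification tree and predict classes for the held-back rows
fit  <- rpart(over_500k ~ property_type + new_build + london,
              data = train, method = "class")
pred <- predict(fit, newdata = test, type = "class")

# Confusion matrix and overall accuracy on the test rows
conf_matrix <- table(actual = test$over_500k, predicted = pred)
accuracy    <- sum(diag(conf_matrix)) / sum(conf_matrix)
print(conf_matrix)
print(accuracy)
```

Printing `fit` shows the splits the tree has chosen. Because the prices here are simulated, the accuracy says nothing about how well this would work on the real data.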

Land Registry Data

My computer has been struggling with some of the code I’ve been trying to run; it is pretty old and doesn’t have enough memory for large datasets in R. So rather than buy a better laptop, I’ve set up an Amazon Web Services account and, using this guide, set up a remote machine so I don’t have to use mine. I’m only using the free one for now, but if I want to have a go at processing something larger this will allow me to pay a small fee to use a more powerful machine for a short period of time.

Summarising data in R

Summarising data

One of the most frequent tasks I do is summarising data using either proc sql or proc means with code like this:

proc means data=inputdata nway missing noprint;
    class var1 var2;
    var var3 var4;
    output out=outputdata (drop = _type_ _freq_) sum=;
run;

Given that I use it in SAS a lot, I’m going to assume that I’ll use it in R a lot too, so it seems like the next sensible thing to learn.
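For comparison, a rough base R equivalent of the proc means step can be sketched with `aggregate()`. The contents of `inputdata` below are invented purely so the example runs; only the variable names come from the SAS code.

```r
# Invented example data using the variable names from the SAS code
inputdata <- data.frame(
  var1 = c("a", "a", "b", "b", "b"),
  var2 = c("x", "x", "x", "y", "y"),
  var3 = c(1, 2, 3, 4, 5),
  var4 = c(10, 20, 30, 40, 50)
)

# Sum var3 and var4 within each combination of var1 and var2,
# much like class var1 var2 / var var3 var4 / sum= in proc means
outputdata <- aggregate(cbind(var3, var4) ~ var1 + var2,
                        data = inputdata, FUN = sum)
print(outputdata)
```

One difference to be aware of: the formula interface of `aggregate()` drops rows containing `NA` by default, whereas the `missing` option in proc means keeps missing class values as their own group.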

First attempt at simple analysis in R - Part 1

It’s time to start some analysis, albeit very basic analysis. I want to look at the interaction between the ONS Rural score and the average Broadband speed. This will be done using the postcode file created in my previous post. I’m assuming that the more rural a place is, the slower its broadband will be. Is this actually the case?
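One simple way to start on a question like this is to average the speed within each rural score and look at a rank correlation. The sketch below uses simulated data: the data frame `postcodes` and the columns `rural_score` (treated here as a number from 1 to 8) and `avg_speed` are assumptions, not the layout of the real postcode file, and the negative relationship is built into the simulation rather than discovered.

```r
set.seed(1)

# Simulated stand-in for the postcode file (column names are assumptions)
postcodes <- data.frame(rural_score = sample(1:8, 500, replace = TRUE))

# Fake speeds that fall as the rural score rises, with noise, floored at 0.5
postcodes$avg_speed <- pmax(0.5,
  30 - 2.5 * postcodes$rural_score + rnorm(500, sd = 5))

# Mean broadband speed for each rural score
speed_by_rural <- aggregate(avg_speed ~ rural_score, data = postcodes, FUN = mean)
print(speed_by_rural)

# Rank correlation between rural score and speed
rho <- cor(postcodes$rural_score, postcodes$avg_speed, method = "spearman")
print(rho)
```

A Spearman correlation is used because a rural score is at best ordinal, so only the ordering of the categories is relied upon.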

The aim of this exercise is to learn some R skills, not to do some rigorous analysis. This means that some rather broad and potentially foolish assumptions will be made with the data to make some things easier to code given my novice R skills.

RSS Challenge 2015

I thought I’d take a look at the RSS (Royal Statistical Society) “Statistical Analytics Challenge” after being sent it at work today. It involves analysing eye movement on 60 pictures.

Whilst I’m not going to enter the competition I am going to have a go and see how far I get. My plan goes something like this:

  • Read the image into R
  • Split it into a grid
    • Initially a large grid and then progressively smaller ones
  • Calculate some properties of each of the grid cells
  • Check how each of these properties correlates with where the eye movement points are
  • Check the properties of the surrounding grid cells relative to the current cell
  • See how these new properties interact with the eye movements
  • Do each of the above for a number of pictures to come up with a model and then test this on one of the other pictures.
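The first few steps of the plan can be sketched in base R as below. This is only an illustration under some assumptions: the image is a random matrix standing in for a greyscale picture (a real one could be read with a package such as png or jpeg), the fixation points are made up, and `grid_stats` is a hypothetical helper name rather than anything from the challenge.

```r
set.seed(7)

# Stand-in greyscale image (rows x columns) and made-up eye fixation points
img <- matrix(runif(120 * 160), nrow = 120, ncol = 160)
fix <- data.frame(x = runif(50, 1, 160),   # column position of each fixation
                  y = runif(50, 1, 120))   # row position of each fixation

# Hypothetical helper: split the image into an n_rows x n_cols grid and
# compute a few properties plus a fixation count for every cell
grid_stats <- function(img, fix, n_rows, n_cols) {
  # Which grid row / grid column each pixel row / pixel column falls in
  row_cell <- cut(seq_len(nrow(img)), breaks = n_rows, labels = FALSE)
  col_cell <- cut(seq_len(ncol(img)), breaks = n_cols, labels = FALSE)

  # Cell id for every pixel, numbered across each grid row in turn
  cell_id <- outer(row_cell, col_cell, function(r, cc) (r - 1) * n_cols + cc)
  ids <- factor(cell_id, levels = seq_len(n_rows * n_cols))

  # Cell id for every fixation point, clamped to the image bounds
  fix_row <- row_cell[pmin(nrow(img), pmax(1, round(fix$y)))]
  fix_col <- col_cell[pmin(ncol(img), pmax(1, round(fix$x)))]
  fix_id  <- factor((fix_row - 1) * n_cols + fix_col, levels = levels(ids))

  data.frame(
    cell            = seq_len(n_rows * n_cols),
    mean_brightness = as.vector(tapply(img, ids, mean)),
    sd_brightness   = as.vector(tapply(img, ids, sd)),
    fixations       = as.vector(table(fix_id))
  )
}

# Start with a large grid, then a finer one
coarse <- grid_stats(img, fix, n_rows = 3, n_cols = 4)
fine   <- grid_stats(img, fix, n_rows = 6, n_cols = 8)

# How does one cell property relate to where the eye landed?
print(cor(coarse$mean_brightness, coarse$fixations))
print(cor(fine$sd_brightness, fine$fixations))
```

With random pixels and random fixations the correlations are meaningless; the point is only the shape of the pipeline. Neighbouring-cell properties and a model fitted across several pictures would build on the per-cell data frame returned here.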

