Summarising data

One of the most frequent tasks I do is summarising data using either proc sql or proc means with code like this:

proc means data=inputdata nway missing noprint;
    class var1 var2;
    var var3 var4;
    output out=outputdata (drop = _type_ _freq_) sum=;
run;

Given that I use it a lot in SAS, I’m going to assume that I’ll use it a lot in R, so it seems like the next sensible thing to learn.
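
As a rough idea of what the R side might look like, here is a minimal sketch using base R's aggregate(). The sample data is made up and only reuses the names from the SAS step above. One known difference: the missing option in proc means keeps missing class values as their own group, whereas aggregate() used this way drops rows whose grouping values are NA.

```r
# Made-up sample data standing in for the SAS dataset `inputdata`
inputdata <- data.frame(
  var1 = c("a", "a", "b", "b", "b"),
  var2 = c("x", "x", "x", "y", "y"),
  var3 = c(1, 2, 3, 4, 5),
  var4 = c(10, 20, 30, 40, NA)
)

# Sum var3 and var4 within each var1/var2 combination.
# na.rm = TRUE is passed through to sum() so missing analysis values are skipped.
outputdata <- aggregate(
  inputdata[c("var3", "var4")],
  by = inputdata[c("var1", "var2")],
  FUN = sum,
  na.rm = TRUE
)

print(outputdata)
```

Only combinations that actually occur in the data appear in the output, which is roughly what nway gives in SAS.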

One thing I keep meaning to look at is what data is publicly available on the trains in the UK, and whether it can answer a few questions:

  • What proportion of the trains in the UK is run by foreign governments?
    • At a rough guess it is in the region of 20% based on not very much
    • How would this be measured? Passenger miles?
    • Why can the French, Dutch and German governments run UK rail franchises but not the UK government? It makes no sense!
  • What subsidies are given to the different franchises?
  • Are the original tenders released publicly?
    • Do the train companies live up to these documents?


It’s time to start some analysis, albeit very basic analysis. I want to look at the interaction between the ONS Rural score and the average Broadband speed. This will be done using the postcode file created in my previous post. I’m assuming that the more rural a place is, the slower its broadband will be. Is this actually the case?

The aim of this exercise is to learn some R skills, not to do rigorous analysis. This means that some rather broad and potentially foolish assumptions will be made about the data to make things easier to code given my novice R skills.
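
A first pass at that question could be as simple as averaging speed within each rural category. The sketch below is only an illustration: the rows are made up, the column names rural_code and avg_speed are assumptions rather than the real names in the postcode file, and the rural score is treated as a plain category.

```r
# Made-up rows standing in for the postcode file; column names are assumptions
postcodes <- data.frame(
  rural_code = c("urban", "urban", "rural", "rural", "rural"),
  avg_speed  = c(20, 30, 8, 10, NA)
)

# Average speed per rural category, ignoring postcodes with no speed recorded
speed_by_rural <- tapply(
  postcodes$avg_speed,
  postcodes$rural_code,
  mean,
  na.rm = TRUE
)

print(speed_by_rural)
```

If the averages fall as the categories get more rural, that would be consistent with the assumption, although it says nothing yet about how much the speeds vary within each category.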

I thought I’d take a look at the RSS (Royal Statistical Society) “Statistical Analytics Challenge” after being sent it at work today. It involves analysing eye movement on 60 pictures.

Whilst I’m not going to enter the competition I am going to have a go and see how far I get. My plan goes something like this:

  • Read the image into R
  • Split it into a grid
    • Initially a large grid and then progressively smaller ones
  • Calculate some properties of each of the grid cells
  • Check how each of these properties correlates with where the eye movement points are
  • Check the properties of the surrounding grid cells relative to the current cell
  • See how these new properties interact with the eye movements
  • Do each of the above for a number of pictures to come up with a model and then test this on one of the other pictures.
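
The grid steps in the plan above can be sketched in base R. This is an illustration under assumptions: the image is a made-up greyscale matrix rather than one of the competition pictures (reading a real image file would need an extra package), the 2 x 4 grid is arbitrary, and mean intensity stands in for whatever cell properties turn out to be useful.

```r
# Made-up greyscale "image": a 6 x 8 matrix of pixel intensities in [0, 1]
img <- matrix(seq(0, 1, length.out = 48), nrow = 6, ncol = 8)

# Grid size (an arbitrary choice): 2 rows of cells by 4 columns of cells
n_row_cells <- 2
n_col_cells <- 4

# Assign every pixel row and pixel column to a grid cell index
row_cell <- cut(seq_len(nrow(img)), breaks = n_row_cells, labels = FALSE)
col_cell <- cut(seq_len(ncol(img)), breaks = n_col_cells, labels = FALSE)

# Look up the grid cell of every individual pixel (column-major order)
pixel_row_cell <- row_cell[row(img)]
pixel_col_cell <- col_cell[col(img)]

# One property per grid cell: the mean pixel intensity
cell_mean <- tapply(
  as.vector(img),
  list(pixel_row_cell, pixel_col_cell),
  mean
)

print(round(cell_mean, 3))
```

Moving to a finer grid only means changing n_row_cells and n_col_cells, and other properties could be calculated by swapping mean for another function.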


Following on from my previous post on creating a postcode file with the 2015 general election results, I wanted to create a larger file with more variables, some from the ONS lookups and others from different public datasets. The one I have added so far is the Office of Communications Broadband Coverage dataset from 2013.

The final dataset will contain for each postcode in the UK:

  • The 2015 general election result
  • The Westminster Election Constituency
  • The Easting and Northing coordinates
  • Census lookup areas
  • Rural Indicator
  • Broadband coverage data
  • Which (if any) national park the postcode is in


I wanted to do some analysis on the results of the 2015 UK general election. In a lot of the datasets I use, the only geographical marker I have is the unit postcode, so before I can do any analysis on the results I first need to map all of the UK constituencies onto a postcode list. Luckily for me this can be done using the ONS postcode file, which has, amongst other things, the Westminster Electoral Constituencies for all UK unit postcodes. I would normally do this in SAS, but since I'm learning R I thought I would do it there first.

First I need to get my data:

  • ONS postcode file
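
The mapping described above boils down to a join on the constituency. Here is a minimal sketch with base R's merge(); the postcodes, constituency codes, party names and column names are all invented for illustration and are not the real ONS or election file layouts.

```r
# Made-up extract of a postcode lookup: one row per unit postcode
postcodes <- data.frame(
  postcode     = c("XX1 1AA", "XX1 1AB", "YY2 2AA"),
  constituency = c("C001", "C001", "C002"),
  stringsAsFactors = FALSE
)

# Made-up election results: one row per constituency
results <- data.frame(
  constituency  = c("C001", "C002"),
  winning_party = c("Party A", "Party B"),
  stringsAsFactors = FALSE
)

# Left join: keep every postcode and attach its constituency's result
postcode_results <- merge(
  postcodes,
  results,
  by = "constituency",
  all.x = TRUE
)

print(postcode_results)
```

Using all.x = TRUE keeps postcodes whose constituency has no matching result, which makes gaps in the lookup easy to spot as NA values.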

Over the past year I've been having a play at mapping things in Excel. It's not the best tool for the job, I know, but it does the trick. Mapping directly from Excel has the advantage that my data is already in Excel. I would like to do the same exercise in R, but that's for another day once I get to know R better.

The method I used is based on tips and methods from Chandoo and Clearly and Simply.

My output looks like this:

Which looks pretty good to me.