First attempt at simple analysis in R - Part 1

It’s time to start some analysis, albeit very basic analysis. I want to look at the relationship between the ONS Rural score and the average broadband speed, using the postcode file created in my previous post. I’m assuming that the more rural a place is, the slower its broadband will be. Is this actually the case?

The aim of this exercise is to learn some R skills, not to do rigorous analysis. This means that some rather broad and potentially foolish assumptions will be made about the data to make things easier to code given my novice R skills.

To look at this relationship, what do I need to do?

  • Put the rural indicator into a numeric form so it is easier to look at. There are a few assumptions I’m making here with this variable, some of which I don’t think are very good, but the point of this exercise is to learn how to use R, not statistical techniques.
    • I’m going to assume that the score for England and Wales has been calibrated with the Scottish score.
    • The scores for England and Wales are ordered 1-10, so C1 and C2 become 3 & 4, with C2 being more rural than the ONS D1. I’m not sure if this is true but it will do for now.
  • Look at the distribution of the two scores
    • Histograms will do well here
  • Plot a simple scatter diagram of the two variables to see any correlation
  • Summarise the data down to each rural score and weight by the number of connections
  • Check the correlation between the two. Not sure how to do this yet but will cross this bridge when I get to it.
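For the last step, one possible route is base R's `cor()`. The sketch below uses a made-up toy data frame, and the column name `RuralScore` for the numeric rural score is a hypothetical name, not something from the postcode file. Using Spearman's rank correlation is also an assumption on my part: it seems a reasonable fit because the rural score is an ordered category rather than a true measurement.

```r
# Toy data, invented for illustration only.
# RuralScore is a hypothetical name for the 1-10 numeric rural score.
speeds <- data.frame(
  RuralScore   = c(1, 2, 3, 5, 7, 10),
  AverageSpeed = c(18, 20, 12, 9, 10, 3)
)

# Spearman's rank correlation only uses the ordering of the values,
# which suits an ordinal score like the rural indicator.
rho <- cor(speeds$RuralScore, speeds$AverageSpeed, method = "spearman")
print(rho)
```

A value near -1 would suggest that speed falls as the score rises, and a value near 0 would suggest no monotonic relationship. Note that this version treats every postcode equally and ignores the number of connections.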

First of all I’m going to select the columns “AverageSpeed”, “NumberofConnections” and “RuralIndicator” and remove any rows with an “NA” value, since these rows are of no use. Then I’ll map the Scottish and English & Welsh scores onto a 1-10 numeric scale.
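A minimal sketch of these steps in base R follows. The three column names come from the postcode file, but everything else is an assumption made for illustration: the data frame names `postcodes` and `speeds`, the toy values, the new column `RuralScore`, the full list of England & Wales codes in the lookup, and the idea that the Scottish codes are already numeric and can be passed through unchanged. Swap in the real mapping before trusting any result.

```r
# Toy stand-in for the postcode file (values are made up).
postcodes <- data.frame(
  Postcode            = c("P1", "P2", "P3", "P4", "P5"),
  AverageSpeed        = c(12.4, 5.1, NA, 20.3, 8.7),
  NumberofConnections = c(30, 12, 8, 45, NA),
  RuralIndicator      = c("A1", "C2", "D1", "3", "F2"),
  stringsAsFactors    = FALSE
)

# Keep only the three columns of interest, then drop rows with any NA.
speeds <- postcodes[, c("AverageSpeed", "NumberofConnections", "RuralIndicator")]
speeds <- na.omit(speeds)

# ASSUMPTION: England & Wales codes in this order map to 1-10,
# so that C1 and C2 become 3 and 4.
ew_lookup <- c(A1 = 1, B1 = 2, C1 = 3, C2 = 4, D1 = 5,
               D2 = 6, E1 = 7, E2 = 8, F1 = 9, F2 = 10)

# ASSUMPTION: anything not in the lookup is a Scottish code that is
# already a number and is used as-is (a placeholder for the real mapping).
is_ew <- speeds$RuralIndicator %in% names(ew_lookup)
speeds$RuralScore <- ifelse(
  is_ew,
  ew_lookup[speeds$RuralIndicator],
  suppressWarnings(as.numeric(speeds$RuralIndicator))
)

print(speeds)
```

`na.omit()` drops a row if any of the selected columns is missing, which is why the columns are selected first: an NA in a column I don't care about shouldn't cost me a postcode.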

I now want to plot the distributions of these two variables by postcode count (not number of connections).
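Something along these lines would draw the two histograms with base R's `hist()`. The toy data and the column name `RuralScore` (the numeric 1-10 rural score) are assumptions for illustration. Because each row is one postcode, a plain histogram already counts postcodes rather than connections.

```r
# Toy data, one row per postcode (values are made up).
# RuralScore is a hypothetical name for the numeric rural score.
speeds <- data.frame(
  AverageSpeed = c(3.2, 5.1, 8.7, 9.9, 12.4, 14.0, 20.3, 7.5),
  RuralScore   = c(10, 8, 5, 5, 3, 1, 1, 4)
)

# Distribution of average speed across postcodes.
h_speed <- hist(speeds$AverageSpeed,
                main = "Average speed by postcode count",
                xlab = "Average speed")

# Distribution of the rural score; the breaks put each whole
# score from 1 to 10 in its own bar.
h_rural <- hist(speeds$RuralScore,
                breaks = seq(0.5, 10.5, by = 1),
                main = "Rural score by postcode count",
                xlab = "Rural score (1-10)")
```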

To find the average speed for each indicator I first need to find the “total speed” for the area. This is a nonsense variable in real terms but will be useful later when I need to calculate the average speed for each indicator.
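As a rough sketch, and assuming that “total speed” means the average speed multiplied by the number of connections in that postcode, the new column is a single line in R. The toy values and the column name `TotalSpeed` are made up for illustration.

```r
# Toy data (values are made up).
speeds <- data.frame(
  AverageSpeed        = c(10, 20, 5),
  NumberofConnections = c(2, 3, 4)
)

# ASSUMPTION: "total speed" = average speed x number of connections.
# Summing this later and dividing by the summed connections gives a
# connection-weighted average speed.
speeds$TotalSpeed <- speeds$AverageSpeed * speeds$NumberofConnections

print(speeds)
```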


A reasonable spread of scores. What about Rural Indicator?


Hmm, that isn’t as good but it will have to do. What does it look like when you plot a scatter plot of the two variables? Is it a nice obvious correlation?
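A scatter plot like this can be drawn with base R's `plot()`. In this sketch the toy data, the column name `RuralScore`, and the choice to put the rural score on the x-axis and the speed on the y-axis are all assumptions for illustration.

```r
# Toy data, one row per postcode (values are made up).
# RuralScore is a hypothetical name for the numeric rural score.
speeds <- data.frame(
  RuralScore   = c(1, 1, 3, 4, 5, 5, 8, 10),
  AverageSpeed = c(20.3, 14.0, 12.4, 7.5, 8.7, 9.9, 5.1, 3.2)
)

# One point per postcode: rural score along the bottom, speed up the side.
plot(speeds$RuralScore, speeds$AverageSpeed,
     main = "Average speed against rural score",
     xlab = "Rural score (1-10)",
     ylab = "Average speed")
```

With only ten possible x values, many points will sit on top of each other in the real data, so the plot can hide how many postcodes fall at each score.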


No, no it isn’t, although there are fewer points in the top left corner, so there is yet hope that this exercise isn’t totally pointless.

Now I want to aggregate the data by rural indicator and calculate the average broadband speed for each indicator. Finding the distributions in each score could also be interesting.
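One way this aggregation might look in base R is sketched below, using `aggregate()`. The toy values and the names `RuralScore`, `TotalSpeed`, `by_score` and `WeightedAverageSpeed` are assumptions for illustration, as is the definition of “total speed” as average speed times number of connections.

```r
# Toy data (values are made up).
# RuralScore is a hypothetical name for the numeric rural score.
speeds <- data.frame(
  RuralScore          = c(1, 1, 4, 4, 10),
  AverageSpeed        = c(20, 10, 8, 12, 2),
  NumberofConnections = c(30, 10, 5, 15, 4)
)

# ASSUMPTION: "total speed" = average speed x number of connections.
speeds$TotalSpeed <- speeds$AverageSpeed * speeds$NumberofConnections

# Sum total speed and connections within each rural score.
by_score <- aggregate(cbind(TotalSpeed, NumberofConnections) ~ RuralScore,
                      data = speeds, FUN = sum)

# Connection-weighted average speed for each rural score.
by_score$WeightedAverageSpeed <- by_score$TotalSpeed / by_score$NumberofConnections

print(by_score)
```

Dividing the summed total speed by the summed connections means a postcode with 30 connections counts for more than one with 10, which a plain mean of the postcode averages would not do.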

To be continued…
