Summarising data

One of the most frequent tasks I do is summarising data using either proc sql or proc means with code like this:

proc means data=inputdata nway missing noprint; class var1 var2; var var3 var4; output out=outputdata (drop = _type_ _freq_) sum=; run;

Given that I use it in SAS a lot I’m going to assume that I’ll use it in R a lot so it seems like the next sensible thing to learn.

From working through the book “R for SAS and SPSS Users” by Bob Muechen, available here from Amazon, I have learnt that in R is isn’t quite as simple as this but it is still possible and not that difficult.

There is a function in R already called “aggregate” but this renames all of your variables which isn’t great. So Bob Meuchen recommends the “summarize” function in the “Hmisc” package. The syntax is pretty much the same but it doesn’t rename your variables. I have attempted using them both in the code at the bottom of the page using the postcode file I created in a previous post.

Yet there is still a problem with this function. That is that is only summarises by one variable at a time. So to get around this you need to use the function “melt” in the package “Reshape2” to convert your data into a long form and then aggregate it. Before using melt on a large dataframe like my postcode file you need to remember to keep only the by variables and your aggregate variables. Otherwise your new dataframe could be huge and make R crash. As it did with me multiple times before I worked out what was going wrong.

Once you’ve done your melt you can apply the summarize function including the column “variable” in your "llist" and setting the aggregate variable to be “value”. Your aggregated variable then needs to be made wide again using the function “dcast” from the “reshape2” package.

Hopefully in a later post I'll actually show this in use.