'Simple aggregation by id variables in R

I'm having a problem with the aggregate function. My data.frame looks like this:

**Region    Sex   SNI      value**
orebro      Man     0       497
orebro      Man     0        1
orebro      Man     1       120
referens    Man     96      3045
referens    Man     96       41
referens    Woman   0       2061
referens    Woman   0       2450

But I want it to look like this:

**  Region      Sex   SNI      value**
    orebro      Man     0       498
    orebro      Man     1       120
    referens    Man     96      3086
    referens    Woman   0       4106

So I want to merge all observations that have identical values for Region, Sex and SNI. I tried:

mydata2 <-aggregate(mydata, by=list(mydata$Region, mydata$Sex, mydata$SNI), 
                FUN=mean, na.rm=TRUE)

But I get the warning message:

50: In mean.default(X[[50L]], ...) :
  argument is not numeric or logical: returning NA

And the resulting dataframe is corrupted. What am I doing wrong? I guess it has something to do with trying to merge strings?



Solution 1:[1]

If aggregate() is given a data.frame as its first argument, then it tries to aggregate every column of that data.frame separately, using FUN(). This means it will run your region, sex, and sni columns through mean(), which is incorrect. Instead, you need to pass just the value column in the first argument, and you need to be careful not to extract the vector (as opposed to retaining the data.frame structure), otherwise you'd lose the column name.

Secondly, your list in your second argument is unnamed, which means the result would lose column names for the grouping columns. You could solve this by explicitly naming them, i.e. list(Region=mydata$Region, ... ), but there's a better way, which is to just index out the grouping columns from the data.frame. This works because data.frames are lists internally.

Here's how it works using 2D indexing:

df <- data.frame(region=c('orebro','orebro','orebro','referens','referens','referens','referens'), sex=c('Man','Man','Man','Man','Man','Woman','Woman'), sni=c(0,0,1,96,96,0,0), value=c(497,1,120,3045,41,2061,2450) );
aggregate(df[,'value',drop=F],by=df[,c('region','sex','sni')],sum,na.rm=T);
##     region   sex sni value
## 1   orebro   Man   0   498
## 2 referens Woman   0  4511
## 3   orebro   Man   1   120
## 4 referens   Man  96  3086

Or, using list indexing:

aggregate(df['value'],by=df[c('region','sex','sni')],sum,na.rm=T);
##     region   sex sni value
## 1   orebro   Man   0   498
## 2 referens Woman   0  4511
## 3   orebro   Man   1   120
## 4 referens   Man  96  3086

Now, there's actually an even better way, which is to use the formula interface of aggregate():

aggregate(value~region+sex+sni,df,sum,na.rm=T);
##     region   sex sni value
## 1   orebro   Man   0   498
## 2 referens Woman   0  4511
## 3   orebro   Man   1   120
## 4 referens   Man  96  3086

Also, you may have noticed that I used sum() instead of mean(). I did this because your expected output has sums rather than averages, although you have an incorrect value for referens/Woman/0.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1