'Simple aggregation by id variables in R
I'm having a problem with the aggregate function. My data.frame looks like this:
**Region Sex SNI value**
orebro Man 0 497
orebro Man 0 1
orebro Man 1 120
referens Man 96 3045
referens Man 96 41
referens Woman 0 2061
referens Woman 0 2450
But I want it to look like this:
** Region Sex SNI value**
orebro Man 0 498
orebro Man 1 120
referens Man 96 3086
referens Woman 0 4106
So I want to merge all observations that have identical values for Region, Sex and SNI. I tried:
mydata2 <-aggregate(mydata, by=list(mydata$Region, mydata$Sex, mydata$SNI),
FUN=mean, na.rm=TRUE)
But I get the warning message:
50: In mean.default(X[[50L]], ...) :
argument is not numeric or logical: returning NA
And the resulting dataframe is corrupted. What am I doing wrong? I guess it has something to do with trying to merge strings?
Solution 1:[1]
If aggregate()
is given a data.frame as its first argument, then it tries to aggregate every column of that data.frame separately, using FUN()
. This means it will run your region, sex, and sni columns through mean()
, which is incorrect. Instead, you need to pass just the value column in the first argument, and you need to be careful not to extract the vector (as opposed to retaining the data.frame structure), otherwise you'd lose the column name.
Secondly, your list in your second argument is unnamed, which means the result would lose column names for the grouping columns. You could solve this by explicitly naming them, i.e. list(Region=mydata$Region, ... )
, but there's a better way, which is to just index out the grouping columns from the data.frame. This works because data.frames are lists internally.
Here's how it works using 2D indexing:
df <- data.frame(region=c('orebro','orebro','orebro','referens','referens','referens','referens'), sex=c('Man','Man','Man','Man','Man','Woman','Woman'), sni=c(0,0,1,96,96,0,0), value=c(497,1,120,3045,41,2061,2450) );
aggregate(df[,'value',drop=F],by=df[,c('region','sex','sni')],sum,na.rm=T);
## region sex sni value
## 1 orebro Man 0 498
## 2 referens Woman 0 4511
## 3 orebro Man 1 120
## 4 referens Man 96 3086
Or, using list indexing:
aggregate(df['value'],by=df[c('region','sex','sni')],sum,na.rm=T);
## region sex sni value
## 1 orebro Man 0 498
## 2 referens Woman 0 4511
## 3 orebro Man 1 120
## 4 referens Man 96 3086
Now, there's actually an even better way, which is to use the formula interface of aggregate()
:
aggregate(value~region+sex+sni,df,sum,na.rm=T);
## region sex sni value
## 1 orebro Man 0 498
## 2 referens Woman 0 4511
## 3 orebro Man 1 120
## 4 referens Man 96 3086
Also, you may have noticed that I used sum()
instead of mean()
. I did this because your expected output has sums rather than averages, although you have an incorrect value for referens/Woman/0.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 |