dcast warning: ‘Aggregation function missing: defaulting to length’
My df looks like this:
Id Task Type Freq
3 1 A 2
3 1 B 3
3 2 A 3
3 2 B 0
4 1 A 3
4 1 B 3
4 2 A 1
4 2 B 3
I want to restructure by Id and get:
Id A B … Z
3 5 3
4 4 6
I tried:
df_wide <- dcast(df, Id + Task ~ Type, value.var="Freq")
and got the following warning:
Aggregation function missing: defaulting to length
I can't figure out what to put in the fun.aggregate argument. What's the problem?
Solution 1:[1]
The reason you are getting this warning is given in the description of fun.aggregate (see ?dcast):
aggregation function needed if variables do not identify a single observation for each output cell. Defaults to length (with a message) if needed but not specified
So, an aggregation function is needed when there is more than one value for one spot in the wide dataframe.
An explanation based on your data:
When you use dcast(df, Id + Task ~ Type, value.var="Freq") you get:
Id Task A B
1 3 1 2 3
2 3 2 3 0
3 4 1 3 3
4 4 2 1 3
This is as expected, because each combination of Id, Task and Type has only one value in Freq. But when you use dcast(df, Id ~ Type, value.var="Freq") you get this (including a warning message):
Aggregation function missing: defaulting to length
Id A B
1 3 2 2
2 4 2 2
Now, looking back at the top part of your data:
Id Task Type Freq
3 1 A 2
3 1 B 3
3 2 A 3
3 2 B 0
You see why this is the case. For each combination of Id and Type there are two values in Freq (for Id 3: 2 and 3 for Type A, and 3 and 0 for Type B), while the wide dataframe has room for only one value per combination of Id and Type. Therefore dcast wants to aggregate these values into one value. The default aggregation function is length, which is why every cell in the output above is 2: it counts the values that fall into each cell. You can use other aggregation functions like sum, mean, sd or a custom function by specifying them with fun.aggregate.
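The check described here can be sketched in base R, without dcast. This is only an illustration: it assumes df is a plain data.frame rebuilt from the rows in the question, and the names cell_counts and needs_aggregation are made up for the example.

```r
# Rebuild the example data from the question
df <- data.frame(
  Id   = c(3, 3, 3, 3, 4, 4, 4, 4),
  Task = c(1, 1, 2, 2, 1, 1, 2, 2),
  Type = c("A", "B", "A", "B", "A", "B", "A", "B"),
  Freq = c(2, 3, 3, 0, 3, 3, 1, 3)
)

# Count how many Freq values land in each Id/Type cell.
# Note: the count is returned in a column that is still named Freq.
cell_counts <- aggregate(Freq ~ Id + Type, data = df, FUN = length)
print(cell_counts)

# Any count above 1 means a formula of Id ~ Type has to aggregate
needs_aggregation <- any(cell_counts$Freq > 1)

# With Task added to the key, every cell holds exactly one value
task_counts <- aggregate(Freq ~ Id + Task + Type, data = df, FUN = length)
```

Here every Id/Type cell holds two values, while every Id/Task/Type cell holds one, which matches the two dcast results shown above.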
For example, with fun.aggregate = sum you get:
Id A B
1 3 5 3
2 4 4 6
Now there is no warning because dcast is being told what to do when there is more than one value: return the sum of the values.
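For completeness, a self-contained version of the sum example might look like the following. It assumes the reshape2 package is installed and that df is a data.frame built from the rows in the question; the question does not say which package its dcast comes from.

```r
library(reshape2)  # assumption: dcast is taken from reshape2

# Example data from the question
df <- data.frame(
  Id   = c(3, 3, 3, 3, 4, 4, 4, 4),
  Task = c(1, 1, 2, 2, 1, 1, 2, 2),
  Type = c("A", "B", "A", "B", "A", "B", "A", "B"),
  Freq = c(2, 3, 3, 0, 3, 3, 1, 3)
)

# One row per Id, one column per Type, cells hold the summed Freq
df_wide <- dcast(df, Id ~ Type, value.var = "Freq", fun.aggregate = sum)
print(df_wide)
```

Because Task is left out of the formula, the two tasks of each Id are collapsed into a single row, which is the layout asked for in the question.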
Solution 2:[2]
fun.aggregate is required when different values in the value.var column, corresponding to identical values (or combinations of values) of the variables on the LHS of the dcast formula (e.g. "Id"), are forced into one cell by the combination of variables on the RHS of the formula (e.g. "Type").
Defaulting to length() in dcast is informative as it
- could suggest existence of coupling in the data, and
- locates length > 1 cases that may require attention.
More informative would be using the function list() as fun.aggregate, since it shows which value.var values are involved in each case:
dcast(dt, Id ~ Type, fun.aggregate = list, value.var = 'Freq')
Id A B
1: 3 2,3 3,0
2: 4 3,1 3,3
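The call above uses dt rather than df, and the row labels in the printed result (1:, 2:) suggest that dt is a data.table. That is an assumption, since the answer does not show how dt was created. Under that assumption, a self-contained sketch could be:

```r
library(data.table)  # assumption: dt is a data.table

# Example data from the question, stored as a data.table
dt <- data.table(
  Id   = c(3, 3, 3, 3, 4, 4, 4, 4),
  Task = c(1, 1, 2, 2, 1, 1, 2, 2),
  Type = c("A", "B", "A", "B", "A", "B", "A", "B"),
  Freq = c(2, 3, 3, 0, 3, 3, 1, 3)
)

# Each cell becomes a list element holding all Freq values for that Id/Type
wide <- dcast(dt, Id ~ Type, fun.aggregate = list, value.var = "Freq")
print(wide)

# lengths() recovers what the default length aggregation would report
cell_sizes <- lengths(wide$A)
```

The list columns are useful for inspection, but they usually need a further length-one summary before the table is used elsewhere.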
Basically, table cells have length = 1. Therefore, the defaulting situation in dcast can be resolved either by modifying the formula or by implementing a length-one summarization (aggregation): operators, custom functions or available functions that give a length-one result in each case and are fit for the purpose.
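The two routes named here can be sketched side by side. This again assumes reshape2 and a data.frame built from the question's rows; mean is only one possible length-one summary, chosen for illustration.

```r
library(reshape2)  # assumption: dcast is taken from reshape2

df <- data.frame(
  Id   = c(3, 3, 3, 3, 4, 4, 4, 4),
  Task = c(1, 1, 2, 2, 1, 1, 2, 2),
  Type = c("A", "B", "A", "B", "A", "B", "A", "B"),
  Freq = c(2, 3, 3, 0, 3, 3, 1, 3)
)

# Route 1: modify the formula so every cell has exactly one value.
# Id + Task identifies each row uniquely, so no aggregation is needed.
by_task <- dcast(df, Id + Task ~ Type, value.var = "Freq")

# Route 2: keep Id ~ Type and supply a function that returns one value per cell
by_mean <- dcast(df, Id ~ Type, value.var = "Freq", fun.aggregate = mean)

print(by_task)
print(by_mean)
```

Which route fits depends on whether the extra key (Task here) should survive in the wide table or be summarized away.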
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | Dragos Bandur |