'combine rows in data frame containing NA to make complete row
I know this is a duplicate Q but I can't seem to find the post again
Using the following data
df <- data.frame(A=c(1,1,2,2),B=c(NA,2,NA,4),C=c(3,NA,NA,5),D=c(NA,2,3,NA),E=c(5,NA,NA,4))
A B C D E
1 NA 3 NA 5
1 2 NA 2 NA
2 NA NA 3 NA
2 4 5 NA 4
Grouping by A
, I'd like the following output using a tidyverse
solution
A B C D E
1 2 3 2 5
2 4 5 3 4
I have many groups in A
. I think I saw an answer using coalesce
but am unsure how to get it work. I'd like a solution that works with characters
as well. Thanks!
Solution 1:[1]
I haven't figured out how to put the coalesce_by_column
function inside the dplyr
pipeline, but this works:
coalesce_by_column <- function(df) {
return(coalesce(df[1], df[2]))
}
df %>%
group_by(A) %>%
summarise_all(coalesce_by_column)
## A B C D E
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 2 3 2 5
## 2 2 4 5 3 4
Edit: include @Jon Harmon's solution for more than 2 members of a group
# Supply lists by splicing them into dots:
coalesce_by_column <- function(df) {
return(dplyr::coalesce(!!! as.list(df)))
}
df %>%
group_by(A) %>%
summarise_all(coalesce_by_column)
#> # A tibble: 2 x 5
#> A B C D E
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 3 2 5
#> 2 2 4 5 3 4
Solution 2:[2]
We can use fill
to fill all the missing values. And then filter just one row for each group.
library(dplyr)
library(tidyr)
df2 <- df %>%
group_by(A) %>%
fill(everything(), .direction = "down") %>%
fill(everything(), .direction = "up") %>%
slice(1)
And thanks to @Roger-123, the above code can be further simplified as follows.
df2 <- df %>%
group_by(A) %>%
fill(everything(), .direction = "downup") %>%
slice(1)
Solution 3:[3]
Not tidyverse
but here's one base R solution
df <- data.frame(A=c(1,1),B=c(NA,2),C=c(3,NA),D=c(NA,2),E=c(5,NA))
sapply(df, function(x) x[!is.na(x)][1])
#A B C D E
#1 2 3 2 5
With updated data
do.call(rbind, lapply(split(df, df$A), function(a) sapply(a, function(x) x[!is.na(x)][1])))
# A B C D E
#1 1 2 3 2 5
#2 2 4 5 3 4
Solution 4:[4]
Here is an even more general solution (using unique
, na.omit
to sort of create coalesce
), which can handle more than two rows with overlapping information. Super simply and forward.
> df <- data.frame(A=c(1,1,2,2,2),B=c(NA,2,NA,4,4),C=c(3,NA,NA,5,NA),D=c(NA,2,3,NA,NA),E=c(5,NA,NA,4,4))
> df
A B C D E
1 1 NA 3 NA 5
2 1 2 NA 2 NA
3 2 NA NA 3 NA
4 2 4 5 NA 4
5 2 4 NA NA 4
> df %>% group_by(A) %>% summarise_all(funs( na.omit(unique(.)) ))
# A tibble: 2 x 5
A B C D E
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2 3 2 5
2 2 4 5 3 4
Solution 5:[5]
A different tidyverse
possibility could be:
df %>%
gather(var, val, -A, na.rm = TRUE) %>%
group_by(A, var) %>%
distinct(val) %>%
spread(var, val)
A B C D E
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2 3 2 5
2 2 4 5 3 4
Here it, first, performs a wide-to-long data-transformation, excluding the "A" column and removing the missing values. Second, it groups by "A" column and the variable names. Third, it removes the duplicate values. Finally, it returns the data to its original wide format.
Solution 6:[6]
This is functionally identical to @Oriol Mirosa's answer without requiring a custom function:
EDIT: NAs must be omitted as per @thelatemail's comment. This answer was also given by @MrFlick in the duplicate thread linked above.
df %>% group_by(A) %>% summarise_all(~first(na.omit(.)))
I wanted to add to this as it seems to come up regularly for me and I've revisited this thread many times. @Oriol Mirosa's answer works, however I'm resistant to it because it's just complex enough to be difficult to remember (hence my return to this thread).
Personally, I also don't like writing small custom functions like if I don't need to. Attempting to substitute coalesce_by_column
with the actual coalesce
call results in type errors (which I find strange as the rows aren't interacting with each other but whatever). This can be resolved by first doing mutate_all(as.character)
, however my goal here is to minimize syntax so it's easily remembered on the fly.
Furthermore, this substitution changes the behavior such that non-identical values within a column throws an error (why things sometimes behave slightly differently within a function is beyond me). This behavior may be preferred in some situations, however in that case I would recommend @Jerry T's solution as there is no custom function and the ones used are familiar, readable, and the ordering of them (na.omit
and unique
) isn't relevant.
Solution 7:[7]
A simple way is to summarise
by taking the max
across all columns for each group with tidyverse
:
library(tidyverse)
df %>%
group_by(A) %>%
summarise(across(everything(), ~ max(., na.rm = T)))
Output
A B C D E
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2 3 2 5
2 2 4 5 3 4
Another option with using na.omit
and unique
:
df %>%
group_by(A) %>%
summarise(across(everything(), ~ na.omit(unique(.))))
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Tung |
Solution 2 | |
Solution 3 | |
Solution 4 | Jerry T |
Solution 5 | tmfmnk |
Solution 6 | |
Solution 7 | AndrewGB |