'How to left_join() two datasets but only select specific columns from one of the datasets?
Here are two datasets: (this is fake data)
library(tidyverse)
myfruit <- tibble(fruit_name = c("apple", "pear", "banana", "cherry"),
number = c(2, 4, 6, 8))
fruit_info <- tibble(fruit_name = c("apple", "pear", "banana", "cherry"),
colour = c("red", "green", "yellow", "d.red"),
batch_number = c(4, 4, 4, 4),
type = c("gala", "conference", "cavendish", "bing"),
weight = c(10, 11, 12, 13),
age_days = c(20, 22, 24, 16))
> myfruit
# A tibble: 4 x 2
fruit_name number
<chr> <dbl>
1 apple 2
2 pear 4
3 banana 6
4 cherry 8
> fruit_info
# A tibble: 4 x 6
fruit_name colour batch_number type weight age_days
<chr> <chr> <dbl> <chr> <dbl> <dbl>
1 apple red 4 gala 10 20
2 pear green 4 conference 11 22
3 banana yellow 4 cavendish 12 24
4 cherry d.red 4 bing 13 16
I want to use the dplyr::left_join()
function to combine myfruit
and fruit_info
together, but I only want "batch_number" and "type" columns from fruit_info
.
I know I can do this:
new_myfruit <- left_join(myfruit, fruit_info, by = "fruit_name") %>%
select(fruit_name, number, batch_number, type)
# OR
new_myfruit <- left_join(myfruit, fruit_info, by = "fruit_name") %>%
select(-colour, -weight, -age_days)
But if I had many more columns in fruit_info
and I had to type in many column names into the select()
function it would be very time-consuming. So, is there a more efficient way to do this?
Edit:
I've seen examples online where you can do something like this:
new_myfruits <- left_join(myfruit,
fruit_info %>% select(batch_number, type),
by = "fruit_name")
But I get an error which says:
# Error: Join columns must be present in data.
# x Problem with `fruit_name`.
Anyone know what I'm doing wrong?
I would appreciate any help :)
Solution 1:[1]
Try this. You can combine select()
with contains()
and in the last function add the tags you want to extract, so there is no need of setting each name individually or by column number. Here the code:
library(dplyr)
#Code
newdf <- left_join(myfruit, fruit_info, by = "fruit_name") %>%
select(contains(c('fruit','number','type')))
Output:
# A tibble: 4 x 4
fruit_name number batch_number type
<chr> <dbl> <dbl> <chr>
1 apple 2 4 gala
2 pear 4 4 conference
3 banana 6 4 cavendish
4 cherry 8 4 bing
Solution 2:[2]
Does this work:
> myfruit %>% left_join(
+ fruit_info %>% select(1,3,4)
+ )
Joining, by = "fruit_name"
# A tibble: 4 x 4
fruit_name number batch_number type
<chr> <dbl> <dbl> <chr>
1 apple 2 4 gala
2 pear 4 4 conference
3 banana 6 4 cavendish
4 cherry 8 4 bing
>
Solution 3:[3]
I believe you're getting that error because your select statement does not include "fruit_name", so it can't match on that column. Try new_myfruits <- left_join(myfruit, fruit_info %>% select(fruit_name, batch_number, type), by = "fruit_name")
instead.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Duck |
Solution 2 | Karthik S |
Solution 3 | Kelly |