'How to left_join() two datasets but only select specific columns from one of the datasets?

Here are two datasets: (this is fake data)

library(tidyverse)
myfruit <- tibble(fruit_name = c("apple", "pear", "banana", "cherry"),
                  number = c(2, 4, 6, 8))

fruit_info <- tibble(fruit_name = c("apple", "pear", "banana", "cherry"),
                     colour = c("red", "green", "yellow", "d.red"),
                     batch_number = c(4, 4, 4, 4),
                     type = c("gala", "conference", "cavendish", "bing"),
                     weight = c(10, 11, 12, 13),
                     age_days = c(20, 22, 24, 16))

> myfruit
# A tibble: 4 x 2
  fruit_name number
  <chr>       <dbl>
1 apple           2
2 pear            4
3 banana          6
4 cherry          8

> fruit_info
# A tibble: 4 x 6
  fruit_name colour batch_number type       weight age_days
  <chr>      <chr>         <dbl> <chr>       <dbl>    <dbl>
1 apple      red               4 gala           10       20
2 pear       green             4 conference     11       22
3 banana     yellow            4 cavendish      12       24
4 cherry     d.red             4 bing           13       16

I want to use the dplyr::left_join() function to combine myfruit and fruit_info together, but I only want "batch_number" and "type" columns from fruit_info.

I know I can do this:

new_myfruit <- left_join(myfruit, fruit_info, by = "fruit_name") %>%
  select(fruit_name, number, batch_number, type)

# OR

new_myfruit <- left_join(myfruit, fruit_info, by = "fruit_name") %>%
  select(-colour, -weight, -age_days)

But if I had many more columns in fruit_info and I had to type in many column names into the select() function it would be very time-consuming. So, is there a more efficient way to do this?

Edit:

I've seen examples online where you can do something like this:

new_myfruits  <- left_join(myfruit,
                           fruit_info %>% select(batch_number, type),
                           by = "fruit_name")

But I get an error which says:

# Error: Join columns must be present in data.
#  x Problem with `fruit_name`.

Anyone know what I'm doing wrong?

I would appreciate any help :)



Solution 1:[1]

Try this. You can combine select() with contains() and in the last function add the tags you want to extract, so there is no need of setting each name individually or by column number. Here the code:

library(dplyr)
#Code
newdf <- left_join(myfruit, fruit_info, by = "fruit_name") %>%
  select(contains(c('fruit','number','type')))

Output:

# A tibble: 4 x 4
  fruit_name number batch_number type      
  <chr>       <dbl>        <dbl> <chr>     
1 apple           2            4 gala      
2 pear            4            4 conference
3 banana          6            4 cavendish 
4 cherry          8            4 bing      

Solution 2:[2]

Does this work:

> myfruit %>% left_join(
+   fruit_info %>% select(1,3,4)
+ )
Joining, by = "fruit_name"
# A tibble: 4 x 4
  fruit_name number batch_number type      
  <chr>       <dbl>        <dbl> <chr>     
1 apple           2            4 gala      
2 pear            4            4 conference
3 banana          6            4 cavendish 
4 cherry          8            4 bing      
> 

Solution 3:[3]

I believe you're getting that error because your select statement does not include "fruit_name", so it can't match on that column. Try new_myfruits <- left_join(myfruit, fruit_info %>% select(fruit_name, batch_number, type), by = "fruit_name") instead.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Duck
Solution 2 Karthik S
Solution 3 Kelly