'Subset data based on variable prefix

I have a large dataset in which the answers to one question are distributed among various columns. However, if the columns belong together, they share the same prefix. I wonder how I can create a subset dataset of each question sorting based on the prefix.

Here is an example dataset. I would like to receive an efficient and easy adaptable solution to create a dataset only containing the values of either question one, two or three.

structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8), Question1a = c(1, 
1, NA, NA, 1, 1, 1, NA), Question1b = c(NA, 1, NA, 1, NA, 1, 
NA, 1), Question1c = c(1, 1, NA, NA, 1, NA, NA, NA), Question2a = c(1, 
NA, NA, NA, 1, 1, NA, NA), Question2b = c(NA, 1, NA, 1, NA, NA, 
NA, NA), Question3a = c(NA, NA, NA, NA, 1, 1, 1, NA), Question3b = c(NA, 
NA, 1, 1, NA, NA, NA, NA)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -8L))


Solution 1:[1]

I found a really intuitive solution using the dplyr package, using the select and starts_with commands. Alternatively, you can also replace the starts_with command with contains, if the you are not identifying the similar variables by a prefix but some other common feature.

Q1 <- Survey %>%
             select(
             starts_with("Question1")
             )
Q2 <- Survey %>%
             select(
             starts_with("Question2")
             )

Q3 <- Survey %>%
             select(
             starts_with("Question3")
             )

Solution 2:[2]

You can use sapply and a function:

list_data <- sapply(c("Question1", "Question2", "Question3"),
      function(x) df[startsWith(names(df),x)], simplify = FALSE)

This will store everything in a list. To get the individual data sets in the global environment as individual objects, use:

list2env(list_data, globalenv())

Output

# $Question1
# # A tibble: 8 × 3
# Question1a Question1b Question1c
# <dbl>      <dbl>      <dbl>
#   1          1         NA          1
# 2          1          1          1
# 3         NA         NA         NA
# 4         NA          1         NA
# 5          1         NA          1
# 6          1          1         NA
# 7          1         NA         NA
# 8         NA          1         NA
# 
# $Question2
# # A tibble: 8 × 2
# Question2a Question2b
# <dbl>      <dbl>
#   1          1         NA
# 2         NA          1
# 3         NA         NA
# 4         NA          1
# 5          1         NA
# 6          1         NA
# 7         NA         NA
# 8         NA         NA
# 
# $Question3
# # A tibble: 8 × 2
# Question3a Question3b
# <dbl>      <dbl>
#   1         NA         NA
# 2         NA         NA
# 3         NA          1
# 4         NA          1
# 5          1         NA
# 6          1         NA
# 7          1         NA
# 8         NA         NA

Solution 3:[3]

I believe the underlying question is about data-formats. Here's a few:

library(tidyverse)
structure(
  list(
    ID = c(1, 2, 3, 4, 5, 6, 7, 8),
    Question1a = c(1,
                   1, NA, NA, 1, 1, 1, NA),
    Question1b = c(NA, 1, NA, 1, NA, 1,
                   NA, 1),
    Question1c = c(1, 1, NA, NA, 1, NA, NA, NA),
    Question2a = c(1,
                   NA, NA, NA, 1, 1, NA, NA),
    Question2b = c(NA, 1, NA, 1, NA, NA,
                   NA, NA),
    Question3a = c(NA, NA, NA, NA, 1, 1, 1, NA),
    Question3b = c(NA,
                   NA, 1, 1, NA, NA, NA, NA)
  ),
  class = c("tbl_df", "tbl", "data.frame"),
  row.names = c(NA, -8L)
) -> square_df

square_df %>% 
  pivot_longer(-ID, 
               names_to = c("Question", "Item"),
               names_pattern = "Question(\\d+)(\\w+)") ->
  long_df
long_df
#> # A tibble: 56 × 4
#>       ID Question Item  value
#>    <dbl> <chr>    <chr> <dbl>
#>  1     1 1        a         1
#>  2     1 1        b        NA
#>  3     1 1        c         1
#>  4     1 2        a         1
#>  5     1 2        b        NA
#>  6     1 3        a        NA
#>  7     1 3        b        NA
#>  8     2 1        a         1
#>  9     2 1        b         1
#> 10     2 1        c         1
#> # … with 46 more rows

long_df %>% 
  na.omit(value) ->
  sparse_long_df
sparse_long_df
#> # A tibble: 22 × 4
#>       ID Question Item  value
#>    <dbl> <chr>    <chr> <dbl>
#>  1     1 1        a         1
#>  2     1 1        c         1
#>  3     1 2        a         1
#>  4     2 1        a         1
#>  5     2 1        b         1
#>  6     2 1        c         1
#>  7     2 2        b         1
#>  8     3 3        b         1
#>  9     4 1        b         1
#> 10     4 2        b         1
#> # … with 12 more rows

sparse_long_df %>% 
  nest(data = c(ID, Item, value)) ->
  nested_long_df
nested_long_df
#> # A tibble: 3 × 2
#>   Question data             
#>   <chr>    <list>           
#> 1 1        <tibble [12 × 3]>
#> 2 2        <tibble [5 × 3]> 
#> 3 3        <tibble [5 × 3]>

Created on 2022-05-12 by the reprex package (v2.0.1)

Solution 4:[4]

You could also use map to store each dataframe in a list, e.g.

 library(purrr)
  # 3 = number of questions
  map(c(1:3), 
     
     function(x){
       quest <- paste0("Question",x)
       select(df, ID, starts_with(quest))
     })

Output:

[[1]]
# A tibble: 8 x 4
     ID Question1a Question1b Question1c
  <dbl>      <dbl>      <dbl>      <dbl>
1     1          1         NA          1
2     2          1          1          1
3     3         NA         NA         NA
4     4         NA          1         NA
5     5          1         NA          1
6     6          1          1         NA
7     7          1         NA         NA
8     8         NA          1         NA

[[2]]
# A tibble: 8 x 3
     ID Question2a Question2b
  <dbl>      <dbl>      <dbl>
1     1          1         NA
2     2         NA          1
3     3         NA         NA
4     4         NA          1
5     5          1         NA
6     6          1         NA
7     7         NA         NA
8     8         NA         NA

[[3]]
# A tibble: 8 x 3
     ID Question3a Question3b
  <dbl>      <dbl>      <dbl>
1     1         NA         NA
2     2         NA         NA
3     3         NA          1
4     4         NA          1
5     5          1         NA
6     6          1         NA
7     7          1         NA
8     8         NA         NA

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2
Solution 3 Mossa
Solution 4 Julian