'Using dplyr and mutate to create new columns based on groups and last n rows

I have the following data frame as an example:

match_id <- c("match_1", "match_1","match_1","match_2","match_2","match_2","match_3","match_3","match_3", "match_4","match_4","match_4")
player_id <- c("player_1", "player_2", "player_3", "player_1", "player_2", "player_3", "player_1", "player_2", "player_3","player_1", "player_2", "player_3")
venue <- c("venue A", "venue A", "venue A", "venue B", "venue B", "venue B", "venue C", "venue C", "venue C","venue C", "venue C", "venue C")
opponent <- c("opponent A", "opponent A", "opponent A", "opponent B", "opponent B", "opponent B", "opponent C", "opponent C", "opponent C","opponent C", "opponent C", "opponent C")
points <- c(5,10,15,1,2,3,5,7,9,11,2,6)

data <- data.frame(match_id, player_id, venue, opponent, points)

I am looking to create new columns that show the last n values in the points column, based on groupings.

My attempt for the first group, using 3 games as an example:

#group by player, give the last n games
library(dplyr)
data <- data %>%
  arrange(player_id,desc(match_id)) %>%
  mutate(last3_games = lag(points, n=3)) %>%
  mutate(last2_games = lag(points, n=2)) %>%
  mutate(last1_games = lag(points, n=1))

Which gives:

 head(data)
  match_id player_id   venue   opponent points last3_games last2_games last1_games
1  match_4  player_1 venue C opponent C     11    NA         NA          NA
2  match_3  player_1 venue C opponent C      6    NA         NA          11
3  match_2  player_1 venue B opponent B      1    NA         11          5
4  match_1  player_1 venue A opponent A      5    11         5           1
5  match_4  player_2 venue C opponent C      2     5         1           5
6  match_3  player_2 venue C opponent C      7     1         5           2
7   match_2  player_2 venue B opponent B     3     5         2           7
8   match_1  player_2 venue A opponent A     10    2         7           2

But I would like the values shown as the last 3 values of points for each player, like so:

   match_id player_id   venue   opponent points last3_games last2_games last1_games
1   match_4  player_1 venue C opponent C     11    1         6           11
2   match_3  player_1 venue C opponent C      6    5         1           6
3   match_2  player_1 venue B opponent B      1    NA        5           1
4   match_1  player_1 venue A opponent A      5    NA        NA          5
5   match_4  player_2 venue C opponent C      2    3         7           2
6   match_3  player_2 venue C opponent C      7    10        3           7
7   match_2  player_2 venue B opponent B      3    NA        10          3
8   match_1  player_2 venue A opponent A     10    NA        NA          10

I also then want to do the same by opponent and venue like so:

#by opponent
       match_id player_id   venue   opponent points last3_opponent last2_opponent last1_opponent
1  match_4  player_1 venue C opponent C     11             NA             5             11
2  match_3  player_1 venue C opponent C      5             NA             NA            5
3  match_2  player_1 venue B opponent B      1             NA             NA            1
4  match_1  player_1 venue A opponent A      5             NA             NA            5
5  match_4  player_2 venue C opponent C      2             NA             7             2
6  match_3  player_2 venue C opponent C      7             NA             NA            7
7  match_2  player_2 venue B opponent B      2             NA             NA            2
8  match_1  player_2 venue A opponent A     10             NA             NA            10

There are many NA's here but this is because I am only showing a small example.

My attempt was to group these again and mutate using lag but it doesn't give the required output:

data <- data %>%
  arrange(player_id,desc(match_id), opponent) %>%
  mutate(last3_opponent = lag(points, n=3)) %>%
  mutate(last2_opponent = lag(points, n=2)) %>%
  mutate(last1_opponent = lag(points, n=1))

The output grouped by opponent AND venue should look like this:

    #by opponent and venue
  match_id player_id   venue   opponent points last3_opp_ven last2_opp_ven last1_opp_ven
1  match_1  player_1 venue A opponent A      5            NA            NA            5
2  match_2  player_1 venue B opponent B      1            NA            NA            1
3  match_4  player_1 venue C opponent C     11            NA             5            11
4  match_3  player_1 venue C opponent C      5            NA            NA            5
5  match_1  player_2 venue A opponent A     10            NA            NA            10
6  match_2  player_2 venue B opponent B      2            NA            NA            2
7  match_4  player_2 venue C opponent C      2            NA            7             2
8  match_3  player_2 venue C opponent C      7            NA            NA            7

Ideally, I am looking for up to the last 10 games (last 10, last 9, last 8, etc.), and as there are many opponents and many vanues, the resulting data frame will have quite a few columns.

Is there an easier way?

I also want to add a column at the end of each which has all of the values combined, for each group, like so (using the first example of games):

   match_id player_id   venue   opponent points last3_combined
1   match_4  player_1 venue C opponent C     11    1,6,11        
2   match_3  player_1 venue C opponent C      6    5,1,6         
3   match_2  player_1 venue B opponent B      1    NA,5,1        
4   match_1  player_1 venue A opponent A      5    NA,NA,5        
5   match_4  player_2 venue C opponent C      2    3,7,2         
6   match_3  player_2 venue C opponent C      7    10,3,7       
7   match_2  player_2 venue B opponent B      3    NA,10,3       
8   match_1  player_2 venue A opponent A     10    NA,NA,10        

Update:

Suppose I want to use another variable (in this case game_x, to group by, but this variable only has two values (1 and 0).

I tried the below:

match_id <- c("match_1", "match_1","match_1","match_2","match_2","match_2","match_3","match_3","match_3", "match_4","match_4","match_4")
player_id <- c("player_1", "player_2", "player_3", "player_1", "player_2", "player_3", "player_1", "player_2", "player_3","player_1", "player_2", "player_3")
venue <- c("venue A", "venue A", "venue A", "venue B", "venue B", "venue B", "venue C", "venue C", "venue C","venue C", "venue C", "venue C")
opponent <- c("opponent A", "opponent A", "opponent A", "opponent B", "opponent B", "opponent B", "opponent C", "opponent C", "opponent C","opponent C", "opponent C", "opponent C")
game_x <- c(1,1,0,1,0,1,1,0,0,1,0,0)
points <- c(5,10,15,1,2,3,5,7,9,11,2,6)

data <- data.frame(match_id, player_id, venue, opponent, game_x, points)

library(data.table)

setDT(data)

f <- function(x,n=3) lapply(n:1,function(i) x[i:(i+length(x)-1)])

data[
  order(player_id, game_x, -match_id),
  c("last3", "last2", "last1"):=f(points,3),
  by=.(player_id, game_x)][]

But it produces this:

    match_id player_id   venue   opponent game_x points last3 last2 last1
 1:  match_1  player_1 venue A opponent A      1      5    NA    NA     5
 2:  match_1  player_2 venue A opponent A      1     10    NA    NA    10
 3:  match_1  player_3 venue A opponent A      0     15    NA    NA    15
 4:  match_2  player_1 venue B opponent B      1      1    NA     5     1
 5:  match_2  player_2 venue B opponent B      0      2    NA    NA     2
 6:  match_2  player_3 venue B opponent B      1      3    NA    NA     3
 7:  match_3  player_1 venue C opponent C      1      5     5     1     5
 8:  match_3  player_2 venue C opponent C      0      7    NA     2     7
 9:  match_3  player_3 venue C opponent C      0      9    NA    15     9
10:  match_4  player_1 venue C opponent C      1     11     1     5    11
11:  match_4  player_2 venue C opponent C      0      2     2     7     2
12:  match_4  player_3 venue C opponent C      0      6    15     9     6

When it should look like this:

    match_id player_id   venue   opponent game_x points last3 last2 last1
 1:  match_4  player_1 venue C opponent C      1     11     1     5    11
 2:  match_3  player_1 venue C opponent C      1      5     5     1     5
 3:  match_2  player_1 venue B opponent B      1      1    NA     5     1
 4:  match_1  player_1 venue A opponent A      1      5    NA    NA     5
 5:  match_4  player_2 venue C opponent C      0      2    NA    NA    NA
 6:  match_3  player_2 venue C opponent C      0      7    10    NA    NA
 7:  match_2  player_2 venue B opponent B      0      2    NA    10    NA
 8:  match_1  player_2 venue A opponent A      1     10    NA    NA    10
 9:  match_4  player_3 venue C opponent C      0      6    NA    NA     3
10:  match_3  player_3 venue C opponent C      0      9    NA    3     NA
11:  match_2  player_3 venue B opponent B      1      3    NA    NA     3
12:  match_1  player_3 venue A opponent A      0     15    NA    NA    NA

What am I doing wrong?

Update 2:

I am attempting to skip the NA's where game_x = 0 (and replace all values where game_x = 0 with NA).

What appears to be happening now is that NA appears in the column for last3, last2, etc. even when game_x = 0. Usng the last 5 as an example:

match_id <- c(1,2,3,4,5)
player_id <- c("player_1", "player_1", "player_1", "player_1", "player_1")
venue <- c("venue A", "venue A", "venue B", "venue B", "venue B")
opponent <- c("opponent A", "opponent B", "opponent A", "opponent C", "opponent C")
game_x <- c(1,1,0,1,0)
points <- c(5,10,15,1,2)

data <- data.frame(match_id, player_id, venue, opponent, game_x, points)

library(data.table)

setDT(data)

f <- function(x,n=3,m=rep(TRUE,length(x))) {
  x[!m] <- NA
  lapply(n:1,function(i) x[i:(i+length(x)-1)])
}

data[
  order(-match_id),
  c("last5", "last4", "last3", "last2", "last1"):=f(points,5,game_x==1),
  by=.(player_id)][order(player_id,-match_id)][]

data[order(-match_id)]

Output:

   match_id player_id   venue   opponent game_x points last5 last4 last3 last2 last1
1:        5  player_1 venue B opponent C      0      2     5    10    NA     1    NA
2:        4  player_1 venue B opponent C      1      1    NA     5    10    NA     1
3:        3  player_1 venue B opponent A      0     15    NA    NA     5       10    NA
4:        2  player_1 venue A opponent B      1     10    NA    NA    NA     5    10
5:        1  player_1 venue A opponent A      1      5    NA    NA    NA    NA     5

Should look like:

   match_id player_id   venue   opponent game_x points last5 last4 last3 last2 last1
1:        5  player_1 venue B opponent C      0      2   NA   NA    NA     NA   NA
2:        4  player_1 venue B opponent C      1      1   NA   NA    5    10     1
3:        3  player_1 venue B opponent A      0     15   NA   NA    NA    NA    NA
4:        2  player_1 venue A opponent B      1     10   NA   NA    NA     5    10
5:        1  player_1 venue A opponent A      1      5   NA   NA    NA    NA     5


Solution 1:[1]

You might be able to do something like this:

  1. Set the data as data.table
setDT(data)
  1. Create a small function that returns a list of vectors, showing the sequential last n points, given a vector as input
f <- function(x,n=3) lapply(n:1,\(i) x[i:(i+length(x)-1)])
  1. Apply that function by the grouping vars of interest, remembering to order first. For example to get the prior points, just by player id, you can use f() like this:
data[
  order(-match_id),
  c("last3", "last2", "last1"):=f(points,3),
  by=player_id][]
  1. If you want to group by venue and opponent as well, do this:
data[
  order(-match_id),
  c("last3", "last2", "last1"):=f(points,3),
  by=.(player_id, venue,opponent)][]

Output (by player_id):

    match_id player_id   venue   opponent points last3 last2 last1
      <char>    <char>  <char>     <char>  <num> <num> <num> <num>
 1:  match_1  player_1 venue A opponent A      5    NA    NA     5
 2:  match_1  player_2 venue A opponent A     10    NA    NA    10
 3:  match_1  player_3 venue A opponent A     15    NA    NA    15
 4:  match_2  player_1 venue B opponent B      1    NA     5     1
 5:  match_2  player_2 venue B opponent B      2    NA    10     2
 6:  match_2  player_3 venue B opponent B      3    NA    15     3
 7:  match_3  player_1 venue C opponent C      5     5     1     5
 8:  match_3  player_2 venue C opponent C      7    10     2     7
 9:  match_3  player_3 venue C opponent C      9    15     3     9
10:  match_4  player_1 venue C opponent C     11     1     5    11
11:  match_4  player_2 venue C opponent C      2     2     7     2
12:  match_4  player_3 venue C opponent C      6     3     9     6

If you want the combined column, you can do this, assuming that you assign the result of the above to r1

r1[, combined:=paste(last3,last2,last1,sep = ","), by=1:nrow(r1)][]

Output:

    match_id player_id   venue   opponent points last3 last2 last1 combined
      <char>    <char>  <char>     <char>  <num> <num> <num> <num>   <char>
 1:  match_1  player_1 venue A opponent A      5    NA    NA     5  NA,NA,5
 2:  match_1  player_2 venue A opponent A     10    NA    NA    10 NA,NA,10
 3:  match_1  player_3 venue A opponent A     15    NA    NA    15 NA,NA,15
 4:  match_2  player_1 venue B opponent B      1    NA     5     1   NA,5,1
 5:  match_2  player_2 venue B opponent B      2    NA    10     2  NA,10,2
 6:  match_2  player_3 venue B opponent B      3    NA    15     3  NA,15,3
 7:  match_3  player_1 venue C opponent C      5     5     1     5    5,1,5
 8:  match_3  player_2 venue C opponent C      7    10     2     7   10,2,7
 9:  match_3  player_3 venue C opponent C      9    15     3     9   15,3,9
10:  match_4  player_1 venue C opponent C     11     1     5    11   1,5,11
11:  match_4  player_2 venue C opponent C      2     2     7     2    2,7,2
12:  match_4  player_3 venue C opponent C      6     3     9     6    3,9,6

Here is the minimal set of code required:

library(data.table)

setDT(data)

f <- function(x,n=3) lapply(n:1,\(i) x[i:(i+length(x)-1)])

data[order(-match_id),c("last3", "last2", "last1"):=f(points,3),by=player_id]
data[, combined:=paste(last3,last2,last1,sep = ","), by=1:nrow(data)]

Update -

The OP now wants to exclude some rows (skip over those rows) under certain conditions. If a mask can be passed to f(), which indicates which rows to include, then, we can adjust f() like this:

f <- function(x,n=3,m=rep(TRUE,length(x))) {
  x[!m] <- NA
  lapply(n:1,function(i) x[i:(i+length(x)-1)])
}

This example uses the above adjusted version of f() to skip over rows where game_x==0

data[
  order(-match_id),
  c("last3", "last2", "last1"):=f(points,3,game_x==1),
  by=.(player_id)][order(player_id,-match_id)][]

Another update!,

Now the OP wants to completely exclude game_x=0 rows.

rbind(
  data[game_x==0], 
  data[game_x==1][
  order(-match_id),
  c("last5", "last4", "last3", "last2", "last1"):=f(points,5),
  by=.(player_id)][order(player_id,-match_id)],
  fill=TRUE
)

Output:

   match_id player_id   venue   opponent game_x points last5 last4 last3 last2 last1
      <num>    <char>  <char>     <char>  <num>  <num> <num> <num> <num> <num> <num>
1:        3  player_1 venue B opponent A      0     15    NA    NA    NA    NA    NA
2:        5  player_1 venue B opponent C      0      2    NA    NA    NA    NA    NA
3:        4  player_1 venue B opponent C      1      1    NA    NA     5    10     1
4:        2  player_1 venue A opponent B      1     10    NA    NA    NA     5    10
5:        1  player_1 venue A opponent A      1      5    NA    NA    NA    NA     5

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1