'Is there any way to improve performance (e.g. vectorize) this look-up and recoding problem implemented by a for loop?

I need to make recodings to data sets of the following form.

# List elements of varying length
set.seed(12345)
n  = 1e3
m  = sample(2:5, n, T)
V = list()
for(i in 1:n) {
  for(j in 1:m[i])  
    if(j ==1) V[[i]] = 0 else V[[i]][j] = V[[i]][j-1] + rexp(1, 1/10) 
  }

As an example consider

[1]  0.00000 23.23549 30.10976

Each list element contains a ascending vector of length m, each starting with 0 and ending somewhere in positive real numbers.

Now, consider a value s, where s is smaller than the maximum v_m of each V[[i]]. Also let v_m_ denote the m-1-th element of V[[i]]. Our goal is to find all elements of V[[i]] that are bounded by v_m_ - s and v_m - s. In the example above, if s=5, the desired vector v would be 23.23549. v can contain more elements if the interval encloses more values. As an example consider:

> V[[1]]
[1]  0.000000  2.214964  8.455576 10.188048 26.170458

If we now let s=16, the resulting vector is now 0 2.214964 8.455576, so that it has length 3. The code below implements this procedure using a for loop. It returns v in a list for all n. Note that I also attach the (upper/lower) bound before/afterv, if the bound lead to a reduction in length of v (in other words, if the bound has a positive value).

This loop is too slow in my application because n is large and the procedure is part of a larger algorithm that has to be run many times with some parameters changing. Is there a way to obtain the result faster than with a for loop, for example using vectorization? I know lapply in general is not faster than for.

# Series maximum and one below maximum
v_m  = sapply(V, function(x) x[length(x)])
v_m_ = sapply(V, function(x) x[length(x)-1])

# Set some offsets s
s  = runif(n,0,v_m)

# Procedure
d1  = (v_m_ - s)
d2  = (v_m - s)
if(sum(d2 < 0) > 0) stop('s cannot be larger than series maximum.')

# For loop - can this be done faster?
W = list()
for(i in 1:n){
v = V[[i]]
l = length(v)
v = v[v > d1[i]]
if(l > length(v)) v = c(d1[i], v)
l = length(v)
v = v[v < d2[i]]
if(l > length(v)) v = c(v, d2[i])
W[[i]] = v
}


Solution 1:[1]

I guess you can try mapply like below

V <- lapply(m, function(i) c(0, cumsum(rexp(i - 1, 1 / 10))))
v <- sapply(V, tail, 2)
s <- runif(n, 0, v[1, ])
if (sum(v[2, ] < 0) > 0) stop("s cannot be larger than series maximum.")
W <- mapply(
    function(x, lb, ub) c(pmax(lb,0), x[x >= lb & x <= ub], pmin(ub,max(x))),
    V,
    v[1,]-s,
    v[2,]-s
)

Solution 2:[2]

I don't think vectorization will be an option since the operation goes from a list of unequal-length vectors to another list of unequal-length vectors.

For example, the following vectorizes all the comparisons, but the unlist/relist operations are too expensive (not to mention the final lapply(..., unique)). Stick with the for loop.

W <- lapply(
  relist(
    pmax(
      pmin(
        unlist(V),
        rep.int(d2, lengths(V))
      ),
      rep.int(d1, lengths(V))
    ),
    V
  ),
  unique
)

I see two things that will give modest gains in speed. First, if s is always greater than 0, your final if statement will always evaluate to TRUE, so it can be skipped, simplifying some of the code. Second is to pre-allocate W. These are both implemented in fRecode2 below. A third thing that gives a slight gain is to avoid multiple reassignments to v. This is implemented in fRecode3 below.

For additional speed, move to Rcpp--it will allow the vectors in W to be built via a single pass through each vector element in V instead of two.

set.seed(12345)
n <- 1e3
m <- sample(2:5, n, T)
V <- lapply(m, function(i) c(0, cumsum(rexp(i - 1, 1 / 10))))
v_m <- sapply(V, function(x) x[length(x)])
v_m_ <- sapply(V, function(x) x[length(x)-1])
s <- runif(n,0,v_m)
d1 <- (v_m_ - s)
d2 <- (v_m - s)
if(sum(d2 < 0) > 0) stop('s cannot be larger than series maximum.')

fRecode1 <- function() {
  # original function
  W = list()
  for(i in 1:n){
    v = V[[i]]
    l = length(v)
    v = v[v > d1[i]]
    if(l > length(v)) v = c(d1[i], v)
    l = length(v)
    v = v[v < d2[i]]
    if(l > length(v)) v = c(v, d2[i])
    W[[i]] = v
  }
  W
}

fRecode2 <- function() {
  W <- vector("list", length(V))
  i <- 0L
  for(v in V){
    l <- length(v)
    v <- v[v > d1[i <- i + 1L]]
    if (l > length(v)) v <- c(d1[i], v)
    W[[i]] <- c(v[v < d2[i]], d2[[i]])
  }
  W
}

fRecode3 <- function() {
  W <- vector("list", length(V))
  i <- 0L
  for(v in V){
    idx1 <- sum(v <= d1[i <- i + 1L]) + 1L
    idx2 <- sum(v < d2[i])
    if (idx1 > 1L) {
      if (idx2 >= idx1) {
        W[[i]] <- c(d1[i], v[idx1:idx2], d2[i])
      } else {
        W[[i]] <- c(d1[i], d2[i])
      }
    } else {
      W[[i]] <- c(v[1:idx2], d2[i])
    }
  }
  W
}

microbenchmark::microbenchmark(fRecode1 = fRecode1(),
                               fRecode2 = fRecode2(),
                               fRecode3 = fRecode3(),
                               times = 1e3,
                               check = "equal")
#> Unit: milliseconds
#>      expr    min      lq     mean  median      uq     max neval
#>  fRecode1 2.0210 2.20405 2.731124 2.39785 2.80075 12.7946  1000
#>  fRecode2 1.2829 1.43315 1.917761 1.54715 1.88495 51.8183  1000
#>  fRecode3 1.2710 1.38920 1.741597 1.45640 1.76225  5.4515  1000

Not a huge speed boost: fRecode3 shaves just under a microsecond on average for each vector in V.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 jblood94