Parallelize st_union from R's sf package

I have some large shapefiles with millions of polygons that I need to dissolve. Depending on the shapefile, I either need to dissolve by group or run st_union over everything. I have been using the st_par function and it has worked well for most sf operations. However, when I use it with st_union it returns a list, and I cannot figure out how to parallelize the sf dissolve function st_union.

Any suggestions would be most helpful! Here is a small code snippet to illustrate my point.

library(sf)
library(dplyr)  # for the pipe and filter() used below
library(assertthat)
library(parallel)

us_shp <- "data/cb_2016_us_state_20m/cb_2016_us_state_20m.shp"
if (!file.exists(us_shp)) {
  loc <- "https://www2.census.gov/geo/tiger/GENZ2016/shp/cb_2016_us_state_20m.zip"
  dest <- paste0("data/cb_2016_us_state_20m", ".zip")
  download.file(loc, dest)
  unzip(dest, exdir = "data/cb_2016_us_state_20m")
  unlink(dest)
  assert_that(file.exists(us_shp))
}

usa <- st_read("data/cb_2016_us_state_20m/cb_2016_us_state_20m.shp", quiet = TRUE) %>%
  filter(!(STUSPS %in% c("AK", "HI", "PR")))

test <- usa %>%
  st_par(., st_union, n_cores = 2)


Solution 1:[1]

I think you can solve your specific problem with a small modification of the original st_par function.
However, this is a quick and blunt fix, and it might break the code for other uses of the function.
The author of the function could certainly provide a better fix...

library(parallel)
library(dplyr)  # provides the pipe used below

# Parallelise any simple features analysis.
st_par <- function(sf_df, sf_func, n_cores, ...){

    # Create a vector to split the data set up by.
    split_vector <- rep(1:n_cores, each = nrow(sf_df) / n_cores, length.out = nrow(sf_df))

    # Perform the GIS analysis on each chunk in parallel,
    # forwarding any extra arguments on to sf_func.
    split_results <- split(sf_df, split_vector) %>%
        mclapply(function(x) sf_func(x, ...), mc.cores = n_cores)

    # Combine results back together. The method depends on what sf_func
    # returns: geometry sets (e.g. the sfc produced by st_union) are
    # concatenated with c(); plain data frames are rbind-ed.
    if ( length(class(split_results[[1]])) > 1 | class(split_results[[1]])[1] == 'list' ){
        result <- do.call("c", split_results)
        names(result) <- NULL
    } else {
        result <- do.call("rbind", split_results)
    }

    # Return result
    return(result)
}
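
As a minimal usage sketch, assuming the `usa` object from the question is loaded and the modified st_par() above has been sourced: each worker dissolves only its own chunk, so a final st_union() over the combined chunk geometries is still needed to obtain a single dissolved polygon.

# Minimal sketch: `usa` comes from the question's snippet. Each parallel
# worker returns the union of its own chunk, so we union those chunk
# geometries once more at the end.
library(sf)

chunk_unions <- st_par(usa, st_union, n_cores = 2)  # sfc: one geometry per chunk
usa_dissolved <- st_union(chunk_unions)             # dissolve chunks into one polygon
plot(usa_dissolved)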

Solution 2:[2]

I was trying to use this for st_join and was running into problems with the returned data type. Looking at the result more closely, it became evident that split_results was just a list of sf objects. I ended up modifying the code to use dplyr::bind_rows() to get what I wanted.

There probably needs to be some more logic around the "combine" step to deal with different return types, but this works for the st_join function.

library(parallel)
library(dplyr)  # provides the pipe and bind_rows()

# Parallelise any simple features analysis.
st_par <- function(sf_df, sf_func, n_cores, ...) {

  # Create a vector to split the data set up by.
  split_vector <- rep(1:n_cores, each = nrow(sf_df) / n_cores, length.out = nrow(sf_df))

  # Perform GIS analysis
  split_results <- split(sf_df, split_vector) %>%
    mclapply(function(x) sf_func(x, ...), mc.cores = n_cores)

  # Combine results back together. The method of combining probably depends
  # on the output from the function; for st_join it is a list of sf objects,
  # which satisfies my needs for reverse geocoding.
  result <- dplyr::bind_rows(split_results)

  # Return result
  return(result)
}
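
For illustration, here is a hedged sketch of calling this version for a spatial join. The `pts` and `polys` objects below are made-up placeholder data, not anything from the question; arguments after n_cores (here y) are simply forwarded to st_join() through `...`. Note that mclapply() relies on forking, so mc.cores > 1 will not run in parallel on Windows.

library(sf)
library(dplyr)
library(parallel)

# Placeholder data (hypothetical, not from the question): random points and
# two unit-square polygons to join them against.
pts <- st_as_sf(data.frame(id = 1:100,
                           x  = runif(100, 0, 2),
                           y  = runif(100, 0, 1)),
                coords = c("x", "y"))
polys <- st_sf(zone = c("left", "right"),
               geometry = st_sfc(
                 st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0)))),
                 st_polygon(list(rbind(c(1, 0), c(2, 0), c(2, 1), c(1, 1), c(1, 0))))))

# Arguments after n_cores are passed through `...` to st_join().
joined <- st_par(pts, st_join, n_cores = 2, y = polys)
head(joined)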

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

[1] Solution 1: Gilles
[2] Solution 2: Mike Lavender