Change the variable chunking of a NetCDF file with R
Regularly I face the same problem when using R to work with big NetCDF files (bigger than the computer's memory): there is no obvious way to change the chunking of the data. This is probably the only common NetCDF task that I cannot figure out how to do efficiently in R. I used to work around this problem with NCO or nccopy, depending on the situation. Even CDO has options to copy a .nc while changing the chunking, but it is much less flexible than the previous tools. I am wondering if there is any efficient way to do it in R.
The following example generates a toy .nc file whose variable is chunked as [100,100,1]:
library(ncdf4)

foo_nc_path <- paste0(tempdir(), "/thing.nc")
xvals <- 1:100
yvals <- 1:100
lon  <- ncdim_def("longitude", "Km_east", xvals)
lat  <- ncdim_def("latitude", "Km_north", yvals)
time <- ncdim_def("Time", "hours", 1:1000, unlim = TRUE)

# Chunk along the spatial dimensions: each chunk is one 100 x 100 time slice
var <- ncvar_def("foo_var", "nothing", list(lon, lat, time),
                 chunksizes = c(100, 100, 1),
                 longname = "xy chunked numbers", missval = -9)

foo_nc <- nc_create(foo_nc_path, list(var))
data <- array(runif(100 * 100 * 1000), dim = c(100, 100, 1000))
ncvar_put(foo_nc, var, data)
nc_close(foo_nc)
#### Check read speed
foo_nc <- nc_open(foo_nc_path)
# One full time step (a 100 x 100 map): fast with [100,100,1] chunks
system.time({timestep <- ncvar_get(foo_nc, "foo_var", start = c(1, 1, 1), count = c(-1, -1, 1))})
# One full time series (a single grid cell over 1000 steps): slow with these chunks
system.time({timeserie <- ncvar_get(foo_nc, "foo_var", start = c(1, 1, 1), count = c(1, 1, -1))})
As you can see, the read time is much larger for the time series (timeserie) than for the time step (timestep), and the difference grows rapidly with the size of the .nc file.
Does anybody know a way to change the chunking of a .nc file in R when the file is larger than the computer's memory?
Solution 1:[1]
It depends on your purpose. If you need to extract/analyze "map-wise" slices (i.e. on the lat-lon matrix), then keep the chunking strategy on the spatial coordinates. However, if you wish to run a time-wise analysis (such as extracting the time series of each grid cell to calculate trends), then my advice is to switch your chunking strategy to the time dimension.
Try re-running your code after replacing chunksizes=c(100,100,1) with something like, say, chunksizes=c(10,10,1000). The time-series read becomes much faster that way.
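For concreteness, here is a minimal sketch that rebuilds the question's toy file with time-oriented chunks and times the same single-cell read; the file name foo_rechunked.nc is just an illustrative choice, not part of the original example.
library(ncdf4)

# Same toy dimensions as in the question
lon  <- ncdim_def("longitude", "Km_east", 1:100)
lat  <- ncdim_def("latitude", "Km_north", 1:100)
time <- ncdim_def("Time", "hours", 1:1000, unlim = TRUE)

# Same variable, but chunked 10 x 10 x 1000 (time-oriented chunks)
var_t <- ncvar_def("foo_var", "nothing", list(lon, lat, time),
                   chunksizes = c(10, 10, 1000),
                   longname = "time chunked numbers", missval = -9)

foo_rechunked_path <- paste0(tempdir(), "/foo_rechunked.nc")  # illustrative path
nc_t <- nc_create(foo_rechunked_path, list(var_t))
ncvar_put(nc_t, var_t, array(runif(100 * 100 * 1000), dim = c(100, 100, 1000)))
nc_close(nc_t)

nc_t <- nc_open(foo_rechunked_path)
# The single-cell time series read should now be much faster
system.time(ncvar_get(nc_t, "foo_var", start = c(1, 1, 1), count = c(1, 1, -1)))
nc_close(nc_t)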
If your code is really slow in R, you can try a faster alternative such as nccopy or NCO.
You can re-chunk your NetCDF file using a simple nccopy command like this:
nccopy -c time/1000,lat/10,lon/10 input.nc output.chunked.nc
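If you prefer to stay inside an R session, one option is simply to shell out to nccopy with system(). This is only a sketch under the assumption that the netCDF command-line utilities are installed and on the PATH; it uses the toy file's actual dimension names (Time, latitude, longitude) instead of the generic time/lat/lon above, and thing_rechunked.nc is an illustrative output path.
in_nc  <- paste0(tempdir(), "/thing.nc")            # the toy file created in the question
out_nc <- paste0(tempdir(), "/thing_rechunked.nc")  # illustrative output path
# Re-chunk to 10 x 10 x 1000 without reading the whole dataset into R's memory
system(sprintf("nccopy -c Time/1000,latitude/10,longitude/10 %s %s", in_nc, out_nc))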
In NCO (which I recommend over nccopy for this operation), you could do something along the lines of:
ncks -O -4 -D 4 --cnk_plc g2d --cnk_dmn lat,10 --cnk_dmn lon,10 --cnk_dmn time,1000 in.nc out.nc
specifying --cnk_dmn for each dimension of interest together with its chunk size. More examples at http://nco.sourceforge.net/nco.html#Chunking.
Either way, you have to play around a little with the different chunk sizes in order to determine what works best for your specific case.
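One simple way to compare chunkings is to re-open the re-chunked output and time the same reads as in the question; the path below is the illustrative one from the nccopy sketch above.
library(ncdf4)
nc_chk <- nc_open(paste0(tempdir(), "/thing_rechunked.nc"))
# With time-oriented chunks the single-cell time series read should now be fast
system.time(ncvar_get(nc_chk, "foo_var", start = c(1, 1, 1), count = c(1, 1, -1)))
nc_close(nc_chk)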
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source
---|---
Solution 1 | Stack Overflow