Impute missing data in multivariate time series
I have a problem where I have to predict the sales of 4000 products in 3 months for a certain store. Across the 4000 time series there are many null values, and in particular many long consecutive gaps, for example a product with 3 months in a row of null values. Can someone suggest a technique to impute these values? I have seen the R package mtsdi, but I do not understand how it works; if someone has an example I would appreciate it.
Another method I had thought of was interpolation, but with so many long consecutive runs of nulls I don't know if it will work well.
Solution 1:[1]
Is there a correlation between the various products? If the products are not really correlated, you can use univariate time series imputation algorithms, which only look at inter-time correlations within a single series (e.g. the interpolation you suggested, but there are also more advanced ones that account for seasonality).
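One quick way to answer the correlation question is to compute pairwise correlations on the observed values. The sketch below uses made-up data (the `sales` data frame and its `prod_a`, `prod_b`, `prod_c` columns are purely illustrative, not from the question) and base R's `cor()` with `use = "pairwise.complete.obs"`, so that rows with NAs are dropped per pair of series rather than for the whole table.

```r
# Toy data (illustrative only): 3 hypothetical products over 24 months
set.seed(1)
base <- sin(seq(0, 4 * pi, length.out = 24))
sales <- data.frame(
  prod_a = base + rnorm(24, sd = 0.1),      # follows the shared pattern
  prod_b = 2 * base + rnorm(24, sd = 0.1),  # follows the shared pattern
  prod_c = rnorm(24)                        # unrelated noise
)

# Introduce some NA gaps, as in the question
sales$prod_a[5:7] <- NA
sales$prod_b[15:16] <- NA

# Pairwise correlations, using only the rows where both series are observed
cor_mat <- cor(sales, use = "pairwise.complete.obs")
print(round(cor_mat, 2))
```

Large off-diagonal values suggest that other products carry information about the missing stretches. With 4000 products the full matrix is big, so it may be more practical to check within product groups.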
For univariate imputation, the imputeTS package offers multiple functions (for example na_seadec(), which does seasonally decomposed imputation). Here is a nice intro with all functions.
library("imputeTS")
# yourData: a numeric vector or ts object containing NAs
imputed <- na_seadec(yourData)
The code is as simple as this example.
But as you correctly assume, the longer the NA gap is, the harder it gets to produce reasonable imputations just by looking at the inter-time correlations of one variable. 3 months of continuous NAs is a lot.
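To get a feel for how a long gap affects univariate methods, you can blank out a stretch of a series whose true values you know and compare the imputations against them. This is only a rough sketch: it uses the built-in `AirPassengers` monthly series as a stand-in for one product, an arbitrarily chosen 3-month gap, and the `na_interpolation()` and `na_seadec()` functions from imputeTS.

```r
library(imputeTS)

# Stand-in for one product: built-in monthly series (frequency = 12)
truth <- AirPassengers

# Blank out three consecutive months (positions chosen arbitrarily)
gap <- 50:52
x <- truth
x[gap] <- NA

# Plain linear interpolation across the gap
lin <- na_interpolation(x, option = "linear")

# Interpolation on the seasonally adjusted series, seasonality added back
sea <- na_seadec(x, algorithm = "interpolation")

# Mean absolute error of each method on the artificially removed values
mae_lin <- mean(abs(lin[gap] - truth[gap]))
mae_sea <- mean(abs(sea[gap] - truth[gap]))
print(c(linear = mae_lin, seadec = mae_sea))
```

Repeating this with longer gaps on your own series gives a rough idea of when univariate imputation stops being good enough.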
So if there is some correlation between your products/variables, it makes sense to actually use it.
To use mtsdi, just look at the example provided in the package documentation:
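library(mtsdi)
# example dataset with missing values that ships with the package
data(miss)
# one-sided formula listing the columns to impute
f <- ~c31+c32+c33+c34+c35
# impute using spline smoothing, with 7 degrees of freedom per column
i <- mnimput(formula = f, dataset = miss, eps = 1e-3, ts = TRUE, method = "spline", sp.control = list(df=c(7,7,7,7,7)))
summary(i)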
These are the parameters used in the example, as described in the documentation:
> formula - formula indicating the missing data frame, for instance, ~X1+X2+X3+...+Xp
> dataset - data with missing values to be imputed
> eps - stop criterion
> ts - logical. TRUE if the data is a time series
> method - method for univariate time series filtering. It may be smooth, gam or arima. See Details
> sp.control - list for Spline smooth control. See Details
But I am not too convinced by the mtsdi package and its methodology. It does not seem very mature to me.
If there is an inter-variable correlation, I would rather use the mice package. It is meant for cross-sectional data, but you can partially model the time aspects as additional variables (e.g. by adding lag and lead variables).
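A minimal sketch of that idea, under several assumptions: the data frame `df` and its columns `prod_a` and `prod_b` are invented for illustration, the lag and lead columns are built by hand with base R, and predictive mean matching (`method = "pmm"`) is just one of the methods mice offers, not a recommendation specific to your data.

```r
library(mice)

# Toy data (illustrative only): two correlated monthly series
set.seed(42)
n <- 60
base <- sin(2 * pi * (1:n) / 12)
df <- data.frame(
  prod_a = 10 + 3 * base + rnorm(n, sd = 0.5),
  prod_b = 20 + 5 * base + rnorm(n, sd = 0.5)
)

# A 3-month gap in one product
df$prod_a[20:22] <- NA

# Encode the time aspect as extra columns: lag and lead of the other series
df$prod_b_lag1  <- c(NA, head(df$prod_b, -1))
df$prod_b_lead1 <- c(tail(df$prod_b, -1), NA)

# Multiple imputation with predictive mean matching (5 imputed datasets)
imp <- mice(df, m = 5, method = "pmm", seed = 1, printFlag = FALSE)

# Extract the first completed dataset
completed <- complete(imp, 1)
print(completed$prod_a[20:22])
```

mice produces several completed datasets (here `m = 5`), so you can either pick one, as above, or fit your forecasting model on each and pool the results.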
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Steffen Moritz |