This blog provides guidance on using a recent deep generative model, developed by Microsoft researchers in Cambridge, for missing-value imputation, and shares best practices for imputing missing values in multivariate time-series datasets.
Unlike the curated datasets available in research, data in the wild is messy and, for various reasons, often incomplete. Collection infrastructure (sensors) fails, files (sometimes paper records) get corrupted, and respondents do not answer every question. Some data is sparse by nature; a clinical examination may cover only a fraction of the possible questions. These circumstances call for techniques that restore the data to a form usable by conventional machine learning models. The good news is that the missing-value imputation package developed by Microsoft Research, Cambridge UK, automates this by building a deep learning model of the missing data.
To model "missingness," one needs to ask, "why is the data missing?" Consider the case of "Missing Completely At Random" (MCAR), where the fact that a datum is missing does not depend on, and hence is not predictable from, the rest of the data. It is as if darts were thrown at random to knock out some entries. Of course, we still believe the missing values themselves are predictable from the rest of the data; that is why imputation makes sense. Imputation solves this problem by learning a model of the missing values from the observed data.
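As a minimal sketch of that dart-throwing picture (NumPy, with an arbitrary masking rate; the matrix stands in for any complete dataset):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
X = rng.normal(size=(1000, 5))          # stand-in for a complete dataset

# MCAR: every entry is hidden with the same probability,
# independent of both the observed and the missing values.
mcar_mask = rng.random(X.shape) < 0.05  # ~5% of entries removed at random
X_missing = np.where(mcar_mask, np.nan, X)
```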
In other cases, the fact that the data is missing is meaningful and can be considered a separate binary-valued variable. See [Koller, 2009] Chapter 19 for a complete discussion. But back to MCAR.
A common practice for tabular data is to fill the gaps in each column with zero, or with the mean or median of that column. In time-series datasets, repeating the last (or next) observed value, or interpolating forward, backward, or in both directions, is also common. These algorithms, along with more advanced methods such as MICE, are suitable in a handful of cases, especially for missing-completely-at-random (MCAR) and missing-at-random (MAR) patterns.
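These baselines map to one-liners in pandas; a minimal sketch (the toy column is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"sensor_a": [1.0, None, 3.0, None, 5.0]})

df_zero = df.fillna(0)                    # constant (zero) fill
df_mean = df.fillna(df.mean())            # column-mean fill
df_ffill = df.ffill()                     # repeat the last seen value forward
df_bfill = df.bfill()                     # repeat the next seen value backward
df_lin = df.interpolate(method="linear")  # linear interpolation between neighbors
```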
Microsoft's research team in Cambridge developed a technology based on the partial VAE algorithm, enabling missing-value prediction using probabilistic deep learning. The code is open source as part of the Data-Efficient Decision-Making project at this link.
The team also developed an easy-to-use API, which is currently in private preview. If you are interested in evaluating the API for your scenario, please email the team at azua-request@microsoft.com. The API is easy to use, which speeds up the process; it is scalable; and it reduces the need for deep domain expertise. It also works with different types of data (e.g., continuous and categorical) and can handle different missingness patterns.
Multivariate time-series data consist of multiple concurrent time-dependent variables, where each variable depends not only on its own past values but also on the present and past values of the other variables. We need to model this temporal aspect explicitly as predictive features. Comparing EDDI with linear imputation for multivariate time series, EDDI is a great choice when our use case meets the following conditions:
Soft sensor modeling is an interesting multivariate time-series use case that aims to mathematically model the behavior of a physical sensor network. A solution template for soft-sensor modeling on Azure is discussed in this blog post. In this section, we add missingness to that scenario and use EDDI to do the imputation. The dataset originates from a sulfur recovery unit (SRU) of a refinery plant in Italy [paper]. You can find the complete explanation of the use case and dataset in this post and download the datasets from this link. The values are per-minute samples captured from five sensors.
The dataset we are working with is clean, with no missing parts. We intentionally selected it so that we have a clear ground truth for our experiments and evaluation. The following two scenarios are likely in a real sensor environment:
The missingness type in the above scenarios is typically MCAR or MAR, which justifies using EDDI. There are other missingness scenarios that we do not discuss here, for example, when values are not saved due to compression, or when sensor readings are not aligned or have different sampling rates. EDDI helps us impute the missing entries and obtain a complete dataset for the downstream task.
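To create a ground truth for evaluation, we can inject both kinds of missingness into the clean data ourselves. A minimal sketch follows; the synthetic frame and column names stand in for the real SRU table, and the rates mirror the experiments discussed below:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Stand-in for the clean per-minute SRU readings (five sensors);
# in practice, load the downloaded dataset here instead.
df = pd.DataFrame(rng.normal(size=(10_000, 5)),
                  columns=[f"sensor_{i}" for i in range(1, 6)])
df_missing = df.copy()

# Scenario 1 (MCAR): isolated readings dropped at random across all sensors
point_mask = rng.random(df.shape) < 0.007             # ~0.7% of all entries
df_missing = df_missing.mask(point_mask)

# Scenario 2: one sensor goes offline for a stretch, leaving a contiguous gap
start = int(rng.integers(0, len(df) - 2500))
df_missing.iloc[start:start + 2500, 0] = np.nan       # 2500-sample outage
```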
The EDDI API is similar to the next-best-question API discussed in this blog post before. An imputation task involves only the first two steps:
In this repository, we have shared the code to use the EDDI API in two ways:
[GitHub Repo: Softsensor_MVP_with_EDDI]: This code repository walks you through the data-preparation, training, and batch-inference steps for using the EDDI API in the soft sensor modeling showcase.
If you decide to use EDDI for multivariate time-series missing-value prediction (MVP) to prepare filled-in data for a subsequent prediction task, here is a list of best practices:
Using the above settings to impute our random 0.7% missingness, one can see that EDDI performs better than linear imputation [note: lower MAPE is better].
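For reference, this is how the comparison metric can be computed on the deliberately hidden entries. The numbers below are toy stand-ins for illustration only, not the experiment's results:

```python
import numpy as np

def mape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean absolute percentage error over the deliberately hidden entries."""
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

# toy stand-ins: ground truth and two candidate imputations of the hidden entries
truth = np.array([10.0, 12.0, 11.5, 9.8])
eddi_imputed = np.array([10.2, 11.7, 11.6, 9.9])
linear_imputed = np.array([10.9, 12.8, 10.6, 9.1])

print(f"EDDI   MAPE: {mape(truth, eddi_imputed):.2f}%")
print(f"Linear MAPE: {mape(truth, linear_imputed):.2f}%")
```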
In-depth insight: in the above example, we use the immediately adjacent neighbors because of the fine-grained dependency between sensor values. We could choose a larger window size, i.e., X_{t-k}, …, X_{t+k}, if the temporal dependencies are expected to extend further. Also, if we only need to capture coarser dependencies, we could use more distant neighbors alone, e.g., X_{t-5} and X_{t+5}. Mutual information can give an initial insight into how predictable the features are.
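One way to expose those temporal neighbors to a tabular imputer is to append shifted copies of each series as extra feature columns. A minimal sketch; the helper name and toy column are ours, and the window size k is a tuning choice:

```python
import numpy as np
import pandas as pd

def add_temporal_neighbors(df: pd.DataFrame, k: int = 1) -> pd.DataFrame:
    """Append X_{t-j} and X_{t+j} (j = 1..k) for every column as extra features."""
    out = df.copy()
    for col in df.columns:
        for j in range(1, k + 1):
            out[f"{col}_t-{j}"] = df[col].shift(j)    # past neighbor
            out[f"{col}_t+{j}"] = df[col].shift(-j)   # future neighbor
    return out

# immediate neighbors only (k=1), matching the setting in the example above
df = pd.DataFrame({"sensor_1": [1.0, 2.0, np.nan, 4.0, 5.0]})
features = add_temporal_neighbors(df, k=1)
```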
In-depth insight: here, the 2500-element chunk length is chosen intentionally to showcase EDDI. The relative performance of EDDI vs. linear imputation depends on the signal shape and on how well the other features predict the values inside the missing chunk. Linear imputation performs well for signals that depend linearly on the last and next observed values, which is more likely for smaller chunks, while better cross-feature predictability increases EDDI's advantage but is harder to know in advance. Domain knowledge is a key component here.
Machine learning with missing values is an old challenge, and EDDI is a novel deep learning-based solution for missing-value imputation on multivariate datasets. However, imputing a multivariate time-series dataset requires some tweaks to take advantage of both the temporal and the multivariate signals, which we discussed in this post. Note that one imputation solution does not fit all missing-data problems! 😊 For example, if predictability among variables is very limited, or there is too much noise in the data, simpler imputation solutions may work better. EDDI works well when the missing values co-occur with observed ones that carry predictive signal.
GitHub Repository: Softsensor_MVP_with_EDDI