June 22, 2020

You probably clicked this post because you’ve got a data set filled with NaNs and filling them with the mean unsettles you, or just because you want to learn more about the topic. In this post I will compare the most commonly used methods for missing data imputation with others that take advantage of clustering techniques.

There’s much untapped potential in Unsupervised Learning (UL). It is the art of inferring a function to describe hidden structure in “unlabeled” data. But what does finding structure in data mean in the first place? Let’s consider the following two examples:

**Blobs**: This is an easy one. Anyone who sees this image perceives that it is composed of three distinct families (clusters). If you are fairly familiar with statistics, you probably suspect that three underlying Gaussian distributions originated it. When a new sample arrives, one can infer which of the families it belongs to just by checking where it lands.

**Wavy hi:** This one is harder. There is clearly structure here, but how do I teach a computer to extract it? To understand the problem better, imagine gathering 1000 humans and asking them how many clusters they see in this picture. The most common answer would probably be 2, though answers like 3, 4 or even 1 would emerge!

So if not even humans would reach a consensus about the structure of this data, how can one teach a computer to make some sense of it? The problem here is that there is no universal definition of what a cluster is, or more generally, of what “structure” is. One can look at a certain aspect of daily life and find it structured or not, and this will change according to the environment or the person involved.

Many of the most famous UL algorithms, such as *Hierarchical clustering*, *K-Means*, *Gaussian mixture models* or *Hidden Markov models*, will arrive at different answers for the same problem, and in my modest opinion there is no better or more correct way of finding structure in general (Really? The No Free Lunch Theorem again?).

But enough of this philosophical talk, let’s get our hands dirty.

*K-Means*, from scikit-learn.
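A minimal sketch of what that looks like (my own illustration on synthetic blobs, not the original snippet):

```python
# Hard clustering with scikit-learn's KMeans on three synthetic Gaussian "blobs".
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated blobs in 2-D
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (0, 5, 10)])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = km.predict(X)            # hard assignment: exactly one cluster per sample
centroids = km.cluster_centers_   # one centroid per cluster
```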

*Fuzzy C-Means*, from scikit-fuzzy.

The clusters produced by the *K-Means* algorithm are usually called “hard”, since a sample either is or is not a member of a particular cluster. This variant of the algorithm is called “soft” or “fuzzy”: samples have a degree of membership to each cluster, and the cluster centroids are updated based on those memberships.
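scikit-fuzzy exposes this as `skfuzzy.cluster.cmeans`; to make the update concrete, here is a minimal, dependency-free NumPy sketch of my own (an illustration of the idea, not the library’s implementation):

```python
import numpy as np

def fuzzy_cmeans(X, c, m=2.0, n_iter=100, seed=0):
    """Minimal fuzzy c-means: returns centroids and soft memberships u (c x n)."""
    rng = np.random.default_rng(seed)
    u = rng.random((c, len(X)))
    u /= u.sum(axis=0)                                  # memberships sum to 1 per sample
    for _ in range(n_iter):
        um = u ** m                                     # "fuzzified" memberships
        centroids = um @ X / um.sum(axis=1, keepdims=True)
        # Distance of every sample to every centroid (small epsilon avoids 0-division)
        d = np.linalg.norm(X[None, :, :] - centroids[:, None, :], axis=2) + 1e-12
        u = d ** (-2.0 / (m - 1.0))                     # closer centroid -> higher membership
        u /= u.sum(axis=0)
    return centroids, u
```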

*Gaussian mixture models*, from scikit-learn.
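A sketch of the soft counterpart with scikit-learn’s `GaussianMixture` (again on synthetic blobs, not the original snippet):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (0, 5, 10)])

gm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)
resp = gm.predict_proba(X)   # soft memberships: one probability per mixture component
```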

And from gmm-mml. This package is based on the solution presented in “Unsupervised learning of finite mixture models”, which integrates estimation and model selection in a single algorithm.

This is a time-series data set with no missing values, therefore missing value imputation was performed artificially.

It is relatively small, with 20560 samples and 7 features, one of them being the occupancy variable (binary classification problem).

Also a time-series data set, belonging to a Kaggle competition that closed a couple of months ago.

After merging the training data with data from Russia’s macroeconomy and financial sector, one gets 30471 samples with 389 features, one of them being the price to predict (regression problem).

It has 93 columns with missing data, with some having a large ratio of NaNs (>90%).

Data set with 858 samples with 32 features and 4 target variables (different medical test binary outputs) which were converted to one target variable by taking the mode.

It has 26 features with missing data and some also with a large ratio of NaNs (>90%).

We started by removing from the train and test sets all the features that have missing values. Using the remaining features, we fit the clustering method on the train set and predict the cluster of each sample in both sets. For the removed columns, we calculate the mean (or mode, if categorical) of each feature grouped by cluster label. We now have the means of the incomplete features for each cluster.
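The grouped-mean step can be sketched with pandas and scikit-learn (a toy example with made-up column names, not the code used in the experiments):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Toy frame (hypothetical columns): "b" has NaNs, "a" and "c" are complete.
df = pd.DataFrame({"a": np.r_[np.zeros(50), np.ones(50)],
                   "c": np.r_[np.zeros(50), np.ones(50)],
                   "b": np.r_[np.full(50, 1.0), np.full(50, 3.0)]})
df.loc[::5, "b"] = np.nan                       # knock out every 5th value

complete = df[["a", "c"]]                       # features without missing values
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(complete)

cluster_means = df["b"].groupby(labels).mean()  # per-cluster mean of the NaN column
fill_values = pd.Series(cluster_means.loc[labels].to_numpy(), index=df.index)
filled = df["b"].fillna(fill_values)            # each NaN gets its cluster's mean
```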

By “normal filling” I mean that each sample is imputed with the mean/mode of the cluster it belongs to.

The weighted method uses the “degrees of belonging” of a sample to each cluster. For instance, in the *GMM* the degrees are the likelihoods of a sample belonging to each cluster, and in the *K-Means* approach they are based on the distance between the sample and the cluster centroids.
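As a small numeric illustration of the weighted variant (the membership values below are made up):

```python
import numpy as np

# Per-cluster means of the incomplete feature, and each sample's
# degree of membership to every cluster (rows sum to 1).
cluster_means = np.array([1.0, 3.0, 5.0])
memberships = np.array([[0.8, 0.1, 0.1],
                        [0.2, 0.2, 0.6]])

# The imputed value is the membership-weighted mean of the cluster means.
weighted_fill = memberships @ cluster_means
```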

Little feature engineering was done to the data sets besides normalization.

For the time-series data sets, the timestamps were ordered and converted to the number of seconds (Occupancy Detection) or the number of days (Housing Market) elapsed since the first sample.
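For example, the seconds conversion could look like this (toy timestamps, not the original data):

```python
import pandas as pd

# Order the timestamps, then measure elapsed seconds since the first sample
ts = pd.to_datetime(pd.Series(["2015-02-04 17:55:00",
                               "2015-02-04 17:51:00",
                               "2015-02-04 17:52:00"]))
ts = ts.sort_values().reset_index(drop=True)
seconds = (ts - ts.iloc[0]).dt.total_seconds()
```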

After the imputation method is performed, it is scored on the test set using XGBoost. *Negative Log Loss* and *Mean Squared Log Error* were chosen as score metrics.
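As a toy illustration of the two metrics (using scikit-learn’s implementations, with made-up predictions):

```python
from sklearn.metrics import log_loss, mean_squared_log_error

# Negative log loss for the classification problem (higher is better)
nll = -log_loss([0, 1, 1], [0.1, 0.8, 0.9])

# Mean squared log error for the regression problem (lower is better)
msle = mean_squared_log_error([3.0, 5.0], [2.5, 5.0])
```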

Initially the “elbow” or “knee” approach was considered. This approach consists of plotting the scores of the clustering methods over a range of numbers of clusters and looking for the elbow.

For instance, in the previous plot the elbow is in the range of 8 to 12. The drawback is that this method needs human intervention to choose the elbow, and an automatic way is desired. Efforts to choose the elbow automatically were not successful, so a new approach was considered.

By using cross validation, a relatively effective though computationally expensive method was obtained. How does it work? First pick a classifier; then, for a range of numbers of centroids, do the unsupervised imputation and evaluate it with a *K-Fold cross validation* using that classifier. Finally, pick the number of centroids that performed best in the cross validation.
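A sketch of that selection loop, with stand-ins for brevity (the cluster label is simply appended as a feature instead of running the full imputation, and LogisticRegression replaces the actual classifier):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(60, 2)) for c in (0, 4, 8)])
y = np.repeat([0, 1, 0], 60)                   # toy classification target

def features_with_k(X, k):
    # Placeholder for the clustering-based imputation step.
    lab = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    return np.column_stack([X, lab])

# Evaluate each candidate number of centroids with K-Fold cross validation
scores = {k: cross_val_score(LogisticRegression(max_iter=1000),
                             features_with_k(X, k), y, cv=5).mean()
          for k in range(2, 7)}
best_k = max(scores, key=scores.get)           # centroid count with the best CV score
```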

In the bar plots the **red line** sets the score obtained by mean imputation for easier comparison.

As previously said, this data set has no missing data, so imputation was performed artificially.

The features and samples for which the missing data is going to be imputed need to be chosen. The features are chosen explicitly; for the samples, a probability is set: if the probability is 0.5, there is a 50% chance that a sample’s value is discarded. Since the samples chosen to receive NaN values vary each time, the scores of three different runs were averaged.
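The artificial discarding step might look like this (a 1-D sketch with probability 0.5):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)                 # a complete feature
p = 0.5                                   # probability that a sample's value is discarded

mask = rng.random(x.shape) < p            # which values to knock out
x_missing = np.where(mask, np.nan, x)     # artificially inserted NaNs
```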

For this data set, an idea was tested to study how to do clustering imputation on data sets so big that they do not fit in memory. Instead of using the whole data set for the imputation computations, one randomly draws a number of samples capable of fitting in memory (for this test I used 5000).

By taking a random sample of the whole data set, the time structure can still be partially kept. The larger the number of samples, the more accurate it will be.
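A sketch of the fit-on-a-subsample idea (random stand-in data, 5000 samples as in the test):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(20560, 7))           # stand-in for the full data set

# Fit the clustering on a random subsample that fits in memory...
sample = X[rng.choice(len(X), size=5000, replace=False)]
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(sample)

# ...then predict cluster labels for the whole data set.
labels = km.predict(X)
```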

Based on the results, filling methods built on clustering tend to perform better than the usual methods.

For the occupancy detection data set the best performing algorithm was the one implemented in the *gmm-mml* package, while for the housing market and cervical cancer data sets it was *K-Means*. The *gmm-mml* method was not tested on the housing market data set because of the difficulty of calculating the co-variance matrices when the number of features is large compared with the number of samples.

When increasing the amount of missing data in the occupancy detection data set, one notices that, overall, the unsupervised filling methods tend to perform better than mean filling. Thus, for features with a high percentage of missing values, finding structure in the data to help with the filling represents an advantage.

One can notice that as the number of features with missing data increases from 2 to 4 in the occupancy detection data set, and the number of features used for clustering decreases, the unsupervised filling methods tend to perform a bit better compared with mean filling, which is counter-intuitive and may be a characteristic of this particular data set and the chosen features.

Also, as the percentage of missing data increases, the scores move further from the baseline score, which is not surprising.

Among the three implementations of *K-Means*, the vanilla one performs better than the others. It is also the one that does the fewest computations per iteration, making it the best choice.

Methods based on *GMM* tend to outperform the *K-Means* ones, which makes sense since *K-Means* is a heuristic for *GMM* that works with Euclidean distances. Euclidean distance is a good metric in low dimensions, but loses meaning in high-dimensional spaces (for more info consult this link). *GMM* is based on the likelihood of samples belonging to the PDFs, which is a better metric in higher dimensions.

Although there is no obvious winner for which clustering based imputation algorithm should be chosen on top of the others, my advice is to try one based on *GMM*.

For the right number of mixtures it is better to use the cross validation technique, though if that is too computationally expensive, metrics like the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) can be used: one tests a range of numbers of mixtures and chooses the one that minimizes the criterion.
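For example, with scikit-learn one can scan a range of component counts and compare BIC values (synthetic two-blob data, for illustration):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(150, 2)) for c in (0, 5)])

# Fit a GMM for each candidate number of mixtures and record its BIC
bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in range(1, 6)}
best_k = min(bics, key=bics.get)   # pick the count that minimizes the criterion
```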

There are different ways of calculating co-variance matrices. Here are the two most relevant:

- **Diagonal:** Each component has its own diagonal co-variance matrix. This works as a heuristic.
- **Full:** The one used for the tests. Each component has its own general co-variance matrix.

With a high number of features in a data set, one may not have enough samples to correctly calculate the co-variance matrices for the *GMM*, or it may just take too much time in “full” mode; therefore the literature advises the use of a diagonal co-variance, which gives a good compromise between quality and model size.
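With scikit-learn this is just the `covariance_type` parameter; note how the diagonal model stores only one variance per feature per component (random stand-in data):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))            # many features relative to samples

# "diag" constrains each component's co-variance to a diagonal matrix,
# needing far fewer parameters than "full" in high dimensions.
gm = GaussianMixture(n_components=3, covariance_type="diag",
                     random_state=0).fit(X)
```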

If the data set is too big to fit in memory, one should get a random sample of the train set to perform the clustering.

Mean imputation is not far behind the clustering based imputation methods in performance, therefore it is a method still to be considered.

A new way of arranging the data set could be explored. Instead of discarding the features with missing data, one could fill them with the mean and mode values and then perform clustering on this modified data set. After each sample has been labeled, filling can be performed.

More criteria to get the correct number of clusters or mixtures can be studied. Methods like *NEC* and *ICL* are described as good options in this book (chapter 6).

There are still many unsupervised techniques to be studied and tested (for instance, Hierarchical Clustering with different types of distance metrics), though a generally better one most probably does not exist, each case being one of its own (no free lunch theorem).

*Originally published in *medium.com/jungle-book* on October, 2017.*
