Generative Adversarial Networks (GANs) have taken the machine learning community by storm. Their elegant theoretical foundations and the steadily improving results in the computer vision domain have made them one of the most active research topics in machine learning in recent years. In fact, Yann LeCun, director of Facebook AI Research, said in 2016 that GANs, “and the variations that are now being proposed is the most interesting idea in the last 10 years in ML, in my opinion”. To get an idea of how much this topic is being explored right now, visit this great blog post.
Although they have been shown to work wonderfully as generative models for images, including pictures of faces and bedrooms, GANs haven’t really been tested extensively on data sets such as the ones a factory would provide, containing, for example, a large number of sensor measurements from a production line. Such data sets may even contain time series information that our machine learning models must leverage to predict future events, something that doesn’t arise for static data such as pictures. Applying a generative model to this kind of data can be useful, for example, if our predictive models need more samples to train on in order to generalize better. Also, if we come up with a model that can generate high-quality synthetic data, then it must have learned a good deal of the original data’s underlying structure. And if it has, we can use that representation as a new feature set for our predictive models to exploit!
In this post I will describe some of the GAN architectures that may be useful for data set augmentation, both sample- and feature-wise. Let’s start with the basic GAN.
A GAN is a model made up of two entities: a generator and a discriminator. Here we will consider them both to always be parametrized neural networks: G and D, respectively. The discriminator’s parameters are optimized to maximize the probability of correctly distinguishing real from fake data (from the generator) while the generator’s goal is to maximize the probability of the discriminator failing to classify its fake samples as fake.
The generator network produces samples by taking an input vector z, sampled from what is called a latent distribution, and transforming it by applying the function G defined by the network, yielding G(z). The discriminator network alternately receives G(z) and x, a real data sample, and outputs the probability that its input is real.
With proper hyperparameter tuning and enough training iterations, the generator and the discriminator, both updated via a gradient descent method, ideally converge jointly to a point where the distribution of the fake data matches the distribution from which the real data is sampled.
In the rest of this post, the workings of GANs will be illustrated on the MNIST dataset, either for generating new digits or for encoding the original ones in a latent space. We will also look at how they can be used on categorical and time series data.
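To make the two objectives concrete, here is a minimal NumPy sketch of the cross-entropy losses, given only the discriminator's output probabilities. It is an illustration rather than the code behind this post: the function names and the example probabilities are made up, and the generator loss uses the common "non-saturating" form, which pushes D(G(z)) towards 1.

```python
import numpy as np

def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Binary cross-entropy with real samples labelled 1 and fake samples labelled 0."""
    d_real = np.clip(d_real, eps, 1.0 - eps)
    d_fake = np.clip(d_fake, eps, 1.0 - eps)
    return float(-(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))))

def generator_loss(d_fake, eps=1e-12):
    """Non-saturating generator loss: low when the discriminator believes the fakes."""
    d_fake = np.clip(d_fake, eps, 1.0 - eps)
    return float(-np.mean(np.log(d_fake)))

# Hypothetical discriminator outputs D(x) and D(G(z)) for a batch of 4 samples each.
d_real = np.array([0.9, 0.8, 0.95, 0.7])
d_fake = np.array([0.1, 0.2, 0.05, 0.3])

print("discriminator loss:", discriminator_loss(d_real, d_fake))
print("generator loss:", generator_loss(d_fake))
```

In a real training loop these two losses are minimized in alternation, each with respect to its own network's parameters.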
To start off, here’s a bunch of samples generated by a simple GAN whose neural networks are Multilayer Perceptrons (MLPs), trained on the MNIST dataset:
Although the GAN as we’ve seen it works, in practice it has some drawbacks whose solutions have been the object of intensive research since the original 2014 paper by Ian Goodfellow et al. The major drawbacks have to do with training, which has become infamous for being extremely difficult. First, training a GAN is highly hyperparameter-dependent. Second, and most importantly, the loss functions (both the generator’s and the discriminator’s) are not informative: the generated samples may start to closely resemble the true data, approximating its distribution well, without this improvement showing up as any general trend in the losses. This means that we can’t just run a hyperparameter optimizer such as skopt on the losses and must instead tune the hyperparameters manually and iteratively, which is a shame.
The other drawbacks of this GAN architecture have to do with functionality. The way it’s shown in Figure 1 and using the original cross-entropy loss, we can’t:
Generating categorical data is a particularly difficult problem for GANs. Ian Goodfellow explained it in a very intuitive way in this reddit post:
You can make slight changes to the synthetic data only if it is based on continuous numbers. If it is based on discrete numbers, there is no way to make a slight change.
For example, if you output an image with a pixel value of 1.0, you can change that pixel value to 1.0001 on the next step.
If you output the word “penguin”, you can’t change that to “penguin + .001” on the next step, because there is no such word as “penguin + .001”. You have to go all the way from “penguin” to “ostrich”.
The key idea is that the generator cannot go “all the way” from one entity (e.g., “penguin”) to another (e.g., “ostrich”) in small steps. Because the space in between has zero probability of occurring, the discriminator can easily tell that samples in that space are not real, and hence it cannot be fooled by the generator.
To solve the issues associated with the original GAN, several other training approaches and architectures have been developed. In the following paragraphs, a brief description of each is presented. The goal of these descriptions is to get a sense of how these methods could be applied to structured data such as the kind you would find in a Kaggle competition.
Previously, we’ve set up a GAN to generate random digits that look like the ones from the MNIST data set. But what if we want to generate specific digits? To be able to tell our generator to generate any digit we want on command, only a very small change in training is needed. In each iteration, the generator takes as input not only z but also a one-hot encoded vector indicating the digit. The discriminator’s input then consists of not only the real or fake sample but also the same label vector.
Proceeding the same way as before but with this slight change of inputs, the Conditional GAN (CGAN) learns to generate samples conditioned on the label it takes as input.
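As a rough sketch of what this change of inputs looks like, the snippet below builds the generator and discriminator inputs for one sample of each digit. Concatenating the label vector to the other input is one common way of conditioning and is assumed here; the dimensions are illustrative and the "generated" samples are random stand-ins for the output of a trained generator.

```python
import numpy as np

rng = np.random.default_rng(0)

def one_hot(labels, num_classes):
    """Return a (batch, num_classes) one-hot matrix for integer labels."""
    return np.eye(num_classes)[labels]

# Illustrative sizes: 100-dimensional latent space, flattened 28x28 MNIST images.
batch_size, latent_dim, num_classes, data_dim = 10, 100, 10, 784

z = rng.normal(size=(batch_size, latent_dim))  # latent vectors
labels = np.arange(num_classes)                # one sample per digit, 0 to 9
y = one_hot(labels, num_classes)

# Generator input: the latent vector concatenated with the label vector.
g_input = np.concatenate([z, y], axis=1)

# Stand-in for G(z, y); a real generator would be a trained network.
fake_x = rng.uniform(size=(batch_size, data_dim))

# Discriminator input: the (real or fake) sample plus the same label vector.
d_input = np.concatenate([fake_x, y], axis=1)

print(g_input.shape, d_input.shape)  # (10, 110) (10, 794)
```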
Let’s then generate one sample of each digit! When sampling from the latent space, we also input a one-hot encoded vector indicating the class we want. Doing this for all 10 classes of digits yields the results in Figure 4:
The Wasserstein GAN (WGAN) is one of the most popular GANs. It consists of a change of objective that results in training stability, interpretability (correlation of the losses with sample quality) and the ability to generate categorical data. The key aspect is that the generator’s objective is to approximate the true data distribution, so the choice of distance measure between distributions matters, as that distance is the objective to minimize. The WGAN chooses the Wasserstein (or Earth-Mover) distance, or rather an approximation of it, because it can be shown to converge for sequences of distributions for which the Kullback-Leibler and Jensen-Shannon divergences don’t. If you are interested in the theory, read the original paper or this excellent summary.
Implementation-wise, the implications of approximating the Wasserstein distance can be summarized as follows:
The authors of the WGAN paper show that a GAN trained in this way exhibits training stability and interpretability, but only later was it shown that using the Wasserstein distance also gives the GAN the ability to generate categorical data (i.e., neither continuous-valued data like images nor integer-coded data like 1 for Sunday, 2 for Monday and so on). If the original GAN were trained on this kind of data, the discriminator’s loss would remain low throughout the iterations while the generator’s would keep increasing. A WGAN, in contrast, is trained on categorical data the same way as on continuous-valued data.
All one needs to do is this (see Figure 5 for an example): for each categorical variable in a data set, have a corresponding softmax output in the generator network with dimensionality equal to the number of possible discrete values. Instead of one-hot encoding the softmax output and using that as input to the discriminator, use the raw softmax output as if it were a set of continuous-valued variables. In this way, training will converge! At test time, to generate fake categorical data, just one-hot encode the generator’s softmax outputs (by taking the most probable value of each variable) and there you go!
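For reference, the Wasserstein objective itself is short. The sketch below assumes a critic (the WGAN name for the discriminator) that outputs unbounded real-valued scores instead of probabilities, and it includes the weight clipping used in the original WGAN paper to keep the critic approximately Lipschitz. The gradient penalty variant replaces clipping with a penalty term that needs automatic differentiation, so it is left out here. Function names and example scores are made up for illustration.

```python
import numpy as np

def critic_loss(critic_real, critic_fake):
    """The critic minimizes E[D(fake)] - E[D(real)], widening the score gap."""
    return float(np.mean(critic_fake) - np.mean(critic_real))

def generator_loss(critic_fake):
    """The generator minimizes -E[D(G(z))], raising the critic's score on fakes."""
    return float(-np.mean(critic_fake))

def clip_weights(weights, c=0.01):
    """Weight clipping as in the original WGAN paper: keep every weight in [-c, c]."""
    return [np.clip(w, -c, c) for w in weights]

# Hypothetical critic scores for a batch of 3 real and 3 generated samples.
critic_real = np.array([1.5, 2.0, 1.0])
critic_fake = np.array([-0.5, 0.0, -1.0])

print(critic_loss(critic_real, critic_fake))  # -2.0
print(generator_loss(critic_fake))            # 0.5
```

The negative of the critic loss is the quantity that approximates the Wasserstein distance, which is why it can be read as a measure of sample quality.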
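A minimal sketch of this recipe, assuming the generator's final layer produces one block of logits per categorical variable. The two variables (with 7 and 3 possible values) and the random logits are hypothetical stand-ins for a real generator's output.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def generator_heads(logits, cardinalities):
    """Apply one softmax per categorical variable to its slice of the logits."""
    outputs, start = [], 0
    for k in cardinalities:
        outputs.append(softmax(logits[:, start:start + k]))
        start += k
    return outputs

def to_one_hot(probs):
    """One-hot encode the most probable value in each row."""
    return np.eye(probs.shape[1])[probs.argmax(axis=1)]

rng = np.random.default_rng(0)
cardinalities = [7, 3]  # e.g. day of the week, plus a hypothetical 3-level variable
logits = rng.normal(size=(4, sum(cardinalities)))  # stand-in for the generator's last layer

soft = generator_heads(logits, cardinalities)

# Training time: the discriminator sees the raw softmax outputs.
train_input = np.concatenate(soft, axis=1)

# Test time: one-hot encode each variable to obtain valid categorical samples.
test_output = np.concatenate([to_one_hot(p) for p in soft], axis=1)

print(train_input.shape, test_output.shape)
```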
As an example of training a WGAN with gradient penalty on a data set with categorical values, in Figure 6 you can see the beautiful stable, converging loss functions you get when training it on the Sberbank Russian Housing Market data set from a Kaggle competition, which contains both continuous and categorical variables.
Of course, we may also combine the WGAN with the CGAN to train the WGAN in supervised fashion to generate samples conditioned on class labels!
Note: a further improvement on the Wasserstein GAN is the Cramer GAN, which aims at providing even better-quality samples and improved training stability. Whether it can generate categorical data is a topic for future research.
Although the WGAN seems to solve quite a lot of our problems, it doesn’t give access to latent space representations of the data. Finding these representations may be useful not only for controlling what data to generate by moving continuously in the latent space, but also for feature extraction.
The Bidirectional GAN (BiGAN) is an attempt at solving this issue. It works by learning not only a generative network but also, at the same time, an encoder network E which maps the data to the generator’s latent space. This is also done in an adversarial setting, using only one discriminator for both the generating and encoding tasks. The authors of the BiGAN show that, in the limit, the pair G and E yields an autoencoder: encoding a data sample via E and decoding it via G yields the original sample.
We saw earlier that the CGAN allows the conditioning of the generator to generate samples according to their labels. But would it be possible to learn to distinguish digits in a fully unsupervised manner by simply forcing a categorical structure in the GAN’s latent space? What about also setting a continuous code space that we can access to describe continuous semantic variations in data samples (in the case of MNIST, things like digit width or tilt)?
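In the BiGAN paper, the single discriminator does this double duty by judging joint pairs of a data sample and a latent code: (x, E(x)) for real data and (G(z), z) for generated data. Here is a small sketch of how those inputs could be assembled. The random linear maps stand in for trained networks, and the dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, data_dim, latent_dim = 8, 784, 50  # illustrative sizes

# Random linear maps as stand-ins for the trained generator G and encoder E.
W_g = rng.normal(scale=0.01, size=(latent_dim, data_dim))
W_e = rng.normal(scale=0.01, size=(data_dim, latent_dim))

def generator(z):
    """Stand-in for G: latent code -> data sample."""
    return np.tanh(z @ W_g)

def encoder(x):
    """Stand-in for E: data sample -> latent code."""
    return x @ W_e

x = rng.uniform(-1.0, 1.0, size=(batch, data_dim))  # "real" data
z = rng.normal(size=(batch, latent_dim))            # latent samples

real_pairs = np.concatenate([x, encoder(x)], axis=1)    # (x, E(x))
fake_pairs = np.concatenate([generator(z), z], axis=1)  # (G(z), z)

# One discriminator is trained to tell the two kinds of pairs apart.
d_inputs = np.concatenate([real_pairs, fake_pairs], axis=0)
d_targets = np.concatenate([np.ones(batch), np.zeros(batch)])

print(d_inputs.shape, d_targets.shape)  # (16, 834) (16,)
```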
The answer to both questions is yes. Better than that: we can do both things simultaneously! Truth is, we can impose any set of code space distributions that we find useful and train the GAN to encode meaningful traits in those distributions. Each code would learn to contain a different semantic trait of the data, resulting in effective information disentanglement.
The GAN that allows such a thing is the InfoGAN. Simply put, the InfoGAN tries to maximize the mutual information between the generator’s input code space and an inference net’s output. The inference net can simply be an extra output layer on the discriminator network, sharing all the other parameters, which means it adds almost no computational cost. Once trained, the InfoGAN’s discriminator inference output layer can be used for feature extraction or, if the code space contains label information, for classification!
Creating an InfoGAN with two code spaces — one continuous of dimension 2 and one discrete of dimension 10 — we can generate data conditioned on the discrete code to generate specific digits, and on the continuous code to generate specifically-styled digits, as seen in Figure 9. Note that no labels were harmed in this entirely unsupervised learning scheme — imposing a categorical distribution in the latent space suffices to make the model learn to encode label information in that distribution!
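The sketch below shows how such a generator input could be sampled, together with the cross-entropy term that the inference net minimizes for the discrete code (minimizing it maximizes a lower bound on the mutual information). The noise dimension of 62 and the uniform range of the continuous code follow the MNIST setup of the InfoGAN paper and are assumptions here, as are the function names.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_infogan_input(batch_size, noise_dim=62, n_classes=10, cont_dim=2):
    """Sample noise, a one-hot discrete code and a continuous code, then concatenate."""
    z = rng.normal(size=(batch_size, noise_dim))
    code_idx = rng.integers(0, n_classes, size=batch_size)
    c_disc = np.eye(n_classes)[code_idx]
    c_cont = rng.uniform(-1.0, 1.0, size=(batch_size, cont_dim))
    return np.concatenate([z, c_disc, c_cont], axis=1), code_idx, c_cont

def categorical_code_loss(q_probs, code_idx, eps=1e-12):
    """Cross-entropy between the sampled discrete code and the inference net's guess."""
    picked = q_probs[np.arange(len(code_idx)), code_idx]
    return float(-np.mean(np.log(picked + eps)))

g_input, code_idx, c_cont = sample_infogan_input(batch_size=5)
print(g_input.shape)  # (5, 74)

# An untrained inference net that guesses uniformly over the 10 codes.
uniform_guess = np.full((5, 10), 0.1)
print(categorical_code_loss(uniform_guess, code_idx))
```

After training, fixing the discrete code and sweeping the two continuous entries is what produces digits of one class in varying styles.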
The Adversarial Autoencoder (AAE) is where autoencoders meet GANs. In this model, two objectives are optimized. The first is the minimization of the reconstruction error of the data x through the encoder and decoder networks, P and Q, respectively. The second is the enforcement of a prior distribution on the code P(x) via adversarial training, in which P plays the role of the generator. So while P and Q are optimized to minimize the distance between x and Q(z), where z = P(x) is the code space vector of the autoencoder, P and D are optimized as a GAN to force the code space P(x) to match a pre-defined structure. This can be seen as a regularization of the autoencoder, forcing it to learn a meaningful, structured and cohesive code space (as opposed to a fractured one; see page 76 of these lecture notes by Geoffrey Hinton) that allows for effective feature extraction or dimensionality reduction. Also, because a known prior is imposed on the code vector, sampling from that prior and passing the samples through the decoder network Q yields a generative model!
Let us then impose a 2D Gaussian distribution with standard deviation 5 on the code space via adversarial training on the autoencoder. Sampling from neighboring points in this space, a continuous variation of the generated digits is observed!
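To show how the two AAE objectives fit together, here is a rough sketch with random linear maps standing in for the encoder P and the decoder Q. It computes the reconstruction error, assembles the code-space discriminator's inputs (samples from the 2D Gaussian prior with standard deviation 5 versus encoded data), and decodes prior samples to generate new data. Everything except the prior's shape and scale is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, data_dim, code_dim = 16, 784, 2
prior_std = 5.0  # standard deviation of the imposed 2D Gaussian prior

# Random linear maps as stand-ins for the trained encoder P and decoder Q.
W_p = rng.normal(scale=0.05, size=(data_dim, code_dim))
W_q = rng.normal(scale=0.05, size=(code_dim, data_dim))

def encode(x):
    """Stand-in for P: data sample -> code."""
    return x @ W_p

def decode(z):
    """Stand-in for Q: code -> data sample with values in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-(z @ W_q)))

x = rng.uniform(size=(batch, data_dim))  # "real" data

# Objective 1: reconstruction error between x and Q(P(x)).
z = encode(x)
reconstruction_loss = float(np.mean((x - decode(z)) ** 2))

# Objective 2: a discriminator on the code space separates prior samples
# (target 1) from encoded data (target 0); P is trained to fool it.
z_prior = rng.normal(scale=prior_std, size=(batch, code_dim))
d_inputs = np.concatenate([z_prior, z], axis=0)
d_targets = np.concatenate([np.ones(batch), np.zeros(batch)])

# Generation: sample codes from the prior and decode them.
generated = decode(rng.normal(scale=prior_std, size=(4, code_dim)))

print(reconstruction_loss, d_inputs.shape, generated.shape)
```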
Another thing we can do is train the AAE with labels to force the disentanglement of label and digit style information. This way, by fixing the desired label, variations in the imposed continuous latent space will result in different styles of the same digit. For digit number eight for example:
Clearly, there is a meaningful relationship between neighboring points! This property may come in handy when generating samples for our data set augmentation problem.
Often, real-world structured data consists of time series, that is, data in which each sample has some dependence on the previous sample. For this type of data, Recurrent Neural Network (RNN)-based models are often chosen for their intrinsic ability to model it. Leveraging these neural networks in our GAN models could, in principle, result in higher-quality samples and features!
Let us then replace the MLPs we used before in our GANs with RNNs, as proposed here. In particular, let’s make those RNNs Long Short-Term Memory (LSTM) networks (we really are dealing with the buzziest of the buzz words of Deep Learning; oops, I did it again) and try what I very originally called the Waves dataset. This dataset contains 1-D sinusoidal and sawtooth signals of different offsets, frequencies and amplitudes, all with the same number of time steps. From the RNN’s point of view, each sample consists of one wave with 30 time steps.
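A data set of this kind is easy to synthesize. The sketch below is a guess at what such a generator could look like, not the code used for this post: the parameter ranges for offset, frequency and amplitude are arbitrary choices, and `make_waves` is a hypothetical helper name.

```python
import numpy as np

def make_waves(n_samples, n_steps=30, seed=0):
    """Generate sinusoidal (label 0) and sawtooth (label 1) waves.

    Returns an array of shape (n_samples, n_steps, 1), the layout an RNN
    expects (one feature per time step), and the integer class labels.
    """
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 1.0, n_steps)
    labels = rng.integers(0, 2, size=n_samples)

    # Arbitrary illustrative ranges for the wave parameters.
    freq = rng.uniform(1.0, 4.0, size=(n_samples, 1))
    amp = rng.uniform(0.5, 2.0, size=(n_samples, 1))
    offset = rng.uniform(-1.0, 1.0, size=(n_samples, 1))

    cycles = freq * t                                  # shape (n_samples, n_steps)
    sine = np.sin(2.0 * np.pi * cycles)
    saw = 2.0 * (cycles - np.floor(cycles + 0.5))      # sawtooth in [-1, 1)
    shape = np.where(labels[:, None] == 0, sine, saw)

    waves = offset + amp * shape
    return waves[:, :, None], labels

waves, labels = make_waves(n_samples=100)
print(waves.shape, labels[:10])
```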
Let us then run our CGAN with both generator and discriminator networks as LSTM-based neural networks, turning it into an RCGAN that will be trained to learn to generate sinusoidal and sawtooth waves on demand:
After training, we may also inspect how variations in the latent space yield a continuous variation in the characteristics of the generated samples. In particular, if we impose a 2D normally distributed latent space and fix the class label to sinusoidal waves we get the samples shown in Figure 14. There, a clear continuous variation between low and high frequencies and amplitudes is observed, meaning that the RCGAN learned a meaningful latent space!
While using RNNs in GANs is useful for generating real-valued sequential data, it still doesn’t work for discrete sequences, and combining the Wasserstein distance with RNNs is not yet a clear option (enforcing the Lipschitz constraint on an RNN is a topic for further research). Some ideas that aim at solving this issue are SeqGAN and, more recently, the ARAE.
We have seen that, aside from all the fuss being generated (get it?) around GANs’ ability to generate really cool pictures, some architectures may also be useful for more general machine learning problems involving continuous and discrete-valued data. This post served as an introduction to that idea and was not intended to be a rigorous comparison between multi-purpose generative models, but it does make the case that such a study involving GANs is worth doing.
Note: the work shown here was developed in its entirety during a summer internship I took at jungle.ai.
Originally published here on October 5, 2017.