Warning: The babysteps series contains my preliminary explorations into a new area or topic and is prone to mistakes and misconceptions. You are welcome to comment and improve my understanding.

Recently, I participated in a test where I was asked to design an autoencoder. Given that I had barely dipped my toes into the deep learning domain, I felt a bit overwhelmed by this new idea. Reading up on it was reassuring. Some people claimed it is deep learning's answer to those who mock its dependence on labelled data, and tons of online tutorials have sprung up to help people design autoencoders for dimensionality reduction and anomaly detection.

"Wow", I thought, "This should be the saving grace. Is that why this test on anomaly detection asks me to do autoencoders?".

The architecture of autoencoders is clever: it lets a network learn about data without being told what the data means. So I definitely needed to take a look.

So what's an autoencoder?

It is a neural network architecture, or I would say, a neural network super-architecture. What do I mean? We have different types of neural network architectures: convolutional, recurrent, and so on. Autoencoders aren't held hostage to one particular type; rather, they use these network types as building blocks for their own architecture.

Note: An autoencoder is a way to organize neural networks in a particular way to handle special scenarios

What are those special scenarios?

  • When labeled data is not available for training, so the algorithm is on its own to explore the data.
  • When we not only lack labeled data, but also want to reduce the dimensionality of the data.

How does the autoencoder accomplish this task?

By a clever architecture.

The reason why neural networks require tons of training data is that they need to learn everything about the data from scratch[1]. The neural nets have no clue what the data is all about unless the labels tell them; the labels are what allow the neural nets to define a loss function and monitor their own performance. What does this monitoring do? It analyses which weight combinations make the neural net model the training data well (leaving aside regularization, etc.).

So when there are no labels, but just some input data, the neural net cannot monitor its performance. That means bad and good results are considered equal. Autoencoders turn this around and ask: what if I can create a mechanism to reproduce the input? And what if I can use this ability to reproduce the input to define a loss function and monitor the performance?
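To make that concrete, here is a minimal sketch (plain NumPy, with made-up numbers) of the idea: the "label" for each input is the input itself, and the loss is simply the reconstruction error.

import numpy as np

# Hypothetical example: x is an input sample, x_hat is the network's
# attempt at reproducing it. The reconstruction error acts as the loss,
# so no external label is needed.
x = np.array([0.2, 0.7, 0.1])        # the input
x_hat = np.array([0.25, 0.6, 0.15])  # the reconstruction
reconstruction_error = np.mean((x - x_hat) ** 2)  # mean squared error
print(reconstruction_error)  # small error => good reconstruction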

The autoencoders achieve this by partitioning the neural net into three parts:

  • Encoder
  • Code
  • Decoder

[Image: autoencoder schema, from Wikipedia]

For a control engineer this kind of translates to:

  • System identification or Parameter Estimation
  • System Model
  • System Simulation

This parallel excited me, though I was aware of its limitations. For one, the three steps are disconnected, so to speak, in control theory, whereas autoencoders bring a way to put the three modules together. How does the autoencoder do that? For now, it is magic to me. In simple terms, this is what appears to happen in an autoencoder:

  1. The encoder (a neural network, possibly with a progressively decreasing number of units in each subsequent layer, e.g. 100 → 50 → 25) encodes the input data, focusing on what is important to learn.
  2. The code is a small neural net layer that forms a condensed model of the input (say 10 units).
  3. The decoder is essentially the encoder in reverse (in terms of architecture, not weights): it reconstructs the input by undoing what the encoder did, in reverse order (25 → 50 → 100).

So autoencoders have solved the labelled training data problem? We can do unsupervised learning with neural nets? So the AI singularity is upon us?

Warning: No

Even though autoencoders seem elegant and powerful, it appears that they are not always classified as unsupervised learning algorithms. This is confusing, because they don't need any labels and so are technically unsupervised. Some people instead describe autoencoders as self-supervised learning. The linked article is a bit damning on autoencoders.

So what's the big deal with autoencoders?

Their main claim to fame comes from being featured in many introductory machine learning classes available online. As a result, a lot of newcomers to the field absolutely love autoencoders and can't get enough of them. This is the reason why this tutorial exists! (https://blog.keras.io/building-autoencoders-in-keras.html)

That makes autoencoders sound like a joke. This is a blog post on the Keras website, and though it is dated 2016 (an epoch ago in the deep learning domain), it goes on to give a more balanced view:

One reason why they have attracted so much research and attention is because they have long been thought to be a potential avenue for solving the problem of unsupervised learning, i.e. the learning of useful representations without the need for labels. Then again, autoencoders are not a true unsupervised learning technique (which would imply a different learning process altogether), they are a self-supervised technique, a specific instance of supervised learning where the targets are generated from the input data.

What is encouraging is that autoencoders were/are a prime candidate for unsupervised learning, so my initial impressions weren't wrong. However, it appears that they went out of favour, and I don't know whether the apparent resurgence (well, the only resurgence is the interest my test showed in using them for anomaly detection) is well founded. I have to understand them well before I can say anything about that.

What I did was build an autoencoder in Keras/TensorFlow for the test. I tried different internal configurations (Dense, LSTM), but the general shape is as follows (this is a generalized version I created so that one can reuse it for different numbers of layers/units):

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.models import Model

# Example values for the arguments
input_size = 100
nb_hidden_layer = 2
hidden_layer_size = np.array([50, 25])
code_size = 10

def autoencoder_model(input_size, nb_hidden_layer=1, hidden_layer_size=np.array([50]), code_size=10):
    """
    This is an autoencoder architecture of a neural network with Dense layers, with a number of choices on
    * The number of layers
    * The number of units in each layer (including the code unit)
    """
    keras.backend.clear_session()
    
    input_unit = layers.Input(shape=(input_size,))
    
    # Encoder: a stack of Dense layers, typically with decreasing widths
    dummy_unit = input_unit
    for i in range(0, nb_hidden_layer):
        hidden_unit_encoder = layers.Dense(hidden_layer_size[i], activation='relu')(dummy_unit)
        dummy_unit = hidden_unit_encoder
    
    # Code: the bottleneck layer holding the condensed representation
    code_unit = layers.Dense(code_size, activation='relu')(dummy_unit)
    
    # Decoder: the encoder widths replayed in reverse order
    dummy_unit = code_unit
    for i in range(nb_hidden_layer, 0, -1):
        hidden_unit_decoder = layers.Dense(hidden_layer_size[i-1], activation='relu')(dummy_unit)
        dummy_unit = hidden_unit_decoder
    
    # Output: same size as the input; sigmoid keeps values in [0, 1]
    output_unit = layers.Dense(input_size, activation='sigmoid')(dummy_unit)
    
    autoencoder = Model(input_unit, output_unit)
    autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
    print("------ The following Autoencoder Model was created ------")
    autoencoder.summary()
    return autoencoder

I will just try to quickly explain what I did.

  • First, the input layer. It carries the input size/shape (input_unit).
  • Second, a set of layers created in a for loop: the encoder (hidden_unit_encoder).
  • Third, the code unit (code_unit).
  • Fourth, the reverse of the encoder, again via a for loop: the decoder (hidden_unit_decoder).
  • Fifth, the output layer. I used a sigmoid activation because that application needed outputs in [0, 1] (output_unit).

Once the layers are created, I wrap them into a Model, the specific form Keras requires. Then the model is compiled. The following is what I get when I run this function; the summary() call gives an overview of the created model:

X_autoencoder = autoencoder_model(input_size, 2, [30, 16], 8)
------ The following Autoencoder Model was created ------
Model: "functional_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         [(None, 100)]             0         
_________________________________________________________________
dense (Dense)                (None, 30)                3030      
_________________________________________________________________
dense_1 (Dense)              (None, 16)                496       
_________________________________________________________________
dense_2 (Dense)              (None, 8)                 136       
_________________________________________________________________
dense_3 (Dense)              (None, 16)                144       
_________________________________________________________________
dense_4 (Dense)              (None, 30)                510       
_________________________________________________________________
dense_5 (Dense)              (None, 100)               3100      
=================================================================
Total params: 7,416
Trainable params: 7,416
Non-trainable params: 0
_________________________________________________________________
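For completeness, here is a hedged sketch of how one could train this model and then pull out the encoder half for dimensionality reduction. X_train is a hypothetical array of shape (1000, 100), scaled to [0, 1] to match the sigmoid output; it reuses X_autoencoder, Model, and np from above.

# Hypothetical training data, scaled to [0, 1] to match the sigmoid output
X_train = np.random.rand(1000, 100)

# The targets are the inputs themselves -- the "self-supervised" trick
X_autoencoder.fit(X_train, X_train, epochs=50, batch_size=32, validation_split=0.1)

# For dimensionality reduction, keep only the encoder half: a new Model
# from the input layer to the code layer (index 3 here, since the input
# layer and the two encoder layers come before it)
encoder = Model(X_autoencoder.input, X_autoencoder.layers[3].output)
X_code = encoder.predict(X_train)  # shape (1000, 8)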

So what did I use this autoencoder for?

An unsupervised anomaly detection algorithm for timeseries data.

Did I do well?

Not sure. But I will share it in another post.

What's my hope?

I think that autoencoders, when tuned well, can do a good job of learning the characteristics of signals. So if we feed in inputs that are mostly non-anomalous with a few anomalous components, the reconstruction error should point to possible anomalies. This is all well and good, though I also see why it can be very restrictive: it might work only within some region around the training data, and can't truly be unsupervised.
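As a hedged sketch of that hope (assuming autoencoder is a trained model, and X_train / X_test are hypothetical arrays of mostly-normal training data and new data to screen):

# Per-sample reconstruction errors on training and test data
train_errors = np.mean((X_train - autoencoder.predict(X_train)) ** 2, axis=1)
test_errors = np.mean((X_test - autoencoder.predict(X_test)) ** 2, axis=1)

# One common heuristic: flag samples whose error exceeds, say, the
# 99th percentile of the errors seen on (mostly normal) training data
threshold = np.percentile(train_errors, 99)
anomalies = test_errors > threshold  # True where reconstruction is poor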

But application developers don't care whether it is technically 'unsupervised' or not; what matters is that, for their purpose, it can do well even when labels are not available. That is perhaps why it is interesting for applications in Industry 4.0 / predictive maintenance, where anomaly detection is central.

[1] I'm talking in general here, setting aside the fact that transfer learning has changed things a lot for many neural net applications.