In this blog, we will learn about Recurrent Neural Networks (RNNs) and study the most important concepts related to them. Along with the theory, we will use images for a better representation and understanding of Recurrent Neural Networks.
Introduction to Recurrent Neural Networks
Generally, a recurrent neural network is an advanced type of artificial neural network that involves directed cycles in memory. This gives the network the ability to build on earlier types of networks, which work only with fixed-size input vectors and output vectors.
Understanding the Recurrent Neural Networks
Let's say we have a task: to predict the next word in a sentence. To accomplish it, we will first try to use a multilayer perceptron (MLP). An MLP has three kinds of layers: an input layer, a hidden layer, and an output layer. The input layer receives the input, the hidden layer applies its activations, and we finally receive the output.
With more hidden layers, these activations are passed on to the next hidden layer, and the successive activations together produce the output. Each hidden layer is characterized by its own weights and biases.
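As a minimal sketch of this setup (the layer sizes and variable names here are illustrative assumptions, not taken from the post), a forward pass through a single-hidden-layer MLP for next-word prediction could look like this:

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """One forward pass of a simple MLP: input -> hidden -> output."""
    h = np.tanh(W1 @ x + b1)            # hidden layer with its own weights and bias
    logits = W2 @ h + b2                # output layer with its own weights and bias
    e = np.exp(logits - logits.max())
    return e / e.sum()                  # softmax: probabilities for the next word

rng = np.random.default_rng(0)
vocab, hidden = 10, 8                   # toy sizes
x = rng.normal(size=vocab)              # encoding of the current word
W1, b1 = rng.normal(size=(hidden, vocab)), np.zeros(hidden)
W2, b2 = rng.normal(size=(vocab, hidden)), np.zeros(vocab)
print(mlp_forward(x, W1, b1, W2, b2))
```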
The hidden layers behave independently, since each has its own weights and activations. But our main objective is to identify the relationship between successive inputs. So, can we supply the inputs directly to the hidden layers? Yes, we can!
Here, the hidden layers are all different, because their weights and biases are different. Since each layer is independent, we can't simply combine them; to combine the hidden layers, they need the same weights and biases.
Once all the layers share the same weights and biases, they can be combined together, and we can roll all these hidden layers into a single recurrent layer.
So it is like supplying the input to the same hidden layer at every step. At all time steps, the weights of the recurrent neuron are the same, since it is a single neuron now. A recurrent neuron therefore stores the state of the previous input and combines it with the current input, thereby preserving the relationship of the current input with the previous ones.
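As a minimal NumPy sketch (the weight names and sizes are my own illustrative assumptions), one step of such a recurrent neuron combines the previous hidden state with the current input, reusing the same weights at every time step:

```python
import numpy as np

def rnn_step(x_t, h_prev, Wx, Wh, b):
    """One recurrent step: mix the current input with the previous hidden state."""
    return np.tanh(Wx @ x_t + Wh @ h_prev + b)

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3
Wx = rng.normal(size=(hidden_size, input_size))    # input-to-hidden weights
Wh = rng.normal(size=(hidden_size, hidden_size))   # hidden-to-hidden (recurrent) weights
b = np.zeros(hidden_size)

h = np.zeros(hidden_size)                          # initial state
for x_t in rng.normal(size=(5, input_size)):       # a toy sequence of 5 inputs
    h = rnn_step(x_t, h, Wx, Wh, b)                # the SAME weights are reused every step
print(h)                                           # final state carries information from all inputs
```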
What can RNNs do?
RNNs have shown great success in many NLP tasks. The most common type of RNN we use is the LSTM, which is much better at capturing long-term dependencies than vanilla RNNs are.
Why Recurrent Neural Networks?
Recurrent connections offer several advantages. They are very helpful in image recognition and in capturing context information: as the time steps increase, a unit gets influenced by a larger and larger neighborhood. With that information, recurrent networks can watch large regions of the input space; in a CNN this ability is limited to units in the higher layers. Furthermore, recurrent connections increase the network depth while keeping the number of parameters low through weight sharing. Reducing the number of parameters is also a modern trend in CNN architectures.
Additionally, recurrent connections give the network the ability to handle sequential data, which is very useful for many tasks. Recurrent connections between neurons are also biologically inspired and are used for many tasks in the brain, so using such connections can enhance artificial networks and bring interesting behaviors. The last big advantage is that an RNN offers some kind of memory, which can be used in many applications.
Training RNNs
Generally, training an RNN is similar to training a traditional neural network: we also use the backpropagation algorithm. However, because the parameters are shared by all time steps in the network, the gradient at each output depends not only on the calculations of the current time step but also on those of the previous time steps.
For example:
In order to calculate the gradient at a given time step, we would need to backpropagate three steps back and sum up the gradients. This is called Backpropagation Through Time (BPTT). If this doesn't make a whole lot of sense yet, don't worry, we'll have a whole post on the gory details. For now, just note that vanilla RNNs trained with BPTT have difficulties learning long-term dependencies, due to what is called the vanishing/exploding gradient problem. There is some machinery to deal with these problems, and certain types of RNNs (like LSTMs) were specifically designed to get around them.
The training of almost all networks is done by backpropagation, but with recurrent connections it has to be adapted. This is done by unfolding the net over time. Suppose the network consists of one recurrent layer and one feed-forward layer; the recurrent layer can then be unfolded into k instances of itself.
In the example, the network is unfolded with a depth of k = 3. After unfolding, the network can be trained in the same way as a feed-forward network with backpropagation, except that each epoch has to run through each unfolded layer. The algorithm for recurrent nets is then called Backpropagation Through Time (BPTT).
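Below is a minimal NumPy sketch of BPTT for a vanilla RNN unfolded over k = 3 steps (the weight names and sizes are illustrative assumptions, not taken from the post). The forward pass reuses the same weights at every step; the backward pass walks back through the unfolded steps and sums the gradients for those shared weights:

```python
import numpy as np

rng = np.random.default_rng(0)
k, input_size, hidden_size, output_size = 3, 4, 5, 2   # unfolding depth k = 3 (toy sizes)

# Shared parameters, reused at every time step.
Wx = rng.normal(scale=0.1, size=(hidden_size, input_size))
Wh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
Wy = rng.normal(scale=0.1, size=(output_size, hidden_size))
b, c = np.zeros(hidden_size), np.zeros(output_size)

xs = rng.normal(size=(k, input_size))        # toy input sequence
targets = rng.normal(size=(k, output_size))  # toy targets

# ---- Forward pass through the unfolded network ----
hs = [np.zeros(hidden_size)]                 # h_0
ys = []
for t in range(k):
    hs.append(np.tanh(Wx @ xs[t] + Wh @ hs[-1] + b))
    ys.append(Wy @ hs[-1] + c)
loss = 0.5 * sum(np.sum((ys[t] - targets[t]) ** 2) for t in range(k))

# ---- Backward pass (BPTT): walk back through time, summing the gradients ----
dWx, dWh, dWy = np.zeros_like(Wx), np.zeros_like(Wh), np.zeros_like(Wy)
db, dc = np.zeros_like(b), np.zeros_like(c)
dh_next = np.zeros(hidden_size)              # gradient flowing in from later time steps
for t in reversed(range(k)):
    dy = ys[t] - targets[t]
    dWy += np.outer(dy, hs[t + 1]); dc += dy
    dh = Wy.T @ dy + dh_next                 # local gradient + gradient from the future
    dz = dh * (1.0 - hs[t + 1] ** 2)         # backprop through tanh
    dWx += np.outer(dz, xs[t]); dWh += np.outer(dz, hs[t]); db += dz
    dh_next = Wh.T @ dz                      # pass the gradient one step further back

print("loss:", loss)
```

Because Wx, Wh, and Wy are shared across all unfolded instances, their gradients are accumulated (summed) over the k time steps, which is exactly what BPTT does.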
RNN Extensions
Over the years, researchers have developed more sophisticated types of RNNs to deal with some of the shortcomings of the vanilla RNN model.
a. Bidirectional RNNs
These are based on the idea that the output at a given time may depend not only on the previous elements in the sequence, but also on future elements.
For example:
To predict a missing word in a sequence, you want to look at both the left and the right context. Bidirectional RNNs are quite simple: they are just two RNNs stacked on top of each other, one processing the sequence forward and one backward. The output is then computed based on the hidden states of both RNNs.
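As a rough NumPy sketch of the idea (names and sizes are my own illustrative assumptions), we run one recurrent pass left-to-right and one right-to-left, then combine the two hidden states at each position:

```python
import numpy as np

def rnn_pass(xs, Wx, Wh, b):
    """Run a simple recurrent layer over a sequence, returning all hidden states."""
    h, states = np.zeros(Wh.shape[0]), []
    for x_t in xs:
        h = np.tanh(Wx @ x_t + Wh @ h + b)
        states.append(h)
    return states

rng = np.random.default_rng(0)
T, input_size, hidden_size = 6, 4, 3
xs = rng.normal(size=(T, input_size))

# Two independent RNNs: one reads the sequence forward, one reads it backward.
params_fwd = (rng.normal(size=(hidden_size, input_size)),
              rng.normal(size=(hidden_size, hidden_size)), np.zeros(hidden_size))
params_bwd = (rng.normal(size=(hidden_size, input_size)),
              rng.normal(size=(hidden_size, hidden_size)), np.zeros(hidden_size))

h_fwd = rnn_pass(xs, *params_fwd)
h_bwd = rnn_pass(xs[::-1], *params_bwd)[::-1]    # reverse back so indices line up

# The combined state at position t sees both the left and the right context.
h_bi = [np.concatenate([f, b]) for f, b in zip(h_fwd, h_bwd)]
print(len(h_bi), h_bi[0].shape)                  # 6 positions, each of size 2 * hidden_size
```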
b. Deep (Bidirectional) RNNs
These
are similar to Bidirectional RNNs, only that we now have multiple
layers per time step. In practice, this gives us a higher learning
capacity.
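A minimal sketch of the stacking part of this idea (with made-up names and sizes): the hidden states produced by one layer at each time step become the inputs of the next layer at the same time step. In a deep bidirectional RNN, each of these layers would additionally run in both directions as shown above.

```python
import numpy as np

def rnn_layer(xs, Wx, Wh, b):
    """One recurrent layer: maps a sequence of inputs to a sequence of hidden states."""
    h, states = np.zeros(Wh.shape[0]), []
    for x_t in xs:
        h = np.tanh(Wx @ x_t + Wh @ h + b)
        states.append(h)
    return np.array(states)

rng = np.random.default_rng(0)
T, input_size, hidden_size = 5, 4, 3
xs = rng.normal(size=(T, input_size))

# Layer 1 reads the raw inputs; layer 2 reads layer 1's hidden states at each time step.
layer1 = (rng.normal(size=(hidden_size, input_size)),
          rng.normal(size=(hidden_size, hidden_size)), np.zeros(hidden_size))
layer2 = (rng.normal(size=(hidden_size, hidden_size)),
          rng.normal(size=(hidden_size, hidden_size)), np.zeros(hidden_size))

h1 = rnn_layer(xs, *layer1)       # first layer over the input sequence
h2 = rnn_layer(h1, *layer2)       # second layer stacked on top, per time step
print(h2.shape)                   # (5, 3): one state per time step from the top layer
```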
c. LSTM networks
LSTMs don't have a fundamentally different architecture from RNNs, but they use a different function to compute the hidden state.
The memory in LSTMs is held in cells. Internally, these cells decide what to keep in memory and then combine the previous state, the current memory, and the input. It turns out that these types of units are very efficient at capturing long-term dependencies. LSTMs can be quite confusing in the beginning, but if you're interested in learning more, this post has an excellent explanation.
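A minimal NumPy sketch of a single LSTM cell step (the gate names follow the standard formulation; the weight shapes here are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step: the gates decide what to forget, what to write, and what to output."""
    z = W @ np.concatenate([h_prev, x_t]) + b    # all four gates computed in one matrix product
    H = h_prev.size
    f = sigmoid(z[0:H])          # forget gate: what to erase from the cell memory
    i = sigmoid(z[H:2*H])        # input gate: what new information to write
    g = np.tanh(z[2*H:3*H])      # candidate memory content
    o = sigmoid(z[3*H:4*H])      # output gate: what part of the memory to expose
    c = f * c_prev + i * g       # combine previous memory with the current input
    h = o * np.tanh(c)           # new hidden state
    return h, c

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3
W = rng.normal(scale=0.1, size=(4 * hidden_size, hidden_size + input_size))
b = np.zeros(4 * hidden_size)

h = c = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):     # toy sequence
    h, c = lstm_step(x_t, h, c, W, b)
print(h, c)
```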
Advantages of RNN
a. Store Information
An RNN can use its feedback connections to store information over time in the form of activations. This ability is significant for many applications; for this reason, recurrent networks are often described as having some form of memory.
b. Learn Sequential Data
An RNN can handle sequential data of arbitrary length. A default feed-forward network can only map one fixed-size input to one fixed-size output. With the recurrent approach, one-to-many, many-to-one, and many-to-many mappings from inputs to outputs are also possible.
One example of a one-to-many network is labeling an image with a sentence. The many-to-one approach could take a sequence of images and produce one sentence for it. Finally, many-to-many approaches can be used for language translation; another use case would be labeling each image of a video sequence.
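As a small NumPy sketch of the many-to-one pattern (all names and sizes here are illustrative assumptions), the network reads a whole sequence and emits a single output from the final hidden state:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def many_to_one(xs, Wx, Wh, Wy, b, c):
    """Read a whole sequence, then produce a single output from the last hidden state."""
    h = np.zeros(Wh.shape[0])
    for x_t in xs:                       # many inputs...
        h = np.tanh(Wx @ x_t + Wh @ h + b)
    return softmax(Wy @ h + c)           # ...one output

rng = np.random.default_rng(0)
T, input_size, hidden_size, n_classes = 7, 4, 3, 2
xs = rng.normal(size=(T, input_size))    # e.g. feature vectors for a sequence of frames
Wx = rng.normal(size=(hidden_size, input_size))
Wh = rng.normal(size=(hidden_size, hidden_size))
Wy = rng.normal(size=(n_classes, hidden_size))
print(many_to_one(xs, Wx, Wh, Wy, np.zeros(hidden_size), np.zeros(n_classes)))
```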
Applications of RNN
RNNs are particularly useful for training on any type of sequential data. For example:
It would make sense to use an RNN for tasks such as image/video captioning, word prediction, translation, and image processing. However, an RNN can also be trained on non-sequential data in a non-sequential manner. Not too long ago I implemented an RNN for a computational neuroscience project. In case you want to implement your very first RNN, here are some tips:
a. Unfold your network
This allows you to visualize how the network interacts with itself at adjacent time steps, and it also allows you to visualize how the error is back-propagated through the system (BPTT).
The rule of thumb is: any connection at time step 't' that isn't feed-forward should be connected to the next time step at 't+1'.
b. Keep track of your back-propagated errors
Don't duplicate parameters: use one set of weights for all your states (time steps). This ensures that you are using a minimal amount of memory and that the weights are the same across all states.
c. RNNs are used in speech processing, non-Markovian control, and music composition. In addition, RNNs have been used successfully for sequential data such as handwriting recognition and speech recognition.
d. The advantage in comparison to feed-forward networks is that an RNN can handle sequential data.
A single RNN can be used for sequence labeling; the most successful applications of RNNs are tasks like handwriting recognition and speech recognition.
e. They are also used in clinical decision support systems, for example with a network based on the Jordan/Elman neural network. Furthermore, a recurrent fuzzy network for the control of dynamic systems has been proposed as a newer application that combines an RNN with a CNN.
f. A great application is in Natural Language Processing (NLP). Many people on the internet have demonstrated that an RNN can represent a language model. These language models can take as input a large corpus such as Shakespeare's poems, and after training they can generate their own Shakespearean poems that are very hard to differentiate from the originals!