Long Short Term Memory (LSTM) is a modern form of the Recurrent Neural Network (RNN) that allows information to persist. It takes care of the vanishing gradient problem faced by standard RNNs.
Let's understand this with an everyday scenario. When you are reading a book, you remember what happened on the previous page and move forward with that knowledge. An RNN does the same thing: it recalls the previous data and uses it while processing the present input. The downside of RNNs is that they fail to remember long-term dependencies because of the vanishing gradient problem. LSTMs were made to solve this problem.
The high-level architecture of an LSTM network is similar to that of an RNN cell. The LSTM cell is divided into three parts, and each one performs a specific function.
The first is called the forget gate, as it decides whether the information coming from the earlier timestamp should be taken into account or treated as irrelevant. The second is called the input gate, which learns new information from the incoming input. The third is called the output gate, which passes the updated information from the present timestamp on to the next timestamp.
Let's look at a scenario to help understand how LSTM networks work. Suppose we have two separate sentences: "Divya likes to eat." and "Reena is not fat." From a human perspective, it's clear that in the first sentence Divya is the point of context, and in the next it's Reena.
As we move to the second sentence, our network should also understand that the context has changed to Reena. The forget gate in the LSTM is what makes the network forget such earlier details.
Let's understand these gates in detail and the role each one plays in the architecture:
Forget Gate:
The first and foremost task in the LSTM architecture is to decide whether the information from an earlier timestamp should be kept or discarded.
Equation of forget gate:
Ft = sigmoid(Xt * Uf + Ht-1 * Wf)
Equation variables:
Xt: Present timestamp input
Ht-1: Earlier timestamp hidden state
Uf: Input weight matrix
Wf: Hidden states weight matrix
The sigmoid function squashes Ft into a number between 0 and 1. Ft is then multiplied by the cell state of the earlier timestamp.
So when Ft comes out as 0, the network forgets the earlier state entirely; when Ft comes out as 1, it remembers everything.
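To make this concrete, below is a minimal numpy sketch of a single forget-gate step. The function name, the bias-free form, and the toy sizes and random weights are illustrative assumptions for demonstration, not part of the article's model.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forget_gate_step(x_t, h_prev, c_prev, Uf, Wf):
    # Ft = sigmoid(Xt * Uf + Ht-1 * Wf): every value lies between 0 and 1
    f_t = sigmoid(Uf @ x_t + Wf @ h_prev)
    # multiply the earlier cell state by Ft: 0 forgets, 1 keeps
    return f_t * c_prev

# toy usage with made-up sizes (4 input features, 3 hidden units)
rng = np.random.default_rng(0)
x_t = rng.normal(size=4)        # present timestamp input (Xt)
h_prev = rng.normal(size=3)     # earlier timestamp hidden state (Ht-1)
c_prev = rng.normal(size=3)     # earlier timestamp cell state
Uf = rng.normal(size=(3, 4))    # input weight matrix
Wf = rng.normal(size=(3, 3))    # hidden state weight matrix
print(forget_gate_step(x_t, h_prev, c_prev, Uf, Wf))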
Input Gate:
Divya likes to eat. She told me via a call that she ate at McDonald's last time.
In both the above sentences, the context is about Divya: that she likes to eat and that she ate at McDonald's. This information can be added to the cell state, but "she told me via a call" is not that important for the cell state, so that part can be ignored. This process of adding new information to the cell state is done using the input gate.
Structure of input gate:
The process of adding information to the cell state through the input gate is as follows:
A sigmoid function is added to the structure to regulate which values should be added to the cell state. It acts as a filter on the information coming from Ht-1 and Xt.
A vector is created that consists of all the candidate values that could be added. The tanh function is responsible for this, and its outputs range from -1 to +1.
In the end, the sigmoid output (the filter) is multiplied by the tanh output (the candidate vector), and the important information is added to the cell state using the addition operation.
This procedure ensures that only important information is added to the cell state and redundant information is left out.
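A minimal numpy sketch of a single input-gate step is given below. As in the forget-gate sketch, the function name, the bias-free form, and the weight matrices Ui, Wi, Uc, and Wc are illustrative assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def input_gate_step(x_t, h_prev, c_after_forget, Ui, Wi, Uc, Wc):
    # sigmoid filter over the information coming from Ht-1 and Xt
    i_t = sigmoid(Ui @ x_t + Wi @ h_prev)
    # candidate vector of possible new values, scaled to [-1, +1] by tanh
    c_candidate = np.tanh(Uc @ x_t + Wc @ h_prev)
    # add only the filtered candidates to what survived the forget gate
    return c_after_forget + i_t * c_candidate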
Output Gate:
Not all of the information in the cell state needs to be output at every timestamp. The task of selecting the important information from the current cell state and presenting it as output is done by the output gate.
Structure of output gate:
Below are the functions that the output gate performs:
A vector is created by applying the tanh function to the cell state, which scales the values to the -1 to +1 range.
A filter is made using the values of Ht-1 and Xt to regulate which values from the vector created in the above step should be sent out. This filter again uses a sigmoid function.
The values of the regulatory filter and the created vector are then multiplied and sent out as the output, which is also passed to the next cell as its hidden state.
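The same kind of sketch for the output gate, again with illustrative, bias-free weights Uo and Wo:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def output_gate_step(x_t, h_prev, c_t, Uo, Wo):
    # sigmoid filter built from Ht-1 and Xt
    o_t = sigmoid(Uo @ x_t + Wo @ h_prev)
    # scale the tanh of the updated cell state to produce the new hidden state
    return o_t * np.tanh(c_t)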
LSTM Text Generation:
Let’s create a model to test the above theory.
Importing the required libraries:
import numpy
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Dropout
from keras.models import Sequential
from keras.utils import np_utils
Loading the text file:
filename = "/Julierceaser.txt"
file = open(filename).read().lower()

# mapping characters to integers
chars_unique = sorted(list(set(file)))
char_to_int_dict = {}
int_to_char_dict = {}
for i, c in enumerate(chars_unique):
    char_to_int_dict.update({c: i})
    int_to_char_dict.update({i: c})
We have now opened the text file, converted all its characters to lowercase, and built the mappings between characters and integers.
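As an optional sanity check (not part of the original walkthrough), you can inspect the text and the character vocabulary that the mappings were built from:

print(len(file))          # total number of characters in the text
print(len(chars_unique))  # number of unique characters (vocabulary size)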
Dataset Preparation:
Input and output dataset preparation
Code:
A = []
B = []
for i in range(0, len(file) - 50, 1):
    seq = file[i: i + 50]
    tag = file[i + 50]
    A.append([char_to_int_dict[char] for char in seq])
    B.append(char_to_int_dict[tag])
The preparation is done in such a form that the LSTM will predict the "T" in 'ALPHABET': we feed ['A', 'L', 'P', 'H', 'A', 'B', 'E'] as input and expect ['T'] as the output.
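To verify the windowing on the actual text (again an optional check), the first training pair can be decoded back into characters:

first_window = ''.join(int_to_char_dict[i] for i in A[0])  # the first 50-character input window
first_target = int_to_char_dict[B[0]]                      # the character the model should predict
print(repr(first_window), '->', repr(first_target))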
Reshape A
Code:
A_modified = numpy.reshape(A, (len(A), 50, 1))
A_modified = A_modified / float(len(chars_unique))
B_modified = np_utils.to_categorical(B)
We scale the values of A_modified to the range 0 to 1 and one-hot encode our target values in B_modified.
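If you want to confirm the resulting shapes before building the model, a quick check looks like this:

print(A_modified.shape)  # (number of windows, 50, 1)
print(B_modified.shape)  # (number of windows, vocabulary size)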
Defining the LSTM model
Code:
lstm_model = Sequential()
lstm_model.add(LSTM(300, input_shape=(A_modified.shape[1], A_modified.shape[2]), return_sequences=True))
lstm_model.add(Dropout(0.2))
lstm_model.add(LSTM(300))
lstm_model.add(Dropout(0.2))
lstm_model.add(Dense(B_modified.shape[1], activation='softmax'))
lstm_model.compile(loss='categorical_crossentropy', optimizer='adam')
The first LSTM layer consists of 300 memory units and returns sequences. This is done to make sure that the next LSTM layer receives sequences rather than scattered data. A Dropout layer is applied after each LSTM layer to avoid overfitting the model.
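To inspect the stacked architecture, Keras can print a layer-by-layer summary (the exact parameter counts depend on the vocabulary size of your text):

lstm_model.summary()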
Model Fitting and generation of characters
Code:
lstm_model.fit(A_modified, B_modified, epochs=100, batch_size=30)

# pick a random seed sequence to start from
index_start = numpy.random.randint(0, len(A) - 1)
new_string = A[index_start]

# generation of characters
for i in range(50):
    c = numpy.reshape(new_string, (1, len(new_string), 1))
    c = c / float(len(chars_unique))
    # prediction
    index_pred = numpy.argmax(lstm_model.predict(c, verbose=0))
    char_out = int_to_char_dict[index_pred]
    seq_in = [int_to_char_dict[value] for value in new_string]  # the current seed as characters
    print(char_out)
    new_string.append(index_pred)
    new_string = new_string[1: len(new_string)]
The LSTM model above is fit over 100 epochs with a batch size of 30. A random seed sequence is then picked from the input data, and we start generating characters from it.
Output:
LSTMs are a very good solution for time-series and sequence-related problems. Their main disadvantage is that they are difficult to train.
I hope this article helped you understand LSTMs and how they work.