In this article, we’ll be learning about using machine learning on sequential data. Recurrent Neural Networks (RNNs) and Long Short Term Memory (LSTM) are two types of networks that could be used for this purpose. Lastly, we’ll implement one TensorFlow model from scratch using the IMDB dataset.

Its applications are very wide including chatbots, translators, text generators, sentiment analysis, speech recognition and so on…

Introduction

First, let’s understand sequential data in detail so that we are on the same page. Sequential data could be any data that is dependent on the previous version of it. For example, text data in communication would require an understanding of a topic in a sequential manner. Another good example is sound data, as we need to remember what someone said earlier to understand the context of the current discussion. So such types of models are highly dependent on the sequence of the data and little tweaks in sequence could show high changes to accuracy.

Prerequisites

Basic understanding of neural networks
Basic understanding of Natural Language Processing (NLP)
Tensorflow 2.0
Python3

RNNs

This type of network has a block of function which receives two inputs, activation and input data and returns an output. Again the network is passing that output to the same block with the next input data recurrently until all the input data are used. Hence, its name is Recurrent neural network. It could be understood better from the following figure:

But this model has one drawback with learning long sequences.

Suppose there is a sentence like this, “Neil was an astronaut. He was also the first person to land on the moon.”. Here, we can clearly understand that ‘He’ in the second sentence represents Neil but that information is not conveyed in RNNs. To tackle this situation we’ll understand GRU.

GRU

As we’d seen above, RNN is not able to memorize the context of conversation which is not suitable for real-time usage. So as a solution to this Gated Recurrent Unit (GRU) was represented. It has a memory cell unit to remember the context of previous sequences.

LSTM

LSTM stands for Long Short Term Memory. It is even an advanced version of GRU. Even though it was invented way before GRU, the complexity of it is higher than GRU. It has multiple gates to handle different parameters. But this increases the calculation overhead on the model and is slow to train compared to GRU which could be considered as a trade-off for high accuracy.

Types of models

These models are categorized into three categories based on the inputs and outputs. They are classified as below:

Many-Many models: These models have multiple inputs and multiple outputs. Translators are one of the examples of such models.
One-Many models: These models have a single input and multiple outputs. Some text generators are such to generate the sequence of text from a single word.
Many-one models: These models have multiple inputs and a single output. Sentiment analysis is an example of such a model that takes a sequence of review text as input and outputs its sentiment.

Into the code

Now, we’ll build a model using Tensorflow for running sentiment analysis on the IMDB movie reviews dataset. The dataset is from Kaggle. It contains 50k reviews with its sentiment i.e. positive or negative. This will be the type of many-one model.

Starting by importing required packages.

1import numpy as np
2import pandas as pd
3import matplotlib.pyplot as plt
4import tensorflow as tf
5from tensorflow.keras.preprocessing.text import Tokenizer
6from keras.utils import to_categorical
7from tensorflow.keras.preprocessing.sequence import pad_sequences

Next, we’ll import data from the CSV file downloaded from Kaggle and convert label data into numerical form for easy implementation with TensorFlow.

1df = pd.read_csv("/home/aubergine/Downloads/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv")
2
3train_set = 45000
4max_len_text = 2000
5
6df['sentiment'] = df['sentiment'].replace('positive', 1)
7df['sentiment'] = df['sentiment'].replace('negative', 0)

Now, the training on text data is not as simple as numerical data. So, convert it to tokenized vectors for every sentence. The following code does exactly that.

1tokenizer = Tokenizer(num_words=10000)
2tokenizer.fit_on_texts(df['review'])
3vector = tokenizer.texts_to_sequences(df['review'])

Performing data split into training and testing using ranges in python.

1train_data = np.array(vector)[:train_set]
2train_label = (np.array(df['sentiment'])[:train_set])
3test_data = np.array(vector)[train_set:]
4test_label = (np.array(df['sentiment'])[train_set:])

Convert these data vectors to padded sequences as valid input to the model.

1padded_train = pad_sequences(train_data, maxlen=max_len_text)
2padded_test = pad_sequences(test_data, maxlen=max_len_text)

Finally, we’ll build a model on top of the Sequential class of Keras. Then add layers for Embedding, LSTM, and Dense for calculations. As you can see, using LSTM as a Bidirectional layer which helps it learn in both the direction. Embedding layer is part of Word Embeddings to understand meaning out of word vectors. In the end, the Dense layer is for converting Bidirectional layer output to binary output with sigmoid activation.

1model = tf.keras.Sequential()
2model.add(tf.keras.layers.Embedding(13000, 16, input_length=max_len_text))
3model.add(tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, dropout=0.2)))
4model.add(tf.keras.layers.Dense(1, activation='sigmoid'))
5model.summary()

We’ll use binary_crossentropy as loss function and adam as an optimizer for compiling the model.

1model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

Our data is too big for training, so we’ll use fewer epochs to train the model but it should be enough to get high accuracy.

1history = model.fit(padded_train, train_label, epochs=4, validation_data=(padded_test, test_label))

This results in 93% training accuracy and 89% testing accuracy in just 4 epochs on 45k rows of reviews. We can try more number of epoch to get even higher accuracy. Below graphs are accuracy and loss for both the versions of the dataset.

Accuracy and Loss Graph for Both Versions of the Dataset

Here the red line represents training data and the blue line represents testing data. This graph is good for only 4 epochs of training. Further, you can use multiple layers of LSTM to increase the complexity and accuracy of the model.

These layers are highly resource consuming. So, choose the number of layers carefully. If you want to add another layer of LSTM, then set return_sequences=True in Keras layer.

Also, read - How to Use Machine Learning to Solve the House Price Prediction Problem.

Conclusion

Long Short Term Memory is the best representation of Sequential models for applications that needs to understand the context of the data.

Download report

Authors

Vivek Padia

Software Engineer

I work with Aubergine Solutions as a Machine Learning engineer. We believe in having a problem-solving attitude. I have worked with several different technologies related to ML and integrating them with cloud-based services.

View Author