LSTM Neural Networks have seen a lot of use recently, both for text and music generation, and for Time Series Forecasting.
Today, we will train an LSTM Neural Network for text generation, so that it can write in H. P. Lovecraft’s style.
In order to train this LSTM, we’ll be using TensorFlow’s Keras API for Python.
As usual, I’ll show you my Python code for every experiment, and the results I obtained. But first, let’s do some explaining.
What are LSTM Neural Networks?
The most vanilla, run-of-the-mill Neural Network, called a Multi-Layer Perceptron, is just a composition of fully connected layers.
In these models, the input is a vector of features, and each subsequent layer is a set of “neurons”.
Each neuron applies an affine (linear) transformation to the previous layer’s output, and then some non-linear function to that result.
The output of a layer’s neurons, a new vector, is fed to the next layer, and so on.
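To make that concrete, here is a minimal numpy sketch of such a composition of layers (the layer sizes, tanh activation and random weights are arbitrary choices for illustration, not a trained model):

```python
import numpy as np

def dense(x, W, b, activation=np.tanh):
    # one fully connected layer: affine transformation, then a non-linearity
    return activation(W @ x + b)

rng = np.random.default_rng(0)
x = rng.normal(size=4)                      # input: a vector of 4 features
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)
W2, b2 = rng.normal(size=(3, 8)), np.zeros(3)

h = dense(x, W1, b1)                        # hidden layer: 8 neurons
y = dense(h, W2, b2)                        # output layer: 3 neurons
```

Each layer’s output vector is simply fed to the next layer, exactly as described above.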
An LSTM (Long Short-Term Memory) Neural Network is just another kind of Artificial Neural Network, falling in the category of Recurrent Neural Networks.
What makes LSTM Neural Networks different from regular Neural Networks is that they have LSTM cells as neurons in some of their layers.
Much like Convolutional Layers help a Neural Network learn about image features, LSTM cells help the Network learn about temporal features in data, something which other Machine Learning models traditionally struggled with.
What I mean by temporal features is that, for instance, a Neural Network may learn that the sentence “the cat ate the mouse.” is very different from “the mouse ate the cat.”, even though they both contain the same words.
How do LSTM cells work? I’ll explain it now, though I highly recommend you give a dedicated LSTM tutorial a chance too.
How do LSTM cells work?
An LSTM layer will contain many LSTM cells.
Normally, we feed our LSTM Neural Network a whole matrix as its input, where each column corresponds to something that “comes before” the next column.
Each LSTM cell will then only look at a single column of that input, plus the previous LSTM cell’s output.
This way, each LSTM cell has two different input vectors: the previous LSTM cell’s output (which gives it some information about the previous input columns) and its own input column.
LSTM Cells in action: an intuitive example.
For instance, if we were training an LSTM Neural Network to predict stock exchange values, we could feed it a vector with a stock’s closing price in the last three days.
The first LSTM cell, in that case, would use the first day as input, and send some extracted features to the next cell.
That second cell would look at the second day’s price, plus whatever the previous cell extracted from the first day, before generating new inputs for the next cell.
After doing this for each cell, the last one ends up with a lot of temporal information: it receives, from the previous cell, features extracted from all the earlier closing prices.
This accumulation of temporally extracted features is what we would usually call “context”.
It is possible to experiment with different time windows, and also change how many units (neurons) will look at each day’s data, but this is the general idea.
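A quick numpy sketch of that windowing, using made-up closing prices and a three-day time window (a real pipeline would also normalize the values):

```python
import numpy as np

prices = np.array([101.0, 102.5, 101.8, 103.2, 104.0, 103.5])
window = 3   # three days of closing prices per sample

# build (samples, timesteps) sliding windows: each row is one 3-day input
X = np.stack([prices[i:i + window] for i in range(len(prices) - window)])
y = prices[window:]   # target: the next day's closing price
```

Changing the time window is just a matter of changing `window`; the number of units looking at each day is a property of the model, set later.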
How LSTM Cells work: the Math.
The actual math behind what each cell extracts from the previous one is a bit more involved.
The “forget gate” is a sigmoid layer, that regulates how much the previous cell’s outputs will influence this one’s.
It takes as input both the previous cell’s “hidden state” (another output vector), and the actual inputs from the previous layer.
Since it is a sigmoid, it will return a vector of “probabilities”: values between 0 and 1.
These values multiply the previous cell’s state element-wise, regulating how much of it carries over into this cell’s state.
For instance, in a drastic case, the sigmoid may return a vector of zeroes; the whole previous state would then be multiplied by 0 and discarded.
This may happen, for example, if the layer sees a very big change in the input distribution.
Unlike the forget gate, the input gate’s output is added to the previous cell’s outputs (after they’ve been multiplied by the forget gate’s output).
The input gate is the element-wise product of two different layers’ outputs, though both take the same inputs as the forget gate (the previous cell’s hidden state, and the previous layer’s outputs):
- A sigmoid unit, regulating how much the new information will impact this cell’s output.
- A tanh unit, which actually extracts the new information. Notice tanh takes values between -1 and 1.
The product of these two units (which could, again, be 0, or be exactly equal to the tanh output, or anything in between) is added to this neuron’s cell state.
The LSTM cell’s outputs
The cell’s state is what the next LSTM cell will receive as input, along with this cell’s hidden state.
The hidden state will be another tanh unit applied to this neuron’s state, multiplied by another sigmoid unit that takes the previous layer’s and cell’s outputs (just like the forget gate).
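Putting the forget, input and output gates together, here is a minimal numpy sketch of one LSTM cell step, following the standard formulation described above (the weights are random and purely for illustration; Keras handles all of this internally):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    # concatenate the previous hidden state with this step's input
    z = np.concatenate([h_prev, x])
    f = sigmoid(p["Wf"] @ z + p["bf"])   # forget gate: values in 0..1
    i = sigmoid(p["Wi"] @ z + p["bi"])   # input gate: how much new info to add
    g = np.tanh(p["Wg"] @ z + p["bg"])   # candidate values, in -1..1
    c = f * c_prev + i * g               # new cell state
    o = sigmoid(p["Wo"] @ z + p["bo"])   # output gate
    h = o * np.tanh(c)                   # new hidden state
    return h, c

rng = np.random.default_rng(1)
n_in, n_units = 3, 5
p = {f"W{k}": rng.normal(size=(n_units, n_units + n_in)) for k in "figo"}
p.update({f"b{k}": np.zeros(n_units) for k in "figo"})

h = c = np.zeros(n_units)
for x in rng.normal(size=(4, n_in)):     # run the cell over 4 time steps
    h, c = lstm_step(x, h, c, p)
```

Note how the forget gate `f` scales the old state, the input gate `i` scales the tanh candidate `g`, and the hidden state `h` is the output-gated tanh of the cell state, just as described.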
Most LSTM tutorials include a diagram of the cell showing all of these gates at once; I recommend looking one up, since it makes the data flow much easier to follow.
Now that we’ve covered the theory, let’s move on to some practical uses!
As usual, all of the code is available on GitHub if you want to try it out, or you can just follow along and see the gists.
Training LSTM Neural Networks with TensorFlow Keras
For this task, I used this dataset containing 60 Lovecraft tales, or almost all of his works.
Since he wrote most of his work in the 20s, and he died in 1937, it’s now mostly in the public domain, so it wasn’t that hard to get and doesn’t really count as piracy.
I thought training a Neural Network to write like him would be an interesting challenge.
This is because, on the one hand, he had a very distinct style (with abundant purple prose: weird words and elaborate language); on the other, he used a very complex vocabulary, which a Network may have trouble picking up.
For instance, here’s a random sentence from the first tale in the dataset:
“At night the subtle stirring of the black city outside, the sinister scurrying of rats in the wormy partitions, and the creaking of hidden timbers in the centuried house, were enough to give him a sense of strident pandemonium.”
If I can get a Neural Network to write “pandemonium”, then I’ll be impressed.
Preprocessing our data
In order to train an LSTM Neural Network to generate text, we must first preprocess our text data so that it can be consumed by the network.
In this case, since a Neural Network takes vectors as input, we need a way to convert the text into vectors.
For these examples, I decided to train my LSTM Neural Networks to predict the next M characters in a string, taking as input the previous N ones.
For a word-based model that generates text one word at a time instead of one character at a time, check out my Markov Chains for Text Generation model, where I train an algorithm on Game of Thrones’ corpus.
To be able to feed it the N characters, I did a one-hot encoding of each one of them, so that the network’s input is a matrix of C×N elements, where C is the total number of different characters in my dataset.
First, we read the text files and concatenate all of their contents.
We limit our characters to be alphanumerical, plus a few punctuation marks.
We can then proceed to one-hot encode the strings into matrices, where every element of the j-th column is a 0 except for the one corresponding to the j-th character in the corpus.
In order to do this, we first define a dictionary that assigns an index to each character.
Notice how, if we wished to sample our data, we could just make the variable slices smaller.
I also chose a value for SEQ_LENGTH of 50, making the network receive 50 characters and try to predict the next 50.
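Here is a toy sketch of that preprocessing (the corpus, the SEQ_LENGTH of 10 and the variable names are illustrative; the real script uses the full Lovecraft corpus and a SEQ_LENGTH of 50). Note that the matrices here are laid out as one row per character, the transpose of the C×N description above, since that is the shape Keras expects:

```python
import numpy as np

corpus = "the cat ate the mouse. the mouse ate the cat."
chars = sorted(set(corpus))
char_to_idx = {ch: i for i, ch in enumerate(chars)}  # an index per character
VOCAB_SIZE = len(chars)
SEQ_LENGTH = 10

def one_hot(s):
    # (len(s), VOCAB_SIZE) matrix: row j is all zeros except at s[j]'s index
    m = np.zeros((len(s), VOCAB_SIZE))
    m[np.arange(len(s)), [char_to_idx[ch] for ch in s]] = 1
    return m

# inputs are SEQ_LENGTH characters; targets are the SEQ_LENGTH that follow
slices = range(0, len(corpus) - 2 * SEQ_LENGTH, SEQ_LENGTH)
X = np.stack([one_hot(corpus[i:i + SEQ_LENGTH]) for i in slices])
Y = np.stack([one_hot(corpus[i + SEQ_LENGTH:i + 2 * SEQ_LENGTH]) for i in slices])
```

Making `slices` step in smaller increments is the sampling knob mentioned above: it produces more (overlapping) training sequences from the same corpus.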
Training our LSTM Neural Network
In order to train the Neural Network, we must first define it.
This Python code creates an LSTM Neural Network with two LSTM layers, each with 100 units.
Remember each unit has one cell for each character in the input sequence, thus 50.
Here VOCAB_SIZE is just the amount of characters we’ll use, and TimeDistributed is a way of applying a given layer to each different cell, maintaining temporal ordering.
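I haven’t reproduced the exact gist here, but a Keras model matching that description might look like the following sketch (the SEQ_LENGTH and VOCAB_SIZE values are placeholders, and the loss matches the binary cross-entropy reported below):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, TimeDistributed, Dense

SEQ_LENGTH, VOCAB_SIZE = 50, 60  # illustrative values, not the exact ones used

model = Sequential([
    # two LSTM layers of 100 units each; return_sequences keeps one output
    # vector per input character instead of only the last one
    LSTM(100, return_sequences=True, input_shape=(SEQ_LENGTH, VOCAB_SIZE)),
    LSTM(100, return_sequences=True),
    # TimeDistributed applies the same softmax Dense layer at every time step
    TimeDistributed(Dense(VOCAB_SIZE, activation="softmax")),
])
model.compile(loss="binary_crossentropy", optimizer="adam")
```

Training is then a regular `model.fit(X, Y, ...)` call on the one-hot tensors built during preprocessing.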
For this model, I actually tried many different learning rates to test convergence speed vs overfitting.
The training code itself is in the GitHub repo; what you see below came from the run with the best performance in terms of loss minimization.
However, even with a binary_crossentropy loss of 0.0244 in the final epoch (after 500 epochs), here’s what the model’s output looked like:
Tolman hast toemtnsteaetl nh otmn tf titer aut tot tust tot ahen h l the srrers ohre trrl tf thes snneenpecg tettng s olt oait ted beally tad ened ths tan en ng y afstrte and trr t sare t teohetilman hnd tdwasd hxpeinte thicpered the reed af the satl r tnnd Tev hilman hnteut iout y techesd d ty ter thet te wnow tn tis strdend af ttece and tn aise ecn
There are many good things about this output, and many bad ones as well.
The way the spacing is set up, with words mostly between 2 and 5 characters long with some longer outliers, is pretty similar to the actual word length distribution in the corpus.
I also noticed the letters ‘T’, ‘E’ and ‘I’ were appearing very commonly, whereas ‘y’ or ‘x’ were less frequent.
When I looked at letter relative frequencies in the sampled output versus the corpus, they were pretty similar. It’s the ordering that’s completely off.
There is also something to be said about how capital letters only appear after spaces, as is usually the case in English.
To generate these outputs, I simply asked the model to predict the next 50 characters for different 50 character subsets in the corpus. If it’s this bad with training data, I figured testing or random data wouldn’t be worth checking.
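Turning the model’s per-time-step probability vectors back into text requires an inverse of the char-to-index dictionary plus a decoding rule. Here is a minimal sketch using greedy argmax decoding (the decoding strategy is my assumption; the original script may sample differently), shown on a toy 3-character vocabulary:

```python
import numpy as np

def decode(probs, idx_to_char):
    # greedy decoding: take the most likely character at each time step
    return "".join(idx_to_char[i] for i in probs.argmax(axis=-1))

# toy illustration: 4 time steps over a 3-character vocabulary
idx_to_char = {0: "a", 1: "b", 2: "c"}
probs = np.array([[0.1, 0.7, 0.2],
                  [0.8, 0.1, 0.1],
                  [0.2, 0.2, 0.6],
                  [0.9, 0.05, 0.05]])

text = decode(probs, idx_to_char)  # → "baca"
```

In the real setup, `probs` would be `model.predict(...)` on a one-hot-encoded 50-character window, and `idx_to_char` the inverse of the dictionary built during preprocessing.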
The nonsense actually reminded me of one of H. P. Lovecraft’s most famous tales, “The Call of Cthulhu”, where people start having hallucinations about this cosmic, eldritch being, and say things like:
Ph’nglui mglw’nafh Cthulhu R’lyeh wgah’nagl fhtagn.
Sadly, the model wasn’t overfitting on that either; it was clearly underfitting.
So I tried to make its task smaller, and the model bigger: 125 units, predicting only 30 characters.
Bigger model, smaller problem. Any results?
With this smaller model, after another 500 epochs, some patterns began to emerge.
Even though the loss function wasn’t that much smaller (at 210), the character frequencies remained similar to the corpus’s.
The ordering of characters improved a lot, though. Here’s a random sample from its output; see if you can spot some words.
the sreun troor Tvwood sas an ahet eae rin and t paared th te aoolling onout The e was thme trr t sovtle tousersation oefore tifdeng tor teiak uth tnd tone gen ao tolman aarreed y arsred tor h tndarcount tf tis feaont oieams wnd toar Tes heut oas nery tositreenic and t aeed aoet thme hing tftht to te tene Te was noewked ay tis prass s deegn aedgireean ect and tot ced the sueer anoormal -iuking torsarn oaich hnher tad beaerked toring the sars tark he e was tot tech
Tech, the, and, was… small words are where it’s at! It also realized many words ended with common suffixes like -ing, -ed, and -tion.
Out of 10,000 words, 740 were “the”, 37 ended in “tion” (whereas only 3 contained “tion” without ending in it), and 115 ended in “ing”.
Other common words were “than” and “that”, though the model was clearly still unable to produce English sentences.
Even bigger model
This gave me hopes. The Neural Network was clearly learning something, just not enough.
So I did what you do when your model underfits: I tried an even bigger Neural Network.
Take into account that I’m running this on my laptop.
With a modest 16GB of RAM and an i7 processor, these models take hours to learn.
So I set the amount of units to 150, and tried my hand again at 50 characters.
I figured maybe giving it a smaller time window was making things harder for the Network.
Here’s what the model’s output was like, after a few hours of training.
andeonlenl oou torl u aote targore -trnnt d tft thit tewk d tene tosenof the stown ooaued aetane ng thet thes teutd nn aostenered tn t9t aad tndeutler y aean the stun h tf trrns anpne thin te saithdotaer totre aene Tahe sasen ahet teae es y aeweeaherr aore ereus oorsedt aern totl s a dthe snlanete toase af the srrls-thet treud tn the tewdetern tarsd totl s a dthe searle of the sere t trrd eneor tes ansreat tear d af teseleedtaner nl and tad thre n tnsrnn tearltf trrn T has tn oredt d to e e te hlte tf the sndirehio aeartdtf trrns afey aoug ath e -ahe sigtereeng tnd tnenheneo l arther ardseu troa Tnethe setded toaue and tfethe sawt ontnaeteenn an the setk eeusd ao enl af treu r ue oartenng otueried tnd toottes the r arlet ahicl tend orn teer ohre teleole tf the sastr ahete ng tf toeeteyng tnteut ooseh aore of theu y aeagteng tntn rtng aoanleterrh ahrhnterted tnsastenely aisg ng tf toueea en toaue y anter aaneonht tf the sane ng tf the
Pure nonsense, except a lot of “the” and “and”s.
It was actually saying “the” more often than the previous one, but it hadn’t learned about gerunds yet (no -ing).
Interestingly, many words here ended with “-ed” which means it was kinda grasping the idea of the past tense.
I let it go at it a few hundred more epochs (to a total of 750).
The output didn’t change too much, still a lot of “the”, “a” and “an”, and still no bigger structure. Here’s another sample:
Tn t srtriueth ao tnsect on tias ng the sasteten c wntnerseoa onplsineon was ahe ey thet tf teerreag tispsliaer atecoeent of teok ond ttundtrom tirious arrte of the sncirthio sousangst tnr r te the seaol enle tiedleoisened ty trococtinetrongsoa Trrlricswf tnr txeenesd ng tispreeent T wad botmithoth te tnsrtusds tn t y afher worsl ahet then
An interesting thing that emerged here though, was the use of prepositions and pronouns.
The network wrote “I”, “you”, “she”, “we”, “of” and other similar words a few times. All in all, prepositions and pronouns amounted to about 10% of the total sampled words.
This was an improvement, as the Network was clearly learning low-entropy words.
However, it was still far from generating coherent English texts.
I let it train 100 more epochs, and then killed it.
Here’s its last output.
thes was aooceett than engd and te trognd tarnereohs aot teiweth tncen etf thet torei The t hhod nem tait t had nornd tn t yand tesle onet te heen t960 tnd t960 wndardhe tnong toresy aarers oot tnsoglnorom thine tarhare toneeng ahet and the sontain teadlny of the ttrrteof ty tndirtanss aoane ond terk thich hhe senr aesteeeld Tthhod nem ah tf the saar hof tnhe e on thet teauons and teu the ware taiceered t rn trr trnerileon and
I knew it was doing its best, but it wasn’t really going anywhere, at least not quickly enough.
I thought of accelerating convergence speed with Batch Normalization.
However, I read on StackOverflow that BatchNorm is not supposed to be used with LSTM Neural Networks.
If any of you is more experienced with LSTM nets, please let me know if that’s right in the comments!
At last, I tried this same task with 10 characters as input and 10 as output.
I guess the model wasn’t getting enough context to predict things well enough though: the results were awful.
Conclusions so far
It is clear that a model trained from scratch on an AWS EC2 instance with such a small corpus cannot learn the intricacies of the English language, much less emulate the style of a writer.
For my next experiment, I thought of using a pretrained model, like GPT-2.
Training a Generative Text Model easily with TextGenRNN
I wanted this to be an easy enough experiment for my dear readers to replicate at home, so I shelved my GPT-2 pretensions for later (!remind me 2 months) and went for TextGenRNN: a Char-RNN model pretrained on Reddit and Hacker News texts.
From a quick glance, it is clear this model already understands a lot of English context and nuances, and without any extra training to make it fit Lovecraft’s corpus, it can already generate pretty coherent sentences.
The project’s Readme.md includes a description of the default model’s architecture, which, with what we’ve learned so far, you should be able to understand directly.
The included pretrained-model follows a neural network architecture inspired by DeepMoji. For the default model, textgenrnn takes in an input of up to 40 characters, converts each character to a 100-D character embedding vector, and feeds those into a 128-cell long-short-term-memory (LSTM) recurrent layer. Those outputs are then fed into another 128-cell LSTM. All three layers are then fed into an Attention layer to weight the most important temporal features and average them together (and since the embeddings + 1st LSTM are skip-connected into the attention layer, the model updates can backpropagate to them more easily and prevent vanishing gradients). That output is mapped to probabilities for up to 394 different characters that they are the next character in the sequence, including uppercase characters, lowercase, punctuation, and emoji. (if training a new model on a new dataset, all of the numeric parameters above can be configured)
The model also uses an Attention Layer, which I did not cover here, but I recommend you read about them in this paper (and if you know of any better material on the subject, please let me know!).
Training TextGenRNN to fit our custom data
Training a TextGenRNN text generator was extremely straightforward. You can see the code in my GitHub project (under the textgenrnn branch), but this is a snippet with the few commands I needed to use.
Results of using TextGenRNN to Generate Lovecraft-style sentences
Finally, after one epoch this Neural Network writes sentences that actually sound like coherent English, and contain only English words, without nonsense or gibberish.
I chose not to do more training epochs, as the dataset is very small, the model very big and I feared overfitting.
However, upon closer inspection, one can see that, even though local coherence is kept (each individual word, or even 3-gram or 4-gram, looks pretty reasonable in isolation), the sentences as a whole are not particularly coherent.
All in all, these results are similar to the ones I would expect from a Markov Chain. However, at least a Markov Chain guarantees that all the words will make individual sense (and not be gibberish/keyboard smashing).
At any rate, here are some examples. At least the words the Neural Network is writing and the “style” of the sentences roughly match Lovecraft’s, even if the results are not globally coherent.
The TextGenRNN library also lets us specify a “temperature”, which goes from 0 to 1 and makes the model pick “riskier” vs. “less risky” words. I tried temperatures of 0.5, 0.6 and 0.7; the results got less coherent as the temperature rose.
Temperature 0.5: “the first thing were so caused a man below their thing that had been the desert and the deep desert that sometimes the little of Dr. Presiden and Deni. The terrible moment was a close and long and stranger out of the standing beyond the tales of a barrel definite never curious and loathment and sec”

Temperature 0.6: “(...) There was no singular seat to the conclusion of the terror to the prayers (...)”

Temperature 0.7: “the start sensitive substance the army of which all sedeless beneath the scraws of that plose ocean of men shall abriew to the cruwp; and about the ground of the manile will said the regular tall of life. the Edaplow ten the cold moonless was to the parts of the retermores as I mention our account of a farm of time and sense of save of through the Mount of studies college and real of the defense of an act of the end of the unchampriness story of an leare among the time while the top of his condition”

Varying temperature (supposedly more realistic): “in the unknown seemed to the strange the still of the share the structure of the started the things--and that the same and King had some of the horror the light of the stronger the strange the strange The man and the light and the black face, and the strange sense of the precipice.”
As you can see, at temperature 0.7 the Network is “babbling” and writing words that don’t exist, like “unchampriness” or “Edaplow”. At temperature 0.5 and lower, however, the model tended to pick a lot of low-entropy words like “the” and “of”.
However, it is plain to see that the results from this pretrained model are massively better than those of my poor LSTM model trained from scratch, probably because the Lovecraft corpus is a very small dataset.
I feared trying more than one epoch would make the model overfit the small corpus and “plagiarize” its sentences.
However after 3 epochs, when sampling with varied temperatures, I obtained some of my favorite results:
“And the stone stretched and sent a bearded horror behind the strange stars of the hills. And stairs and dead an ancient strange death of the studies of the content.”

LSTM’s impression of H. P. Lovecraft
While it is clear, looking at other people’s work, that an LSTM Neural Network could learn to write like Lovecraft, my PC is not powerful enough to train a big enough model in a reasonable time from scratch.
Even if it were, Lovecraft’s (and probably any author’s) complete works form too short a corpus for a model to learn English from.
This means that, for text generation, we are forced to start from a pretrained model instead of reinventing the wheel.
In the future, I’d like to repeat this experiment with an even bigger model. I wonder if using GPT-2, the results would hold sentence-level coherence.
On a different note, I checked, and about 10% of the words in the corpus appear only once.
Is there any good practice I should follow if I removed them before training? Like replacing all nouns with the same one, sampling from clusters, or something? Please let me know! I’m sure many of you are more experienced with LSTM neural networks than I am.
Do you think this would have worked better with a different architecture? Something I should have handled differently? Please also let me know, I want to learn more about this.
Did you find any rookie mistakes on my code? Do you think I’m an idiot for not trying XYZ? Or did you actually find my experiment enjoyable, or maybe you even learned something from this article?
If you want to become a Data scientist, or learn something new, check out my Machine Learning Reading List!