In this post you’ll see how to add a sampling step/mode to TensorFlow’s language modeling tutorial. Full code is available here. It is truly amazing how much RNNs can learn from very little data. For the fun D. Trump samples, scroll down.

A recurrent neural network (as its name implies) is a neural network with a recurrent connection. That is, unlike a simpler feed-forward network, it considers its previous state in addition to the current input. Because of this, RNNs are a natural model of choice for many kinds of sequential data: text, speech, audio, video, etc. Sometimes, RNN models (particularly different LSTM flavors) can be unreasonably effective. For a more in-depth explanation and step-by-step walkthrough of RNNs and LSTMs (Long Short-Term Memory networks) I recommend this excellent blog post. Frequently, RNNs are used as building blocks in bigger, more complex models, and in that form they have helped establish new state-of-the-art results in speech recognition, machine translation, language modeling and other tasks.
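To make the recurrence concrete: a vanilla RNN cell computes its new hidden state as h_t = tanh(W_x * x_t + W_h * h_{t-1} + b), i.e. a function of the current input x_t and the previous hidden state h_{t-1}. LSTM cells build on the same recurrence but add gates that control what gets kept, forgotten and exposed at each step.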

In this post, I’ll start with TensorFlow’s language modeling tutorial and modify it a little bit to make it more fun.

To play with the code below (and deep learning in general) it is highly recommended that you have access to at least one higher-end NVIDIA GPU. Here is my post with example hardware. You’ll also need TensorFlow installed.

PTB text and char models

TensorFlow’s language modeling tutorial uses a fairly small model on the very small Penn Treebank (PTB) dataset. Still, it is a great introduction to language modeling with RNNs. Its “large” model takes about 3.5 hours to train on a single GTX 1080.

All the source code is available on GitHub.

Language modeling

Language modeling is the task of learning a probability distribution P(w_1, ..., w_n) over the set of all possible word sequences. The goal is to learn a distribution P under which real sentences have much higher probability than random sets of words. Once such a distribution is learned, we can use it as a generative model and sample from it to generate new text. Sampling from a language model is the most fun part and, sadly, it is missing from the TensorFlow tutorial, so we add it here.
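Concretely, the joint probability factors via the chain rule as P(w_1, ..., w_n) = P(w_1) * P(w_2 | w_1) * ... * P(w_n | w_1, ..., w_{n-1}), and the RNN estimates each conditional P(w_k | w_1, ..., w_{k-1}) from its hidden state. Sampling then simply means drawing w_1, w_2, ... one token at a time from these conditionals, which is exactly what the code below does.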

First, we modify the model’s graph to include a sampling operation:

self.sample = tf.multinomial(logits, 1)  # this is our sampling operation
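tf.multinomial treats each row of logits as an unnormalized log-probability distribution over the vocabulary and draws one index from it. As an aside, if you want to control how adventurous the samples are, a common tweak (my own suggestion, not part of the tutorial) is to divide the logits by a temperature before sampling:

# Hypothetical variant: temperature-scaled sampling (not in the original tutorial).
# temperature > 1.0 flattens the distribution (more surprising samples),
# temperature < 1.0 sharpens it (more conservative, repetitive samples).
temperature = 0.8
self.sample = tf.multinomial(logits / temperature, 1)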

Next, we add a sampling function which will feed the data and actually run sampling:

def do_sample(session, model, data, num_samples):
  """Sample num_samples tokens from the model, seeded with the sequence in data."""
  samples = []
  state = session.run(model.initial_state)
  fetches = [model.final_state, model.sample]
  sample = None
  # First, feed the entire seeding sequence through the model to warm up its state.
  for x in data:
    feed_dict = {}
    feed_dict[model.input_data] = [[x]]
    for layer_num, (c, h) in enumerate(model.initial_state):
      feed_dict[c] = state[layer_num].c
      feed_dict[h] = state[layer_num].h

    state, sample = session.run(fetches, feed_dict)
  if sample is not None:
    samples.append(sample[0][0])
  else:
    samples.append(0)
  # Then keep feeding the model its own previous sample.
  k = 1
  while k < num_samples:
    feed_dict = {}
    feed_dict[model.input_data] = [[samples[-1]]]
    for layer_num, (c, h) in enumerate(model.initial_state):
      feed_dict[c] = state[layer_num].c
      feed_dict[h] = state[layer_num].h
    state, sample = session.run(fetches, feed_dict)
    samples.append(sample[0][0])
    k += 1
  return samples

Notice that this function accepts a seeding sequence: the first sample is drawn only after the whole seed has been fed in, and every sample after that is conditioned on both the seed and the previously generated samples. You can generate sampled sequences of arbitrary length.
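For example, a call like this (using the same session, model and vocabulary objects that appear in the training script below) would feed a four-word seed and then generate 30 word ids:

# Hypothetical usage of do_sample with a four-word seed.
seed_ids = [word_to_id[w] for w in "the balance is supplied".split()]
sampled_ids = do_sample(session, mtest, seed_ids, 30)  # returns 30 generated ids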

The PTB dataset is tiny by modern standards: it has only 887,521 words in the training set, with a vocabulary of 10,000 distinct words. I will train two models on this dataset: 1) a language model that uses words as input and 2) a language model that uses only characters as input.

PTB word based language model

Here, I simply follow TensorFlow’s tutorial and achieve a test perplexity of 78.853, which is consistent with the results reported in the original paper, Zaremba et al., Recurrent Neural Network Regularization. Before training the model, I added a sampling step that runs before every epoch:

def pretty_print(items, is_char_model, id2word):
  """Convert a list of ids back into readable text."""
  if not is_char_model:
    return ' '.join([id2word[x] for x in items])
  else:
    return ''.join([id2word[x] for x in items]).replace('_', ' ')


# Sample before every epoch, using the seed phrase supplied on the command line.
print("Seed: %s" % pretty_print([word_to_id[x] for x in seed_for_sample], config.is_char_model, id_2_word))
print("Sample: %s" % pretty_print(do_sample(session, mtest, [word_to_id[word] for word in seed_for_sample], max(5 * (len(seed_for_sample) + 1), 10)), config.is_char_model, id_2_word))

My seed phrase is “the balance is supplied”, which appears in the original data. Before any training, the randomly initialized model gives me something like this:

“influx stretching stein formula sell petco intellectual underwear conglomerate rowe microsoft than audio exactly cardiovascular azoff order boasts usx child-care 26-week petrie commodity misconduct recycling”

As expected, this doesn’t make sense, meaning that our (currently random) probability distribution over word sequences isn’t particularly useful. Let’s see what we get closer to the end of training. Again, the same seed phrase: “the balance is supplied”. After epoch 54 I get:

“by slowing growth jack chips the government 's chief financial officer in detroit said the intention of investment to be produced by citicorp”

and after epoch 55 I get:

“as defendants allow many purchasers to participate in proportion to those who are no greater than a temporary recession to invest he admits the”

Note that, at least grammatically, these look much more like English. For example, after “the balance is supplied”, the words “as” or “by” are much more likely than the word “influx”, which grammatically does not make sense. Also, the “topic” of the last two phrases is finance/investing, while the first one is just a random set of words (as it should be).

PTB character based language model

I’ve included the config class I use in the code:

class CharLargeConfig1(object):
  """Large config."""
  is_char_model = True
  optimizer = 'RMSPropOptimizer' # at least initially, seems to produce better results
  init_scale = 0.004
  learning_rate = 0.01
  max_grad_norm = 15
  num_layers = 3 # three layers instead of the 2 used in the word-based model
  num_steps = 128 # we should go back much further since our step in this case is a character, not a word
  hidden_size = 512 # this should be plenty for a character-based model
  max_epoch = 14
  max_max_epoch = 255
  keep_prob = 0.5 # dropout is still important to avoid overfitting
  lr_decay = 1 / 1.15
  batch_size = 16
  vocab_size = 10000 # this will be replaced by the actual vocabulary size of only 50
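The optimizer field is just a string naming an optimizer class in tf.train. One simple way to resolve such a string (an illustrative sketch of the idea, not necessarily what the repo does) is with getattr:

# Illustrative sketch: turn the optimizer name from the config into a tf.train optimizer.
# `config`, `learning_rate` and `cost` stand in for the tutorial's own variables.
optimizer_class = getattr(tf.train, config.optimizer)  # e.g. tf.train.RMSPropOptimizer
train_op = optimizer_class(learning_rate).minimize(cost)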

I use the same seed phrase for comparison: “the balance is supplied”. When I sample from the randomly initialized model, I get something like this:

usb9xkrd9ruaias$dsaqj’4lmjwyd61\se.lcn6jey0pbco40ab’65<8um324 nqdhm<ufwt#y*/w5bt’nm.zq«2rqm-a2'2mst#u315w&tNwdqNafqh

This is just a random sequence of characters. Amazingly, after only the first epoch (validation perplexity of 3.64), I get:

to will an apple for a N shares of the practeded to working rudle and a dow listed that scill extressed holding a

After just one epoch the model seems to have learned English words. While the phrase itself does not make sense, the words in it are undoubtedly English, with a typo or two. After epoch 76 the validation perplexity is 3.076 and the sample looks like this:

president economic spokesman executive for securities was support to put used the sharelike the acquired who pla

Again, undoubtedly English, but still not much sense. The validation perplexity, however, hasn’t improved much since the first epoch.
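As a reminder, the perplexity numbers quoted throughout this post are just the exponential of the average per-token cross-entropy. A minimal sketch with made-up loss values:

import numpy as np

# Hypothetical per-character cross-entropy losses (in nats) averaged over an epoch.
losses = [1.29, 1.10, 1.15, 1.08]
perplexity = np.exp(np.mean(losses))  # exp of the average negative log-likelihood
print(perplexity)                     # about 3.2 for these made-up numbers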

Donald Trump speech generator

I am writing this in the USA, where a presidential election is currently underway. This time we have some unusual candidates, and I wanted to see how much funny stuff a character-level RNN can learn from their speeches.

I collected several transcripts of Donald Trump speeches. You can get them from GitHub. All these speeches are the intellectual property of Donald and are here for language modeling research purposes only! I use 7 transcripts as the training set and 1 as the validation set. Note that this is an extremely tiny training set: it contains only 21,841 words, an order of magnitude smaller than even the PTB set. It would be truly amazing if the model could learn anything from such a small set. Naturally, I will be training a character-based model, not a word-based one. I also lowercase all characters and ignore everything except letters and some punctuation characters.
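That preprocessing boils down to something like the following sketch (the exact character whitelist here is my own assumption; see the repo for the actual code):

import re

def normalize_speech(text):
  """Lowercase the text and keep only letters, spaces and basic punctuation."""
  text = text.lower()
  # The allowed punctuation set below is illustrative, not necessarily the repo's.
  return re.sub(r"[^a-z .,?!'\n]", "", text)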

Here is the training config I used:

class CharLargeConfig(object):
  """Large config."""
  is_char_model = True
  optimizer = 'MomentumOptimizer'
  init_scale = 0.004
  learning_rate = 0.05
  max_grad_norm = 15
  num_layers = 3
  num_steps = 100
  hidden_size = 512
  max_epoch = 14
  max_max_epoch = 255
  keep_prob = 0.5
  lr_decay = 1 / 1.15
  #batch_size = 64
  batch_size = 1 # only one because our dataset, and therefore each epoch, is so small
  vocab_size = 10000 # again, this will be inferred from training set stats and in practice will be only 32

As my seed phrase for sampling I am using “make america”. Naturally, the randomly initialized model produces a random set of characters. But then we start getting something like this:

a border will be a folk the everyn country, last an lobfer as no

after epoch 4 (train perplexity of 5.486). This not only looks more like English, it also resonates with topics important to Donald (e.g. border, country). Again, before training the model knows nothing about English or about Donald; it only sees characters as input. Closer to the end of training (epoch 55, train perplexity of 3.029) I am getting something like this:

n plan will be again.the radical islamic steel that establishme

Let’s use the final model and sample more text from it. You can give any seed you want as input and sample as many characters as you’d like.

If you’ve trained your model like this:

python ~/repos/rnn_text_writer/ptb_word_lm.py --data_path=DSpeeches --file_prefix=dtrump --seed_for_sample="make america" --model=charlarge --save_path=charlarge

You can sample from it by adding the sampling flag:

python ~/repos/rnn_text_writer/ptb_word_lm.py --data_path=DSpeeches --file_prefix=dtrump --seed_for_sample="make america" --model=charlarge --save_path=charlarge --sample_mode=True

So, let’s play with it:

  • seed: “hillary”. Continuation:

national creeship and dillarys for law. protected much to strong are assing radical issue of hillary clinton will end speech taxes. theyre administration almost choa. it is time to federal people.

  • seed: “russia”. Continuation:

n, will not even work at our victory of as now? radical immigrants who start only has been me as federal erol trapped into being shongth mentions.

  • seed: “war on terror”. Continuation:

ism around who are many earels bill closs institute and tonight, they need to couse today, she has been every, and secure the demainst tould came they end by the failed radical islam immigration

  • seed: “build a wall around mexico”. Continuation:

is highated to have the transpacifical great illegal immigrants will be able her signing too administration.this includes now is not a children, theyve reform everyant at our workers

  • seed: “our economy is”. Continuation:

about pulpneedressed delieve nothing to already families under the american people who want americans. we will cause the shifted by the national illegal immigrants shes cannot fwart to protect our

  • and, finally, a bigger sample from RNN Donald, seed: “if i win elections”. Continuation:

the hillary clinton cheated, and we have on our grussics and close.just that how for the convioration will be her its redicentable that. this includes immigration class nearly raxs activity, and terrorism will, sobificial politicians man at sofice. our new good. never runes to obamaclinton, just one who have provide the reprofuction, needs of corations, and brutally friend to do the american people, for it away. which theng me in it, the u.s.you cant not believe. we have an igned to speech. together, we are going to take a members, or doctors with system in order to every other agreements has been so served isis about how our provided. there is how surrend.remember, all up for the radical people their eeerly stuel million projects and crime.this are going to know about her alartpess of anysoming now talking hillary clintonwages?i have come and better east, and me tunn this taxard obama grow and the must make america many

Your actual samples will obviously vary, since sampling is inherently random.
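If you need repeatable samples for debugging, one option (not a flag the script exposes; just a standard TensorFlow facility) is to fix the graph-level random seed before the graph is built:

# Fixing the graph-level seed makes the tf.multinomial draws repeatable across runs,
# assuming the rest of the graph construction stays the same.
tf.set_random_seed(1234)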

Conclusion

Recurrent neural networks are an extremely powerful tool for text modeling. I was quite amazed that a 3-layer LSTM model was able to learn from such a tiny text (just 21,841 words in the D. Trump example)! RNNs can also be used for image caption generation, chatbots, question answering and many other applications. And the wide availability of great open-source tools such as TensorFlow (from Google), CNTK (from Microsoft) and others, coupled with the amazing compute capabilities of modern GPUs, makes deep learning extremely exciting.