The goal of this post is to re-create the simplest LSTM-based language model from Tensorflow’s tutorial in PyTorch.

PyTorch is a deep learning framework based on the popular Torch and is actively developed by Facebook. It has implementations of many modern neural-network layers and functions and, unlike the original Torch, has a Python front-end (hence “Py” in the name).

One of the key differences between PyTorch and Tensorflow is that the computational graph in PyTorch is dynamic, whereas in Tensorflow it is static. See this post for discussion.
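To make "dynamic" concrete, here is a minimal sketch in which the number of operations recorded for backpropagation depends on the data itself. It uses the current PyTorch tensor API (`requires_grad=True`) rather than the `Variable` wrapper from PyTorch 0.1.12, and the function name `repeated_double` and the threshold are made up for illustration.

```python
import torch

def repeated_double(x, threshold=10.0):
    """Double x until its norm reaches the threshold.

    The number of loop iterations depends on the values in x, so the
    graph that autograd records is built on the fly for each call.
    """
    steps = 0
    y = x
    while y.norm().item() < threshold:
        y = y * 2
        steps += 1
    return y, steps

x = torch.ones(3, requires_grad=True)
y, steps = repeated_double(x)

# Backpropagate through however many doublings actually happened.
y.sum().backward()
print("doublings:", steps)
print("gradient:", x.grad)
```

With a static graph, a data-dependent loop like this would typically need a dedicated control-flow operator instead of an ordinary Python `while`.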

Model and Code

The code is available here. If you aren’t sure what language modeling is, what LSTMs are, and how it all fits together, have a look at Tensorflow’s tutorial and my older post where I play with character-level models.

The model is fully defined in the file

import torch.nn as nn
from torch.autograd import Variable

class LM_LSTM(nn.Module):
  """Simple LSTM-based language model"""
  def __init__(self, embedding_dim, num_steps, batch_size, vocab_size, num_layers, dp_keep_prob):
    super(LM_LSTM, self).__init__()
    self.embedding_dim = embedding_dim
    self.num_steps = num_steps
    self.batch_size = batch_size
    self.vocab_size = vocab_size
    self.dp_keep_prob = dp_keep_prob
    self.num_layers = num_layers
    self.dropout = nn.Dropout(1 - dp_keep_prob)
    self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
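    # The hidden size equals the embedding size, as forward() and
    # init_hidden() both assume.
    self.lstm = nn.LSTM(input_size=embedding_dim,
                        hidden_size=embedding_dim,
                        num_layers=num_layers,
                        dropout=1 - dp_keep_prob)
    # Projects LSTM outputs to vocabulary logits.
    self.sm_fc = nn.Linear(in_features=embedding_dim,
                           out_features=vocab_size)
    self.init_weights()

  def init_weights(self):
    # Assumed: uniform initialization in [-init_range, init_range] for the
    # embedding and output-layer weights.
    init_range = 0.1
    self.word_embeddings.weight.data.uniform_(-init_range, init_range)
    self.sm_fc.weight.data.uniform_(-init_range, init_range)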

  def init_hidden(self):
    weight = next(self.parameters()).data
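    # Zero initial (h, c) states, created on the same device and with the
    # same type as the model parameters.
    return (Variable(weight.new(self.num_layers, self.batch_size, self.embedding_dim).zero_()),
            Variable(weight.new(self.num_layers, self.batch_size, self.embedding_dim).zero_()))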

  def forward(self, inputs, hidden):
    embeds = self.dropout(self.word_embeddings(inputs))
    lstm_out, hidden = self.lstm(embeds, hidden)
    lstm_out = self.dropout(lstm_out)
    logits = self.sm_fc(lstm_out.view(-1, self.embedding_dim))
    return logits.view(self.num_steps, self.batch_size, self.vocab_size), hidden

def repackage_hidden(h):
  """Wrap hidden states in new Variables to detach them from their history."""
  if isinstance(h, Variable):
    return Variable(h.data)
  else:
    return tuple(repackage_hidden(v) for v in h)

Personally, I find this a little more readable than Tensorflow’s code. One key difference is that here nn.LSTM describes the whole multi-layer, multi-step subnetwork, whereas RNN cells in Tensorflow typically describe one step of computation and need to be wrapped in a for loop or in helper functions such as static_rnn or dynamic_rnn.
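The whole-sequence versus single-step distinction can also be seen inside PyTorch itself, which offers both nn.LSTM and the one-step nn.LSTMCell. The sketch below is not part of the original code: it copies the weights of a one-layer nn.LSTM into an nn.LSTMCell, runs the cell in an explicit Python loop, and compares the outputs. The sizes are arbitrary, and it uses the current PyTorch API rather than the 0.1.12 `Variable` wrapper.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
num_steps, batch_size, dim = 5, 2, 4  # arbitrary illustrative sizes

# Whole-sequence module: a single call consumes all time steps.
lstm = nn.LSTM(input_size=dim, hidden_size=dim, num_layers=1)

# Single-step cell that shares the same weights, for comparison.
cell = nn.LSTMCell(input_size=dim, hidden_size=dim)
with torch.no_grad():
    cell.weight_ih.copy_(lstm.weight_ih_l0)
    cell.weight_hh.copy_(lstm.weight_hh_l0)
    cell.bias_ih.copy_(lstm.bias_ih_l0)
    cell.bias_hh.copy_(lstm.bias_hh_l0)

inputs = torch.randn(num_steps, batch_size, dim)  # (seq, batch, feature)
h0 = torch.zeros(1, batch_size, dim)
c0 = torch.zeros(1, batch_size, dim)

with torch.no_grad():
    # One call over the full sequence.
    fused_out, _ = lstm(inputs, (h0, c0))

    # Explicit loop: one time step per call, state carried by hand.
    h, c = h0[0], c0[0]
    step_outs = []
    for t in range(num_steps):
        h, c = cell(inputs[t], (h, c))
        step_outs.append(h)
    looped_out = torch.stack(step_outs)

max_diff = (fused_out - looped_out).abs().max().item()
print("max abs difference:", max_diff)
```

Both paths compute the same function. The difference is that the whole-sequence call hands the framework the entire computation at once, while the loop issues one small call per step.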

This difference also has consequences for performance.

By default, if you run the code like this:

$ python --data=[PATH_TO_DATA]

you should get a test perplexity of around 78.04. I tried to match Tensorflow’s tutorial, but this isn’t an exact match. While the network structure is exactly the same, the weight initializations and learning-rate policies are similar but slightly different. However, the final test perplexity is about the same. It will also vary from run to run due to random initialization.
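For readers who want to relate perplexity to the training loss, here is a small sketch that assumes the usual definition: the exponential of the average per-word negative log-likelihood, measured in nats. The helper name `perplexity` and the vocabulary size of 10,000 are illustrative choices, not taken from the code above.

```python
import math

def perplexity(total_nll, num_words):
    """Perplexity = exp(average negative log-likelihood per word, in nats)."""
    if num_words <= 0:
        raise ValueError("num_words must be positive")
    return math.exp(total_nll / num_words)

# Sanity check: a model that assigns probability 1/V to every word
# has a perplexity equal to V.
vocab_size = 10000  # illustrative value
num_words = 500
uniform_nll = num_words * math.log(vocab_size)
uniform_ppl = perplexity(uniform_nll, num_words)

# Average per-word loss that corresponds to a perplexity of 78.04.
loss_at_reported_ppl = math.log(78.04)

print("uniform-model perplexity:", uniform_ppl)
print("per-word loss at perplexity 78.04:", loss_at_reported_ppl)
```

Under this definition, a perplexity of 78.04 means the model is, on average, about as uncertain as if it were choosing uniformly among roughly 78 words at each position.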


Now, the most interesting part. On a GTX 1080 (details on my hardware) I am getting:

  • Original Tensorflow’s tutorial - around 3,800 words per second.
  • Tensorflow’s tutorial where I try using XLA - 3,420 words per second.
  • My PyTorch code - around 7,400 words per second.
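The measurement code is not shown in this post, so the following is only a sketch of how a words-per-second figure of this kind is commonly computed: count batch_size * num_steps tokens per batch and divide by wall-clock time. The function name and the example arguments are hypothetical. The last lines compute the speedups implied by the three figures quoted above.

```python
def words_per_second(num_batches, batch_size, num_steps, elapsed_seconds):
    """Tokens processed per second, counting batch_size * num_steps per batch."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return num_batches * batch_size * num_steps / elapsed_seconds

# Hypothetical run: 100 batches of 20 sequences x 35 steps in 10 seconds.
example_wps = words_per_second(num_batches=100, batch_size=20,
                               num_steps=35, elapsed_seconds=10.0)

# Speedups implied by the throughput figures quoted above.
tf_wps, tf_xla_wps, pytorch_wps = 3800, 3420, 7400
speedup_vs_tf = pytorch_wps / tf_wps
speedup_vs_xla = pytorch_wps / tf_xla_wps

print("example throughput:", example_wps)
print("speedup over Tensorflow tutorial:", round(speedup_vs_tf, 2))
print("speedup over Tensorflow with XLA:", round(speedup_vs_xla, 2))
```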

Note that in the PyTorch code I am using the reader from Tensorflow’s tutorial (with minor adjustments).

I am using Python 3, Tensorflow r1.2.0-rc0 and PyTorch 0.1.12.


PyTorch seems to be a very nice framework. I find its code easy to read, and because it doesn’t require separate graph construction and session stages (like Tensorflow), I think it is more convenient, at least for simpler tasks. In this particular case, the PyTorch LSTM is also roughly 2x faster (about 7,400 versus 3,800 words per second). This saves a lot of time even on a small example like this.

The PyTorch LSTM network is faster because, by default, it uses cuDNN’s LSTM implementation, which fuses layers, steps and point-wise operations. See the blog post on this here.

Tensorflow’s RNNs (in r1.2), by default, do not use cuDNN’s RNN, and RNNCell’s ‘call’ function describes only one time-step of computation. Therefore, a lot of optimization opportunities are lost. On the flip side, though, this gives the user much more flexibility - provided that the user knows what they are doing. I am a little surprised that the XLA wrapper didn’t seem to help here. Also, Tensorflow has a wrapper around cuDNN’s RNN in tf.contrib, which can potentially be used for speeding up LSTMs in Tensorflow.