Let me start by saying that TorchText is one of the most underrated tools out there. If you work on NLP projects, you probably know by now that much of your time goes into writing boilerplate code for batching, padding, and converting between strings and integers. TorchText is a library from the PyTorch project that saves you time by automating these utility tasks.

Getting Started

To install PyTorch, see installation instructions on the PyTorch website.

To install TorchText:

pip install torchtext

We’ll also use spaCy to tokenize our data. To install spaCy, follow the instructions on the spaCy website, making sure to install the English and German models (or the models for whichever languages you need) with:

python -m spacy download en
python -m spacy download de


One of the main concepts of TorchText is the Field. A Field defines how your data should be processed. In our sentiment classification task, the data consists of the raw string of the review and its sentiment, either “pos” or “neg”.

The parameters of a Field control that processing. We use the TEXT field to define how the review should be processed, and the LABEL field to process the sentiment. Our TEXT field has tokenize='spacy' as an argument. This specifies that tokenization (the act of splitting the string into discrete “tokens”) should be done using the spaCy tokenizer. If no tokenize argument is passed, the default is simply to split the string on whitespace.
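
To make the idea concrete, here is a rough plain-Python sketch of the kind of work a text field takes off your hands: tokenize, build a vocabulary, and turn tokens into integers. It is not TorchText's implementation. The function names, the "<unk>" and "<pad>" special tokens, and the ordering of the vocabulary are all assumptions made for illustration.

```python
from collections import Counter

def whitespace_tokenize(text):
    """Split a raw string into tokens on whitespace (the default described above)."""
    return text.split()

def build_vocab(tokenized_texts, specials=("<unk>", "<pad>")):
    """Give every token an integer index; special tokens come first.

    The ordering (most frequent first, ties broken alphabetically) is an
    assumption of this sketch.
    """
    counts = Counter(tok for toks in tokenized_texts for tok in toks)
    itos = list(specials) + sorted(counts, key=lambda t: (-counts[t], t))
    stoi = {tok: i for i, tok in enumerate(itos)}
    return stoi, itos

def numericalize(tokens, stoi):
    """Convert tokens to indices; unseen tokens fall back to <unk>."""
    return [stoi.get(tok, stoi["<unk>"]) for tok in tokens]

reviews = ["a great great film", "a dull film"]
tokenized = [whitespace_tokenize(r) for r in reviews]
stoi, itos = build_vocab(tokenized)

# "movie" never appeared in the reviews, so it maps to the <unk> index.
ids = numericalize(whitespace_tokenize("a great movie"), stoi)
print(ids)
```

Swapping whitespace_tokenize for a smarter tokenizer such as spaCy's changes only the first step; the vocabulary and the integer conversion stay the same.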

LABEL is defined by a LabelField, a subclass of the Field class used specifically for handling labels.
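
Labels need much less machinery than text: there is nothing to tokenize and nothing to pad, only a small mapping from label strings to class indices. The sketch below shows that idea in plain Python. The helper name build_label_vocab is hypothetical, and the alphabetical ordering of the classes is an assumption of this sketch rather than a statement about how TorchText orders them.

```python
def build_label_vocab(labels):
    """Map each distinct label string to a class index.

    No <unk> or <pad> entries: every label is expected to be known,
    and a label is a single value, not a sequence.
    """
    itos = sorted(set(labels))  # alphabetical order is an assumption here
    stoi = {label: i for i, label in enumerate(itos)}
    return stoi, itos

labels = ["pos", "neg", "neg", "pos"]
label_stoi, label_itos = build_label_vocab(labels)
label_ids = [label_stoi[label] for label in labels]
print(label_itos, label_ids)
```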

The fields know what to do when given raw data. Now we need to tell the fields which data they should work on. This is where Datasets come in. TorchText has several built-in Datasets that handle common data formats. For csv/tsv files, the TabularDataset class is convenient.

| Name | Description | Use Case |
|---|---|---|
| TabularDataset | Takes paths to csv/tsv files and json files or Python dictionaries as inputs. | Any problem that involves a label (or labels) for each piece of text |
| LanguageModelingDataset | Takes the path to a text file as input. | Language modeling |
| TranslationDataset | Takes a path and extensions to a file for each language, e.g. if the files are English: “hoge.en”, French: “”, then path=”hoge”, exts=(“en”,”fr”). | Translation |
| SequenceTaggingDataset | Takes a path to a file with the input sequence and output sequence separated by tabs. | Sequence tagging |
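
Conceptually, a tabular dataset pairs each column of a file with the field that should process it. The following plain-Python sketch illustrates that pairing for a tsv file. It is not the TabularDataset class itself. The load_tabular helper, its parameters, and the tiny in-memory file are all made up for this example.

```python
import csv
import io

def load_tabular(file_obj, fields, delimiter="\t", skip_header=True):
    """Read delimited rows and apply one processing function per column.

    fields is a list of (name, process_fn) pairs in column order.
    A pair whose name is None means "ignore this column".
    """
    reader = csv.reader(file_obj, delimiter=delimiter)
    if skip_header:
        next(reader, None)  # drop the header row if there is one
    examples = []
    for row in reader:
        example = {}
        for (name, process), value in zip(fields, row):
            if name is not None:
                example[name] = process(value)
        examples.append(example)
    return examples

# A two-row tsv held in memory, standing in for a file on disk.
tsv = io.StringIO("id\ttext\tlabel\n1\tA great film\tpos\n2\tA dull film\tneg\n")
fields = [
    (None, None),                          # skip the id column
    ("text", lambda s: s.lower().split()), # lowercase, then split on whitespace
    ("label", lambda s: s),                # keep the label string as-is
]
examples = load_tabular(tsv, fields)
print(examples[0])
```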


The final step of preparing the data is creating the iterators. We iterate over these in the training/evaluation loop, and they return a batch of examples (indexed and converted into tensors) at each iteration. In torchvision and PyTorch, the processing and batching of data is handled by DataLoaders. For some reason, torchtext calls the objects that play this role Iterators. The basic functionality is the same, but Iterators, as we will see, have some convenient functionality that is unique to NLP.

| Name | Description | Use Case |
|---|---|---|
| Iterator | Iterates over the data in the order of the dataset. | Test data, or any other data where the order is important. |
| BucketIterator | Buckets sequences of similar lengths together. | Text classification, sequence tagging, etc. (use cases where the input is of variable length) |
| BPTTIterator | An iterator built especially for language modeling that also generates the input sequence delayed by one timestep. It also varies the BPTT (backpropagation through time) length. This iterator deserves its own post, so I’ll omit the details here. | Language modeling |
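
To see why bucketing helps and what “delayed by one timestep” means, here is a simplified plain-Python sketch of both ideas. It is not how TorchText implements its iterators: the function names are hypothetical, pad_id=1 is an arbitrary choice, the bucketing is a plain sort with no shuffling, and the BPTT length is fixed rather than varied.

```python
def bucket_batches(sequences, batch_size, pad_id=1):
    """Group sequences of similar length, then pad each batch to its own max length."""
    ordered = sorted(sequences, key=len)
    batches = []
    for start in range(0, len(ordered), batch_size):
        chunk = ordered[start:start + batch_size]
        max_len = max(len(seq) for seq in chunk)
        batches.append([seq + [pad_id] * (max_len - len(seq)) for seq in chunk])
    return batches

def bptt_pairs(token_ids, bptt_len):
    """Cut one long token stream into (input, target) chunks.

    Each target is the input shifted by one timestep, so the model
    learns to predict the next token at every position.
    """
    pairs = []
    for start in range(0, len(token_ids) - 1, bptt_len):
        end = min(start + bptt_len, len(token_ids) - 1)
        pairs.append((token_ids[start:end], token_ids[start + 1:end + 1]))
    return pairs

seqs = [[5, 6, 7, 8, 9], [2], [3, 4], [2, 3, 4, 5]]
batches = bucket_batches(seqs, batch_size=2)
print(batches)

stream = list(range(10, 17))  # stands in for a numericalized text file
pairs = bptt_pairs(stream, bptt_len=3)
print(pairs)
```

With these four sequences, bucketing needs only two padding tokens in total, whereas padding all of them to the longest sequence would need eight.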

