Transformer-based models have rapidly become the state of the art for many Natural Language Processing (NLP) tasks, the best-known probably being OpenAI’s GPT-3. Such models can help people with a variety of writing tasks and take over some of the labor-intensive, manual work. Imagine you’re a retailer that sells several different models of a product type. On your website you need a product description to attract customers, and for SEO purposes it’s best not to use the same description for every similar product. Now, using your creativity to write a catchy product description is fun. Rewriting that description into 10 slightly different versions is not. In this blog we’ll explain how you can train a model that automatically generates paraphrases of an input sentence.

If you’re reading this article out of interest in NLP, you’ve probably heard about the transformer by now. If you haven’t, here’s a quick introduction. The transformer is a type of neural architecture that is particularly useful for sequence modelling [1]. Transformers have an encoder-decoder structure without any of the commonly used recurrences and convolutions, relying solely on self-attention mechanisms. This self-attention allows the model to look at the complete input sequence at once and gives it a much longer effective memory than recurrent architectures.

Figure 1: The original transformer architecture

With this transformer architecture, many multi-purpose language models have been created that are the state of the art for a wide variety of natural language processing and generation tasks. Think of the original transformer architecture as pasta dough: all the models that followed are different pasta shapes, each formed for its own purpose. The shape of our pasta influences our choice of sauce. Not every sauce suits every shape, just like different transformer-based models support different NLP tasks. Google’s BERT consists only of encoder blocks from the original transformer and is top of the class for tasks like question answering. The models in the GPT series, on the other hand, are built from only the decoder part of the transformer and stand out in natural language generation.

To choose our pasta shape, we have to look at our paraphrasing sauce. A paraphrase pair is two sentences that have the same content but use different words. For our task this means that we’re transforming an input sentence into a different sentence while maintaining the meaning of the input. This is a different type of task than question answering or free-form language generation and thus requires a different pasta shape. Since we’re converting one text into another, a model with an encoder-decoder structure works best. Therefore, we will use T5 to create our paraphrasing tool.


Google’s Text-to-Text Transfer Transformer, or T5, has an encoder-decoder structure that is very close to that of the original transformer and was pre-trained on 750 GB of diverse text [2]. The model was released in several sizes; in this article, we fine-tune the smallest pre-trained model of 60 million parameters with the Hugging Face Transformers library.

To generalize one model to multiple goals, T5 treats every downstream NLP task as a text-to-text problem. For training purposes, T5 expects each example to have an input and a target. During pre-training, inputs are corrupted by masking spans of words, and these masked spans become the target the model must predict. During fine-tuning the input remains untouched and the target is the desired output.


Figure 2: Pre-training and fine-tuning of T5

In order to use the model for different goals, a prefix is added to the input in the fine-tuning stage that indicates which task the model is training for with that specific input. In our case, input and output are pairs of paraphrases, so we add the prefix “paraphrase:” to each input. Once fine-tuned, the model recognizes these prefixes and the corresponding tasks: it will return a paraphrase if you start your input with “paraphrase” and not, for example, a translation.
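This preprocessing step amounts to a one-line string operation; a minimal sketch (the helper name `add_task_prefix` and the example sentences are our own):

```python
def add_task_prefix(sentence: str, task: str = "paraphrase") -> str:
    """Prepend the task prefix T5 uses to route an input to a task."""
    return f"{task}: {sentence}"

source = add_task_prefix("How do I learn to cook pasta?")
# The target of this training example is simply the paired paraphrase,
# e.g. "What is the best way to learn cooking pasta?" -- no prefix needed.
print(source)  # paraphrase: How do I learn to cook pasta?
```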

Figure 3: Fine-tuning T5 for different tasks


Let’s get all the ingredients for our sauce together. To make sure our model has enough data to learn from, we’re combining three publicly available datasets of English paraphrase pairs:

- Quora Question Pairs
- PAWS-Wiki
- MSRP (Microsoft Research Paraphrase Corpus)

The data in these sets were originally labelled as paraphrase or non-paraphrase, so for our purpose we can filter out the non-paraphrases. This leaves us with a total of 146,663 paraphrase pairs. However, the Quora Question Pairs dataset is four times as large as the PAWS-Wiki and MSRP datasets combined. Training the model on such a large number of questions carries the risk of generating only questions, because the input would mostly consist of questions. For this reason, we create a balanced final dataset with an equal number of questions and non-questions. This final dataset contains 65,608 paraphrase pairs, which are split into train, test and validation sets with a distribution of 0.8, 0.1 and 0.1 respectively.
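The balancing and splitting logic can be sketched in plain Python (the heuristic of treating anything ending in a question mark as a question is our own simplification):

```python
import random

def balance_and_split(pairs, seed=42):
    """Balance questions vs. non-questions, then split 0.8/0.1/0.1.

    `pairs` is a list of (input_sentence, paraphrase) tuples.
    """
    questions = [p for p in pairs if p[0].rstrip().endswith("?")]
    others = [p for p in pairs if not p[0].rstrip().endswith("?")]
    n = min(len(questions), len(others))  # equal amounts of each
    rng = random.Random(seed)
    balanced = rng.sample(questions, n) + rng.sample(others, n)
    rng.shuffle(balanced)
    i, j = int(0.8 * len(balanced)), int(0.9 * len(balanced))
    return balanced[:i], balanced[i:j], balanced[j:]
```

For example, feeding in 15 question pairs and 5 non-question pairs yields a balanced set of 10 pairs, split 8/1/1.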

Output filtering

As mentioned before, paraphrasing is converting one sentence into another while maintaining the meaning. This puts two constraints on the results. First of all, the input and output should semantically be as close as possible. Secondly, there should be enough difference in the use of words. Two identical sentences are obviously semantically close, but they don’t paraphrase each other. At the same time, it’s difficult to maintain the exact same meaning when changing every word in a sentence, since a sentence can take on a different nuance. These two constraints are our salt and pepper. A good balance between the two will produce a good result.

To measure the semantic distance between the input and the results, we use Google’s Universal Sentence Encoder (USE) to create an embedding of each sentence. These embeddings are 512-dimensional vectors, produced in such a way that related sentences lie closer to each other in the vector space than unrelated ones. This way we can calculate the cosine similarity between the vector of the input and the vector of an output to express their semantic similarity as a value ranging from 0 to 1. The higher this value, the more related the two sentences are.
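Once each sentence has been embedded, the similarity computation itself is straightforward; a small NumPy sketch (using made-up vectors in place of real USE embeddings):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# In practice a and b would be 512-dimensional USE embeddings.
a = np.array([0.1, 0.3, 0.6])
b = np.array([0.1, 0.3, 0.5])
print(round(cosine_similarity(a, b), 4))
```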

To quantify the differences in surface form, we apply two measures. ROUGE-L measures the longest common subsequence of words between the input and the output. BLEU measures n-gram precision at word level. With both measures, identical sentences score 1 and sentences without a single word in common score 0. To make sure the outputs are not too similar to the input, we set a maximum value these measures may take on: outputs whose ROUGE-L or BLEU score exceeds the cut-off are discarded. We set both cut-off values to 0.7, but you can adjust to taste if necessary.
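As an illustration, ROUGE-L can be computed from the longest common subsequence; a minimal pure-Python version of the F-measure variant might look like this (in practice you would likely reach for an existing package such as `rouge-score`):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def rouge_l(reference: str, candidate: str) -> float:
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For example, "the cat sat" vs. "the dog sat" shares the subsequence "the ... sat", giving precision and recall of 2/3 and hence an F1 of 2/3.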

After the results are filtered, the remaining outputs are ranked according to their USE score and the output with the highest value is selected as the best paraphrase.
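Putting filtering and ranking together, assuming USE, ROUGE-L and BLEU scores have already been computed for each candidate (the function name and the example scores below are illustrative):

```python
def select_best(candidates, max_overlap=0.7):
    """Drop candidates that overlap too much with the input, then
    return the remaining candidate with the highest USE score.

    `candidates` is a list of (sentence, use_score, rouge_score, bleu_score).
    """
    kept = [c for c in candidates if c[2] <= max_overlap and c[3] <= max_overlap]
    if not kept:
        return None  # every output was too close to the input
    return max(kept, key=lambda c: c[1])[0]

candidates = [
    ("Is tap water in Italy good for drinking?", 1.00, 1.00, 1.00),  # identical: filtered out
    ("Is tap water good for drinking in Italy?", 0.96, 0.63, 0.31),
    ("Can you drink the water in Italy?",        0.88, 0.40, 0.10),
]
print(select_best(candidates))  # Is tap water good for drinking in Italy?
```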


Now that we’ve finished cooking, let’s taste! For each input, we let the model generate 10 possible paraphrases. Next, we filter and rank those based on the USE, ROUGE-L and BLEU scores. The sentence that is ranked on top is our result paraphrase. Here are some examples of paraphrases generated with our model.

Example 1
Input:  Having won the 2001 NRL Premiership, the Knights traveled to England to play the 2002 World Club Challenge against Super League champions, the Bradford Bulls.
Output: The Knights came to England for the World Club Challenge 2002 against the Bradford Bulls, the Super League champions after winning the 2001 NRL Premiership.
Scores: USE 0.9611, ROUGE-L 0.4651, BLEU 0.2411

Example 2
Input:  Is tap water in Italy good for drinking?
Output: Is tap water good for drinking in Italy?
Scores: USE 0.9641, ROUGE-L 0.6250, BLEU 0.3083

Example 3
Input:  Make sure you have the right gear when you explore the nature.
Output: When exploring nature, make sure you have the right gear.
Scores: USE 0.8350, ROUGE-L 0.5999, BLEU 0.3887

These results look pretty good. Numbers 2 and 3 are paraphrased by adapting the word order of the input sentence, which is also reflected in their relatively high ROUGE-L and BLEU scores. Example number 1 shows a great balance between salt and pepper: the sentence structure is completely different while the semantic similarity is extremely high. Our paraphrasing model is a recipe for success!


[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser and I. Polosukhin, "Attention Is All You Need," Advances in Neural Information Processing Systems, pp. 5998-6008, 2017.
[2] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li and P. Liu, "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer," Journal of Machine Learning Research, vol. 21, no. 140, pp. 1-67, 2020.