Language Model Fine-Tuning For Pre-Trained Transformers

Supercharging pre-trained Transformer models to deal with datasets containing domain-specific language.

Transfer Learning

Broadly speaking, Transfer Learning is the idea of taking the knowledge gained from performing some task and applying it towards performing a different (but related) task. Transformer models, currently the undisputed masters of Natural Language Processing (NLP), rely on this technique to achieve their lofty state-of-the-art benchmarks.

Transformer models are first trained on huge (and I mean huge) amounts of text in a step called “pre-training”. During this step, the models are expected to learn the words, grammar, structure, and other linguistic features of the language. The text is represented by tokens, each of which has its own unique ID. The collection of all such tokens is referred to as the vocabulary of the model. All the actual words in the text are iteratively split into pieces until the entire text consists only of tokens that are present in the vocabulary. This is known as tokenization. The idea behind tokenization is to convert the actual text into a numerical representation (a sequence of token IDs) which can be used with neural network (NN) models.
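To make the splitting step concrete, here is a minimal sketch of greedy longest-match tokenization, in the style of WordPiece. The tiny vocabulary, the "##" continuation prefix, and the function names are assumptions made for illustration; real models ship with their own vocabularies of tens of thousands of tokens and their own tokenizer implementations.

```python
# Toy vocabulary (an assumption for illustration): each token has a unique ID.
# "##" marks a piece that continues a word rather than starting one.
VOCAB = {
    "[UNK]": 0, "trans": 1, "##form": 2, "##er": 3, "##s": 4,
    "model": 5, "fine": 6, "##tun": 7, "##ing": 8,
}


def tokenize_word(word, vocab):
    """Split one word into vocabulary pieces, always taking the longest match."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece fits: the whole word maps to the unknown token.
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces


def encode(text, vocab):
    """Turn raw text into (tokens, token IDs) using whitespace + subword splitting."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(tokenize_word(word, vocab))
    ids = [vocab[token] for token in tokens]
    return tokens, ids
```

With this toy vocabulary, `encode("Transformers models", VOCAB)` splits the text into the pieces `trans`, `##form`, `##er`, `##s`, `model`, `##s`, and it is the matching list of IDs that the model actually sees.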

Once the text is converted into an NN-friendly format, we can go about training the model to understand language. One technique used for this is masked language modelling, where a certain percentage of the tokens in a sequence is replaced with a mask token and the model is asked to predict the tokens that were originally there. Essentially, this functions as a fill-in-the-blanks task for the model. When trained on such a task, a model can learn good representations for the various tokens in the vocabulary. Thanks to the attention mechanisms used in the Transformer architecture, these learned representations are also context-dependent. The outcome of this pre-training is a model that is capable of accurately modelling a language. In other words, it is a model that has an “understanding” of the different linguistic features and rules of the language.
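The fill-in-the-blanks setup can be sketched in a few lines. This is a simplified illustration, not what any particular library does internally: the `[MASK]` string, the default fraction of 15%, and the function name are assumptions, and real implementations typically add refinements on top of plain replacement.

```python
import random

MASK_TOKEN = "[MASK]"  # assumed mask token; the actual string depends on the model


def mask_tokens(tokens, mask_fraction=0.15, seed=None):
    """Replace a fraction of tokens with the mask token.

    Returns (masked_tokens, labels). labels holds the original token at each
    masked position and None everywhere else, so the model is only scored on
    the blanks it has to fill in.
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)
    # Always mask at least one token so short sequences still give a signal.
    num_to_mask = max(1, round(len(tokens) * mask_fraction))
    positions = set(rng.sample(range(len(tokens)), num_to_mask))
    masked = [MASK_TOKEN if i in positions else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return masked, labels
```

During training, the model receives `masked` as input and is penalized according to how poorly it predicts the tokens stored in `labels`.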

This pre-trained model can then be fine-tuned on a variety of NLP tasks such as classification, named entity recognition, and question answering. The knowledge gained in the pre-training stage can be applied successfully to the next task, as many of the language features remain constant between different tasks. This approach has two added advantages. First, the massive amount of computation spent in the pre-training stage is reused for all subsequent tasks, as pre-training only needs to be done once. Second, the data available for pre-training is far more abundant than what may be available for a given task, as any text in the proper language can be used for pre-training without the need for labelling.

When Transfer Learning Needs a Boost

At Skil.AI, we have used pre-trained language models with great success in a variety of different tasks and use cases. However, pre-trained language models can be comparatively less effective when the text is highly specialized and/or technical, as with health care data. In such cases, it can be helpful to further train the language model, using the same algorithm used in the pre-training stage (e.g. masked language modelling), on the text from the actual dataset that you are interested in. This is typically known as language model fine-tuning. Here, the idea is to have the model learn how to better represent the specialized language features of the dataset, including things like technical jargon, with which the pre-trained model might not be familiar. The SciBERT paper successfully demonstrates the effectiveness of this approach when used for tasks involving scientific terms and language.

If all of that sounded a little too jargon-heavy for you (let alone BERT), I promise that using this technique really isn’t that hard! At Skil.AI, we use the Simple Transformers library and its built-in support for language model fine-tuning whenever we want to teach a Transformer model a little bit of nerd-speak (a.k.a. specialized language feature stuff). We’ve shared a quick script below that should be enough to get you started! (Please refer to the library for installation instructions and additional features.)

We assume that you have combined all the text in your dataset into two text files, train.txt and test.txt, both of which can be found in the data/ directory. We’d also recommend building train.txt only from your training data and test.txt only from your test data, so that your model isn’t cheating by peeking at the test data.
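A minimal sketch of such a script is shown here. It assumes the `LanguageModelingModel` class from Simple Transformers and the `bert-base-cased` checkpoint; the choice of model, the two training options, the one-text-per-line file layout, and both function names are assumptions for illustration, so adjust them to your own setup.

```python
from pathlib import Path


def write_corpus(texts, path):
    """Write one text per line to a plain-text file (hypothetical helper).

    Whitespace inside each text is collapsed so that a single example never
    spans several lines, and empty texts are skipped.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    cleaned = [" ".join(text.split()) for text in texts if text.strip()]
    path.write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return path


def fine_tune_language_model(train_file="data/train.txt", eval_file="data/test.txt"):
    """Continue masked-language-model training on your own text."""
    # Imported lazily so write_corpus can be used without the library installed.
    from simpletransformers.language_modeling import LanguageModelingModel

    train_args = {
        "reprocess_input_data": True,   # rebuild features instead of using a cache
        "overwrite_output_dir": True,   # allow re-running into the same output dir
    }

    # Model type and checkpoint are assumptions; pick the one you plan to use later.
    model = LanguageModelingModel("bert", "bert-base-cased", args=train_args)

    # Train on the training text and evaluate on the held-out text.
    model.train_model(train_file, eval_file=eval_file)
    return model.eval_model(eval_file)
```

If your texts are still in memory, something like `write_corpus(train_texts, "data/train.txt")` and `write_corpus(test_texts, "data/test.txt")` would produce the two files, after which calling `fine_tune_language_model()` runs the fine-tuning. The resulting model can then be loaded for your downstream task in place of the original pre-trained checkpoint.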


That is pretty much all you need to get started with fine-tuning your own language models! We hope that you will find this technique as useful in your work as we have in ours!

I am a consultant in Deep Learning and AI-related technology. As part of the Deep Learning Research team, we work towards making AI accessible to small businesses and big tech alike. This article is aimed at sharing our knowledge.
