With the rise of Large Language Models (LLMs) such as ChatGPT, Artificial Intelligence (AI) has become a frequent discussion topic and is at the center of attention in many companies. There is no way around it anymore: AI is influencing the lives of people worldwide and will only become more influential in the coming years. With enormous amounts of funding going into AI research, progress is being made so quickly that working with the state of the art is very exciting. Things that were not possible a couple of months ago are possible now, and things that are not possible now will be possible in a couple of months.
One of the fields that has received a lot of attention in the past months is the fine-tuning of LLMs. During my three-month internship at Squadra Machine Learning Company I worked on exactly that topic. The aim of my internship was to fine-tune a generative model so that it can extract features, and the values of those features, from a product description. A previous approach to this problem extracted features using Named Entity Recognition (NER), but that method did not provide sufficient results. Now that LLMs have become very good at a wide range of Natural Language Processing (NLP) tasks, extracting features with them should be possible. In this blog post I will go through the steps I took to fine-tune a generative model for this task.
But what does fine-tuning a generative model actually do? Fine-tuning a pre-trained generative model customizes it so that it performs a specific task very well. The base models are trained on a broad variety of texts and are good at answering general questions or having conversations, but extracting features from a product description can be a difficult task for such a model. That is where fine-tuning helps: the model is trained to perform a single task very well, while keeping the knowledge of the original model.
Picking a model
The first step was to pick a pre-trained model to fine-tune. There are thousands of open-source models to choose from, each with its own strengths and weaknesses. One of the aims of my internship was to do everything on a consumer (16GB) GPU. This immediately rules out many models, simply because they are too large to fit on a 16GB GPU. I found that with some tricks, which I will come back to later in this post, it was possible to fine-tune models with up to 7 billion parameters. In comparison, OpenAI's GPT-4 reportedly has around 1.76 trillion parameters. The performance of these 'small' LLMs has improved enormously over the past couple of months, and numerous top-performing models are available open source, for anyone to use. Many of these models can be found on the Hugging Face website. Hugging Face is a company that has developed a range of tools that make your life easier when building these kinds of applications. At the start of my internship the two best-performing 7B models were Mistral AI's Mistral-7B and Meta's Llama-2-7B, so my plan was to fine-tune these models.
Creating a dataset
To fine-tune a model, you need to show it examples of what it should generate, so that it can adjust its weights accordingly. I created a dataset of such prompts and split it into training, validation, and test data. In my case, after some experimenting, I ended up with a prompt format that uses hashtags as separators between the different parts of the prompt.
The hashtags act as separators, which makes it easier for the model to learn the structure of the prompt; this seemed to work very well for my use case.
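To give an idea of what this looks like in practice, here is a minimal sketch of how such a dataset could be built. The template fields, the example products, the use of the literal string "None" for missing values, and the 80/10/10 split are illustrative placeholders rather than the exact format I used:

```python
import random

# Hypothetical raw data: (product description, feature name, feature value or None)
products = [
    ("Wooden dining table, oak, 180 cm long.", "material", "oak"),
    ("Wooden dining table, oak, 180 cm long.", "color", None),
    # ...
]

def build_prompt(description, feature, value):
    # Hashtags separate the parts of the prompt so the model can learn its structure.
    target = value if value is not None else "None"  # placeholder for "no value" cases
    return (
        f"### Product description:\n{description}\n"
        f"### Feature:\n{feature}\n"
        f"### Value:\n{target}"
    )

prompts = [build_prompt(d, f, v) for d, f, v in products]

# Simple train/validation/test split (ratios are arbitrary here).
random.seed(42)
random.shuffle(prompts)
n = len(prompts)
train = prompts[: int(0.8 * n)]
validation = prompts[int(0.8 * n) : int(0.9 * n)]
test = prompts[int(0.9 * n) :]
```

During training the model would see the full prompt including the value; at inference time the same template would be cut off right after the value separator, so the model generates the value itself.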
Fine-tuning
To fine-tune the pretrained model, it first has to be loaded. Running a 7B model in full precision requires around 28 GB of GPU memory. To avoid out-of-memory issues, the model can be loaded with quantization to decrease its size. The bitsandbytes library allows you to load a model in either 8-bit or 4-bit precision. With 4-bit quantization, around 7 GB of memory is enough for the model plus the activations and attention cache, which fits on a consumer GPU.

After the model is loaded, a tokenizer has to be created as well. Tokenizers can also be loaded from Hugging Face and are stored together with the pretrained model. The tokenizer makes sure the inputs to the model are in the correct form: input prompts are first tokenized into sub-words, which are then converted to ids. The tokenizer also adds padding tokens to make the inputs the same length, which is essential for training with batches of prompts of different lengths.

Once all the inputs have been tokenized, fine-tuning can start. Fine-tuning all 7 billion parameters of the model requires too much compute and time to be done on a consumer GPU. One way to reduce the resources needed is the Parameter-Efficient Fine-Tuning (PEFT) library, and more specifically Low-Rank Adaptation (LoRA). LoRA freezes the original weights of the network and trains small low-rank adapter matrices on top of them, so that only a small percentage of the total number of parameters is updated. After setting up LoRA, the training data is fed to the model using the Trainer class developed by Hugging Face. The resulting adapter can then be loaded on top of the original model, and the combined model performs similarly to a fully fine-tuned one. Proper hyperparameter tuning was difficult, because checking the validation loss multiple times for every combination of settings would take far too long, so this had to be done by hand, experimentally.
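Put together, the setup looks roughly like the sketch below, using the transformers, bitsandbytes, peft, and datasets libraries. The chosen base model, the hyperparameters, and the output directory are illustrative defaults rather than the exact values I ended up with; the `train` and `validation` lists come from the dataset sketch earlier.

```python
import torch
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "mistralai/Mistral-7B-v0.1"  # one of the 7B base models mentioned above

# Load the base model in 4-bit so it fits on a 16GB consumer GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# The tokenizer is stored alongside the pretrained model on the Hub.
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # needed for padding batches of prompts

# Tokenize the prompt lists built earlier (`train` and `validation`).
def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=512)

train_ds = Dataset.from_dict({"text": train}).map(tokenize)
val_ds = Dataset.from_dict({"text": validation}).map(tokenize)

# LoRA: freeze the base weights and train small low-rank adapter matrices instead.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters

# Feed the tokenized prompts to the model with the Trainer class.
training_args = TrainingArguments(
    output_dir="feature-extraction-lora",  # the adapter weights end up here
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=2e-4,
    logging_steps=10,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```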
Results
I created a test set which, just like the training set, consisted of 50% features that had a value and 50% features that had no value. With this setup, the best-performing models reach an accuracy of up to 94% on an independent test set, with around 8% of the features being classified incorrectly by the model and around 4% of the instances where the model should have given no value being given one anyway. These results are much better than the previous NER method, which reached around 80% accuracy. Another interesting finding is that not much training data is needed to get these results; about 100 training examples were sufficient for fine-tuning in my case. There is only one problem: the resulting model is too slow to be used in practice. It takes about two seconds to predict one feature, so if there are 50 different features in a dataset it would take around two and a half minutes to finish. I spent the last few weeks of my internship trying to increase this speed.
Speeding up inference
The main reason the model takes so long per feature is that it has to go through the entire prompt before it can predict the feature value, and the prompt is quite long for a single feature. To decrease the amount of text the model has to process each time, I implemented hidden state caching during inference. This technique can be used when the prompts used during inference share the same sequence at the start. With hidden state caching, you do a single forward pass through the model on the shared first part of the prompt, save the hidden states, and then for each feature the model only has to go through the short feature-specific part at the end, which saves about half of the reading time.
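A rough sketch of this idea using the key/value cache in transformers is shown below; the prompt strings, feature names, and the ten-token decoding limit are illustrative assumptions, and `model` and `tokenizer` are the fine-tuned model and tokenizer from the training step.

```python
import copy
import torch

# Placeholder prompts, not the exact format used during the internship.
shared_prefix = "### Product description:\nWooden dining table, oak, 180 cm long.\n"
features = ["material", "color", "length"]

device = model.device
prefix_ids = tokenizer(shared_prefix, return_tensors="pt").input_ids.to(device)

# One forward pass over the shared prefix, keeping the key/value cache.
with torch.no_grad():
    prefix_cache = model(prefix_ids, use_cache=True).past_key_values

for feature in features:
    suffix = f"### Feature:\n{feature}\n### Value:\n"
    # add_special_tokens=False so no extra BOS token ends up mid-prompt.
    suffix_ids = tokenizer(
        suffix, return_tensors="pt", add_special_tokens=False
    ).input_ids.to(device)

    past = copy.deepcopy(prefix_cache)  # each feature starts from the clean prefix state
    input_ids = suffix_ids
    generated = []
    for _ in range(10):  # short greedy decode; feature values are only a few tokens
        with torch.no_grad():
            out = model(input_ids, past_key_values=past, use_cache=True)
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        past = out.past_key_values
        if next_id.item() == tokenizer.eos_token_id:
            break
        generated.append(next_id.item())
        input_ids = next_id

    print(feature, "->", tokenizer.decode(generated, skip_special_tokens=True))
```

The copy keeps the cached prefix untouched between features; the speed-up comes from not re-reading the product description for every single feature.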
After this, I trained a model to predict multiple features at once with a single prompt. This should also work especially well in combination with hidden state caching, since only the shared first part of the prompt gets longer while the length of the feature-specific part stays the same. This did work, but more testing needs to be done on how much the inference speed actually increases and on the potential loss in accuracy.
Another way I tried to decrease inference time was to use a smaller model. A couple of months into my internship, models with 3 billion parameters became increasingly popular and showed performance on par with many 7B models. I experimented a bit with Microsoft's Phi-2 model and got results similar to those of the Mistral-7B model. These 3B models already speed up inference by roughly a factor of two compared to the 7B models. One final thing I wanted to try was the vLLM library, which according to some sources can speed up inference by up to 20 times compared to the method I was using. However, the 3B model I was using was not yet supported by vLLM when my internship ended, so I did not get to try this. To conclude, feature extraction using fine-tuned generative models is definitely possible, but currently too slow to be used in practice. My experience is that there is a lot of research on developing these models, but less on the actual implementation side.
Mik van der Drift
Intern Data Science at Squadra Machine Learning Company