The first in a series of posts about questions I’ve gotten about GEITje.
“Why did you create a language model?” I have been asked this several times in recent weeks, usually followed immediately by: “Doesn’t ChatGPT already exist?” Not a strange question, actually. Here are my three main reasons:
1. Because open models are needed
ChatGPT performs great in Dutch. If you have an application where you want to try an LLM, definitely go for ChatGPT or one of the OpenAI APIs. They are good and cheap, and you can whip up a solution in no time. And if you have a challenging use case, you can now even fine-tune their models on your own data. But there are also disadvantages.
Firstly, there are always cases where you do not want to, or are not allowed to, send your data to OpenAI. Careful handling of data is important, especially with regard to the GDPR and the upcoming AI Act. You therefore soon end up limited to models that you can run locally on your own infrastructure, to keep your private data private.
Secondly, OpenAI, despite the increasingly ironic company name, is not very open about its models at all. They do not tell you what sources their models are trained on, what kind of filtering they apply, or even how large their models are. The only thing you get is a black box on their servers that you can talk to, and a technical report without meaningful technical details. If you want to do research into the model, for example to determine whether it has biases that matter for your application, then you’re out of luck. The only thing you can do is send text to the black box and study the text you get back; all more advanced options are off the table. With an open model, you can examine it in as much detail as you want.
Finally, open models offer an opportunity to build on each other’s work, and to give something back. This way, you collectively reach a much higher level than anyone could reach on their own, and everyone benefits from that.
Open models are therefore needed, and good news: they are booming in 2023! But Dutch was unfortunately lagging behind, as the open-source world mainly focused on English, Chinese, and programming languages. A familiar story for those who have read my earlier blog post about BLOOM, although this time there were initiatives already heading in the right direction.
2. Because we can
What is needed to create a Dutch language model? And is it possible to do as a hobby project? Well, no, not if you want to build it from the ground up. Meta used 184,320 GPU hours to train Llama 2 7B, which consumed about 74,000 kWh. Such budgets are out of reach for the GPU-poor. And suppose you do manage to conjure up free computing capacity somewhere. Then you still need to collect 2,000 billion tokens of Dutch text to match the amount of text that Meta used for Llama 2. And once you have trained a foundation model, you also have to create (or have someone else create) tens of thousands of conversations to turn your foundation model into a real chatbot.
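As a rough sanity check on those numbers, here is a minimal sketch of how the energy figure relates to the GPU hours. The 400 W average draw per GPU is an assumption on my part, not a figure given above, and the function name is purely illustrative.

```python
# Figure from the text: GPU hours Meta reported for training Llama 2 7B.
GPU_HOURS = 184_320

# ASSUMPTION (not stated in the text): each GPU draws about 400 W on average.
WATTS_PER_GPU = 400


def training_energy_kwh(gpu_hours: float, watts_per_gpu: float) -> float:
    """Energy in kWh = GPU hours multiplied by the power per GPU in kW."""
    return gpu_hours * watts_per_gpu / 1000


energy_kwh = training_energy_kwh(GPU_HOURS, WATTS_PER_GPU)

# For scale: how long a single GPU running non-stop would need.
single_gpu_years = GPU_HOURS / (24 * 365)

print(f"{energy_kwh:,.0f} kWh")
print(f"{single_gpu_years:.1f} years on a single GPU")
```

Under that assumption the result lands close to the 74,000 kWh mentioned above, and it shows why a single rented GPU is not going to cut it: the same job would take roughly two decades.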
But what if you make some smart and practical choices? What if you don’t start from scratch? What if you build on existing open-source models, such as Llama 2 or Mistral? Although it was probably not intentional, enough Dutch text apparently slipped into their training data for these models to speak passable Dutch. Not enough to hold a longer coherent conversation, and their knowledge of Dutch or Flemish subjects is very limited, but it is certainly better than starting with nothing. The only thing you have to give up is some transparency: unfortunately, little is known about what data these open-source models are trained on.
If you start with such a pre-trained model, then you don’t need such gigantic amounts of text to train your language model. GEITje is trained on 10 billion tokens of Dutch text. Material is relatively easy to come by up to a few hundred billion tokens. Datasets of chat conversations to train a chatbot are barely available for Dutch, but they are abundantly available in English. By using GPT-3.5 as a translator, you can convert ten thousand conversations from English to Dutch for less than 100 euros.
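To put those budgets side by side, here is a small back-of-the-envelope sketch. It only uses the figures mentioned above; the variable names are illustrative, and the per-conversation cost is an upper bound derived from "less than 100 euros", not a measured price.

```python
# Figures from the text.
LLAMA2_TOKENS = 2_000 * 10**9   # tokens Meta used for Llama 2
GEITJE_TOKENS = 10 * 10**9      # Dutch tokens GEITje is trained on
CONVERSATIONS = 10_000          # chat conversations translated with GPT-3.5
TRANSLATION_BUDGET_EUR = 100    # "less than 100 euros", so an upper bound

# Share of the from-scratch token budget needed when starting from a
# pre-trained model.
token_share = GEITJE_TOKENS / LLAMA2_TOKENS

# Upper bound on the translation cost per conversation.
max_eur_per_conversation = TRANSLATION_BUDGET_EUR / CONVERSATIONS

print(f"Token budget: {token_share:.1%} of a from-scratch run")
print(f"Translation: under {max_eur_per_conversation:.2f} euro per conversation")
```

In other words, continuing from a pre-trained model needs about half a percent of the text, and translating a conversation costs at most about a cent.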
Then you need to get GPUs. Yes, they are extremely expensive to buy. And they are also expensive to rent from common cloud providers like AWS or Azure. But if you go for a cloud provider that has made GPUs its specialty, it can all be a lot cheaper. Providers like Lambda Labs and RunPod can be up to 80% cheaper. That is, if you manage to get your hands on a GPU at all, as they are often all occupied. More on that in a later blog post.
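To make "up to 80% cheaper" concrete, here is a minimal sketch of the arithmetic. The hourly price and the run size are made-up placeholders, not real quotes from any provider and not GEITje's actual training run; only the 80% figure comes from the text.

```python
def run_cost(gpu_count: int, hours: float, price_per_gpu_hour: float) -> float:
    """Total rental cost: number of GPUs x hours x price per GPU hour."""
    return gpu_count * hours * price_per_gpu_hour


# HYPOTHETICAL placeholders, chosen only to illustrate the arithmetic.
MAINSTREAM_PRICE = 4.00   # price per GPU hour at a general-purpose cloud
GPU_COUNT = 8
HOURS = 48

# From the text: specialised providers can be up to 80% cheaper.
MAX_DISCOUNT = 0.80
specialist_price = MAINSTREAM_PRICE * (1 - MAX_DISCOUNT)

mainstream_total = run_cost(GPU_COUNT, HOURS, MAINSTREAM_PRICE)
specialist_total = run_cost(GPU_COUNT, HOURS, specialist_price)

print(f"General-purpose cloud: {mainstream_total:,.2f}")
print(f"Specialised provider:  {specialist_total:,.2f}")
print(f"Saved:                 {mainstream_total - specialist_total:,.2f}")
```

Whatever the real prices are at the moment you rent, the best-case saving scales linearly with the size of the run, which is why the choice of provider matters more the longer you train.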
3. Because it is fun and educational
The third reason is perhaps the most important one. It’s just an incredibly fun and educational project!
I have been interested in language models for a long time, and that interest has only grown in the past year with the breakthrough of LLMs. Reading about them is educational, and applying them even more so. But nothing gives you more insight into a model than having to train it yourself.
To make GEITje, I had to delve into the various foundation models and their pros and cons. I had to explore the quality of datasets, and I had to make decisions about selecting data. I had to find out what the different ways of evaluating models are, and which ones are applicable to Dutch. I had to parse datasets consisting of gigantic text files and split them into separate documents. I had to write training code, which gave me more insight into the details of 🤗 Hugging Face Transformers, Accelerate, and Datasets. I had to write and maintain an ever-growing README. I had to experiment with different ways of training on multiple GPUs at the same time, to come up with the most cost-efficient method. For the first time, I was able to see training graphs live in the cloud at Weights & Biases. I modified a Gradio interface to offer a live demo of GEITje chat.
And finally, I simply had to debug training code and get it working in the cloud on rented GPUs. It’s quite an experience to solve a problem while watching the minutes tick away not only on the clock but also directly on your own credit card.
This post was translated from the original Dutch with the help of GPT-4.