At the pressing request of Stichting BREIN, GEITje is no longer available as of today. All model files have been removed from my HuggingFace repositories.¹
GEITje was an open Dutch-language large language model with 7 billion parameters, based on Mistral 7B. It was further trained on 10 billion tokens of Dutch text, improving its proficiency in Dutch and its knowledge of Dutch-specific topics.
As stated in the README, GEITje was partially trained in late 2023 using portions of the Dutch Gigacorpus. Stichting BREIN claims that some subsets of the Gigacorpus contain copyrighted material obtained from illegal sources. For this reason, they had the entire Gigacorpus taken offline in August 2024.
BREIN has informed me that, in their view, current laws and regulations dictate that the model GEITje must also be taken offline. Copyright experts have assured me that this issue is not as black and white as claimed. However, they also acknowledge that many legal questions regarding this matter remain unanswered in Europe. I cannot afford to engage in a lengthy and costly legal battle to resolve these issues. After all, GEITje was a non-commercial, scientific hobby project. For this reason, I am complying with BREIN’s request.
Since GEITje’s release, scientific papers have been published using GEITje to study large language models in Dutch. I had hoped GEITje would remain available to researchers to ensure the scientific reproducibility of their studies. Unfortunately, discussions with BREIN on this matter have led nowhere.
I am grateful for the many positive responses I have received over the past year. It has also been wonderful to see how GEITje has inspired so many people. GEITje has demonstrated that a viable alternative, originating from Dutch and Flemish efforts, can exist alongside the closed language models of foreign tech giants. GEITje is no longer alone: open Dutch-language LLMs now exist in many forms and flavors, trained on a variety of different sources.
In my view, the future of European AI still lies in open-source AI. Only when AI is free to use, can be studied by everyone, and is freely available to modify and share for any purpose can we truly speak of sovereign AI. The French and Spanish governments have already paved the way, training fully open-source models with public funding. The path to a truly open-source Dutch-language AI landscape is still open for us to take.
Addendum 27 January 2025: BREIN has now published a press release.
This post was translated from the original Dutch with the help of GPT-4o.
-
¹ All .safetensors files (the weights of the model) of GEITje-7B and of all derived chat models trained by me have been deleted, including those of all intermediate training checkpoints. The optimizer.pt files of the checkpoints have also been deleted. Additionally, all conversions of the models that were made by me (such as .gguf files) have been deleted.