Posts

GPT-NL as a Dutch Sovereign Alternative? The Preliminary Benchmarks Don't Look Good

This blog post was automatically translated from the original Dutch using Mistral Vibe. Sovereign AI has never been more relevant. The US government has subjected Anthropic’s latest LLMs to export restrictions, even after the sharp edges of Mythos had already been smoothed out in the form of Fable. In addition to sparking a wave of entertaining Le Chaton Fat memes in Europe, this has reignited a serious discussion: how can we leverage the positive aspects of AI technology without once again becoming fully dependent on foreign parties?...

Take a Page from the French: The Netherlands Needs Data More Than GPUs

“Data!” I always give the same answer when I’m asked what kind of investments are needed to further support Dutch-language AI. Data. GPUs are nice, but for that I don’t need to be in Groningen. I can already make use of existing European “AI factories” such as the MareNostrum5 supercomputer in Barcelona. What we really need is high-quality, large-scale Dutch-language data. And that’s exactly what they don’t have in Barcelona....

The end of GEITje 1

At the pressing request of Stichting BREIN, GEITje is no longer available as of today. All model files have been removed from my HuggingFace repositories1. GEITje was a Dutch-language large open language model with 7 billion parameters, based on Mistral 7B. It was (further) trained on 10 billion tokens of Dutch text, improving its proficiency in Dutch and its knowledge of Dutch-specific topics. As stated in the README, GEITje was partially trained in late 2023 using portions of the Dutch Gigacorpus....

Interview in the Poki-podcast: "The Dutch Language Model: GEITje ft. Edwin Rijgersberg"

This week I was honored to star as a guest in Alexander Klöpping’s en Wietse Hage’s podcast: Poki – de Podcast over Kunstmatige Intelligentie. We had a good converstation about GEITje, about finetuning Large Language models in general and finetuning for Dutch in particular. We spoke for about half an hour, and the conversation ended practically without edits in the podcast. Including what will have become a classic now: the Bassietest....

GEITje FAQs: Why the name "GEITje"?

The second in a series of posts about questions I get about GEITje. “Why the name GEITje?” Muppets, Cows, and Seals The name “GEITje” had actually been in the back of my head for a long time as the name for a Dutch large language model. Naming in the world of language models is subject to interesting trends. In 2017, the Muppet generation of language models started with Allen Institute for AI’s ELMo, followed by Google’s breakthrough BERT....

GEITje FAQs: Why I trained GEITje

The first in a series of posts about questions I’ve gotten about GEITje. “Why did you create a language model?” I have received this question several times in recent weeks. Usually immediately followed by a follow-up question: “Doesn’t ChatGPT already exist?” Not a strange question, actually. Here are my three main reasons: 1. Because open models are needed ChatGPT performs great in Dutch. If you have an application where you want to try a LLM, definitely go for ChatGPT or one of the OpenAI APIs....

GEITje 7B: A Large Open Dutch Language Model

It has been more than two weeks since I open-sourced GEITje 7B. It was an exciting moment, especially since this is my first major open source contribution. But I am very pleased to see how enthusiastic all the reactions have been! GEITje is a large open Dutch language model with 7 billion parameters, based on Mistral 7B. It has been further trained on 10 billion tokens of Dutch text. This has improved its Dutch language skills and increased its knowledge of Dutch topics....

Left behind: why the Dutch language is absent from Europe's foremost open language model

Three volunteers. A couple of weeks of work. That’s what it took to add a language to BigScience BLOOM, the open multilingual language model with no fewer than 176 billion parameters that was released mid-2022. It aimed to become an open and multilingual alternative to GPT-3. In the end, 46 languages from all over the world made it into the dataset BLOOM was trained on. Even relatively small languages like Basque and Catalan managed to be included....

My talk at EuroPython 2023: "Threat to Life — Preventing Planned Murders with Python"

I can’t often publicly share details about the kind of projects we undertake at the Netherlands Forensic Institute with the help of AI, but at the recent EuroPython 2023 in Prague, I was able to discuss a case that unfolded a few years ago and on which the NFI had previously issued a press release: the Threat-to-Life project. Police could read along with criminals In 2020, the police managed to read live messages from a provider of so-called cryptophones: modified phones that — for a substantial payment — were used for encrypted communication in the criminal circuit....