Interview in the Poki-podcast: "The Dutch Language Model: GEITje ft. Edwin Rijgersberg"

This week I had the honor of being a guest on Alexander Klöpping’s and Wietse Hage’s podcast: Poki – de Podcast over Kunstmatige Intelligentie. We had a good conversation about GEITje, about finetuning large language models in general, and about finetuning for Dutch in particular. We spoke for about half an hour, and the conversation ended up in the podcast practically unedited. Including what will have become a classic now: the Bassietest....

17 January 2024 · 1 min · Edwin Rijgersberg

GEITje FAQs: Why the name "GEITje"?

The second in a series of posts about questions I get about GEITje. “Why the name GEITje?” Muppets, Cows, and Seals The name “GEITje” had actually been in the back of my head for a long time as the name for a Dutch large language model. Naming in the world of language models is subject to interesting trends. In 2017, the Muppet generation of language models started with Allen Institute for AI’s ELMo, followed by Google’s breakthrough BERT....

3 January 2024 · 3 min · Edwin Rijgersberg

GEITje FAQs: Why I trained GEITje

The first in a series of posts about questions I’ve gotten about GEITje. “Why did you create a language model?” I have received this question several times in recent weeks. Usually immediately followed by a follow-up question: “Doesn’t ChatGPT already exist?” Not a strange question, actually. Here are my three main reasons: 1. Because open models are needed ChatGPT performs great in Dutch. If you have an application where you want to try an LLM, definitely go for ChatGPT or one of the OpenAI APIs....

2 January 2024 · 6 min · Edwin Rijgersberg

GEITje 7B: A Large Open Dutch Language Model

It has been more than two weeks since I open-sourced GEITje 7B. It was an exciting moment, especially since this is my first major open source contribution. But I am very pleased to see how enthusiastic all the reactions have been! GEITje is a large open Dutch language model with 7 billion parameters, based on Mistral 7B. It has been further trained on 10 billion tokens of Dutch text. This has improved its Dutch language skills and increased its knowledge of Dutch topics....

2 January 2024 · 2 min · Edwin Rijgersberg

Left behind: why the Dutch language is absent from Europe's foremost open language model

Three volunteers. A couple of weeks of work. That’s what it took to add a language to BigScience BLOOM, the open multilingual language model with no fewer than 176 billion parameters that was released mid-2022. It aimed to become an open and multilingual alternative to GPT-3. In the end, 46 languages from all over the world made it into the dataset BLOOM was trained on. Even relatively small languages like Basque and Catalan managed to be included....

18 September 2023 · 10 min · Edwin Rijgersberg

My talk at EuroPython 2023: "Threat to Life — Preventing Planned Murders with Python"

I can’t often publicly share details about the kind of projects we undertake at the Netherlands Forensic Institute with the help of AI, but at the recent EuroPython 2023 in Prague, I was able to discuss a case that unfolded a few years ago and on which the NFI had previously issued a press release: the Threat-to-Life project. Police could read along with criminals In 2020, the police managed to read live messages from a provider of so-called cryptophones: modified phones that — for a substantial payment — were used for encrypted communication in the criminal circuit....

11 September 2023 · 4 min · Edwin Rijgersberg