Over time, I have developed a few software packages. These were scattered all over the place, and I am currently in the process of moving them to GitHub. I will add more packages here once I have moved them.

N|uu conversion

This package was used in the development of the dictionary for N|uu (see this presentation), which can be found on SADiLaR's dictionary portal and in the Saasi Epsi app. A physical dictionary has also been developed. The dictionary was produced using a conversion script that takes data from field linguists as input. (Note that this package is specific to the data from this particular data collection.)

Sesotho Syllabifier

The Sesotho syllabifier takes Sesotho words as input and identifies the syllable boundaries in these words. Two systems have been implemented: a rule-based system and a simple machine learning system. The systems can be trained and tested on a Sesotho syllable wordlist. This is joint work with Johannes Sibeko.


DEMOCRAT

DEMOCRAT is a consensus machine translation (MT) system. The name stands for Deciding between Multiple Outputs Created by Automatic Translation. It takes the output of several MT systems and combines the translations into one, hopefully better, translation. How the system works internally, together with some results, has been published in "DEMOCRAT: Deciding between Multiple Outputs Created by Automatic Translation", Menno van Zaanen and Harold Somers, Proceedings of the 10th Machine Translation Summit, Phuket, Thailand, pp. 173-180, 2005.

Alignment-Based Learning

Alignment-Based Learning (ABL) is a symbolic grammar inference framework that has successfully been applied to several unsupervised machine learning tasks in Natural Language Processing (NLP). Given only sequences of symbols, a system that implements ABL induces structure by aligning and comparing the input sequences. As a result, the input sequences are augmented with the induced structure.
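The alignment idea can be illustrated with a small sketch. This is not the ABL package itself: it only aligns one pair of sentences using Python's standard difflib module and marks the parts that differ as hypothesised constituents. The function names align_pair and bracket, and the example sentences, are made up for this illustration.

```python
from difflib import SequenceMatcher


def align_pair(a, b):
    """Align two token lists and return the spans where they differ.

    Returns two lists of half-open (start, end) token spans, one per
    sentence. Assumption of this sketch: unequal parts of two sentences
    that also share some material are hypothesised constituents.
    """
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    opcodes = matcher.get_opcodes()
    # Without any shared tokens there is no context to learn from.
    if not any(tag == "equal" for tag, *_ in opcodes):
        return [], []
    spans_a, spans_b = [], []
    for tag, i1, i2, j1, j2 in opcodes:
        if tag == "equal":
            continue
        if i2 > i1:  # non-empty unequal part in the first sentence
            spans_a.append((i1, i2))
        if j2 > j1:  # non-empty unequal part in the second sentence
            spans_b.append((j1, j2))
    return spans_a, spans_b


def bracket(tokens, spans):
    """Render a token list with square brackets around the given spans."""
    out = []
    for i, tok in enumerate(tokens):
        opens = sum(1 for start, _ in spans if start == i)
        closes = sum(1 for _, end in spans if end == i + 1)
        out.append("[" * opens + tok + "]" * closes)
    return " ".join(out)


if __name__ == "__main__":
    s1 = "she sees the big house".split()
    s2 = "she sees a tree".split()
    spans1, spans2 = align_pair(s1, s2)
    print(bracket(s1, spans1))
    print(bracket(s2, spans2))
```

A full system would compare all pairs of sentences in a corpus and then select among overlapping hypotheses; this sketch stops after the alignment of a single pair.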


Suffix tree

The suffix tree package contains the implementation of a suffix tree in C++. Ukkonen's (1995) algorithm for building a suffix tree is implemented. The description of this algorithm was taken from chapter 6 in Gusfield (1997).
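To give an impression of what such a structure offers, here is a minimal sketch of a naive suffix trie. Note the assumptions: it is written in Python for brevity (the package itself is C++), it does not use Ukkonen's algorithm, and it takes quadratic time and space, whereas Ukkonen's algorithm builds a compressed suffix tree in linear time. The class and method names are invented for this example.

```python
class SuffixTrie:
    """Naive, uncompressed suffix trie built from nested dictionaries."""

    def __init__(self, text, terminator="$"):
        if terminator in text:
            raise ValueError("terminator must not occur in the text")
        # The unique terminator ensures every suffix ends in its own leaf.
        self.text = text + terminator
        self.root = {}
        for start in range(len(self.text)):
            node = self.root
            for ch in self.text[start:]:
                node = node.setdefault(ch, {})

    def _walk(self, pattern):
        """Follow pattern from the root; return the node reached or None."""
        node = self.root
        for ch in pattern:
            if ch not in node:
                return None
            node = node[ch]
        return node

    def contains(self, pattern):
        """True if pattern occurs as a substring of the text."""
        return self._walk(pattern) is not None

    def count(self, pattern):
        """Number of occurrences of pattern (leaves below its node)."""
        node = self._walk(pattern)
        if node is None:
            return 0
        leaves, stack = 0, [node]
        while stack:
            current = stack.pop()
            if not current:  # an empty dict is a leaf
                leaves += 1
            else:
                stack.extend(current.values())
        return leaves


if __name__ == "__main__":
    trie = SuffixTrie("banana")
    print(trie.contains("nan"), trie.count("ana"))
```

Every substring of the text is a prefix of some suffix, which is why a lookup only needs to walk down from the root.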


Chart parser

This package contains the implementation of a simple chart parser. There is nothing fancy here, but the package is well documented and the code is quite efficient. (No special optimization has been performed though.)
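As an illustration of the general idea of chart parsing, the sketch below implements a CYK recogniser for a grammar in Chomsky normal form. This is an assumption made for the example: the text above does not say which chart parsing algorithm the package uses, and the toy grammar, the function name cyk_parse, and its parameters are invented here.

```python
from collections import defaultdict


def cyk_parse(tokens, lexical, binary, start="S"):
    """CYK recognition over a chart of spans.

    lexical maps a word to the set of nonterminals that can produce it.
    binary maps a pair (left, right) of nonterminals to the set of
    nonterminals A that have a rule A -> left right.
    Returns (accepted, chart), where chart[(i, j)] holds the nonterminals
    that span tokens[i:j].
    """
    n = len(tokens)
    chart = defaultdict(set)
    # Fill the cells for single words.
    for i, tok in enumerate(tokens):
        chart[i, i + 1] |= lexical.get(tok, set())
    # Combine smaller spans into larger ones, shortest spans first.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):  # split point
                for left in chart[i, k]:
                    for right in chart[k, j]:
                        chart[i, j] |= binary.get((left, right), set())
    return start in chart[0, n], chart


# Toy grammar, purely for illustration.
LEXICAL = {"she": {"NP"}, "sees": {"V"}, "the": {"Det"}, "house": {"N"}}
BINARY = {("NP", "VP"): {"S"}, ("V", "NP"): {"VP"}, ("Det", "N"): {"NP"}}

if __name__ == "__main__":
    accepted, chart = cyk_parse("she sees the house".split(), LEXICAL, BINARY)
    print(accepted, sorted(chart[1, 4]))
```

The chart stores each partial analysis once per span, so a sub-analysis that is shared by several larger analyses is never recomputed.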