The projects I have been involved in can be roughly divided into a few groups.

Identifying patterns in sequences

I am particularly interested in developing systems that can learn patterns in sequences. Where possible, these systems are unsupervised. For instance, my PhD research focused on unsupervised learning of syntax in natural language sentences. I developed Alignment-Based Learning (ABL), which is such an unsupervised, language-independent system. As examples of additional work, I would like to mention Herman Stehouwer's PhD research, which developed language models for natural language error correction, Rianne Conijn's PhD research, which focused on identifying patterns in keystroke sequences, and Parisa Shayan's PhD research, which investigated patterns in human behavior in the context of educational systems. Furthermore, I am involved in research (Nuette Heyns' PhD research) that tries to automatically identify patterns in literary texts.

The output of such systems can also be applied to other tasks, for example classification. I've done some work on automatic classification of music, where relevant patterns are learned from sequences of notes or lyrics. Patterns in questions can be used to improve the classification of questions for question answering purposes. Systems that identify patterns can also be used to classify empathy based on sequences of gestures.

Low-resource languages

At the moment, large language models are very popular and show amazing behavior. Unfortunately, these (currently) require huge amounts of training data. Such data is simply not available for low-resource languages. (Note that by low-resource languages I mean languages that do not have large amounts of language data in digital form. This does not necessarily imply that these languages have a small number of speakers or that they are any less than the "large" languages.) Specific approaches will need to be developed for languages that do not have large amounts of digital data.

The PhD work of Johannes Sibeko develops readability metrics for Sesotho (one of the eleven official languages of South Africa). For this to work, additional tools need to be developed, such as a syllabification system for Sesotho.

I helped develop a dictionary (in physical form, as a website, and as an app) for the N|uu language. At the time of data collection (which was done over some 20 years), only a few fluent speakers of the language were still alive.

I was also involved in the TraMOOC project, which aimed at collecting data suitable for training machine translation systems for languages that did not have much data available. In particular, a crowdsourcing approach was taken to see if the quality of machine translation systems could be improved even if the (collected) training data was not necessarily of extremely high quality.

Some additional projects in this area relate to the annotation and training of tools that can identify noun compound boundaries in Dutch and Afrikaans, and the development of a part-of-speech tagger for Dutch tweets.

Additional

There are a few other topics that I am interested in. This research is not easily categorized, so I have placed it under the "Additional" header. That does not mean that I find it any less interesting.

Alexandra Sierra's PhD research looked at the influence of virtual characters in the context of teaching within virtual and augmented reality.