Software

Thomas Proisl

SoMaJo

SoMaJo is a state-of-the-art tokenizer and sentence splitter for German web and social media texts. It won the EmpiriST 2015 shared task on automatic linguistic annotation of computer-mediated communication / social media and is therefore particularly well suited to tokenizing all kinds of written discourse, for example chats, forums, wiki talk pages, tweets, blog comments, social networks, SMS and WhatsApp dialogues.

Developed in collaboration with Peter Uhrig.

https://github.com/tsproisl/SoMaJo

SoMeWeTa

SoMeWeTa is a part-of-speech tagger that supports domain adaptation and that can incorporate external sources of information such as Brown clusters and lexica. It is based on the averaged structured perceptron and uses beam search and an early update strategy.
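
To make the terms above concrete, here is a minimal sketch of an averaged structured perceptron that decodes with beam search and applies early updates during training. It is not SoMeWeTa's code: the class name, the tiny feature set and all parameters are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def features(words, i, prev_tag, tag):
    """Toy feature set (an assumption, far smaller than a real tagger's)."""
    w = words[i].lower()
    return [("word", w, tag), ("suffix", w[-2:], tag), ("prev", prev_tag, tag)]


class BeamPerceptronTagger:
    def __init__(self, tags, beam_size=2):
        self.tags = list(tags)
        self.beam_size = beam_size
        self.weights = defaultdict(float)  # current weights
        self.sums = defaultdict(float)     # running sum of weights, for averaging
        self.seen = 0                      # number of training examples processed

    def _search(self, words, weights, gold=None):
        """Beam search. If gold tags are given, stop as soon as the gold
        prefix is no longer in the beam (the trigger for an early update)."""
        beam = [(0.0, [])]
        for i in range(len(words)):
            candidates = []
            for score, seq in beam:
                prev = seq[-1] if seq else "<s>"
                for tag in self.tags:
                    gain = sum(weights.get(f, 0.0)
                               for f in features(words, i, prev, tag))
                    candidates.append((score + gain, seq + [tag]))
            candidates.sort(key=lambda c: c[0], reverse=True)
            beam = candidates[:self.beam_size]
            if gold is not None and all(seq != gold[:i + 1] for _, seq in beam):
                break
        return beam[0][1]

    def _update(self, words, gold, pred):
        """Reward gold features, penalize predicted ones, up to len(pred)."""
        prev_g = prev_p = "<s>"
        for i, (g, p) in enumerate(zip(gold, pred)):
            for f in features(words, i, prev_g, g):
                self.weights[f] += 1.0
            for f in features(words, i, prev_p, p):
                self.weights[f] -= 1.0
            prev_g, prev_p = g, p

    def train(self, sentences, epochs=5):
        for _ in range(epochs):
            for words, gold in sentences:
                gold = list(gold)
                pred = self._search(words, self.weights, gold)
                if pred != gold[:len(pred)]:
                    self._update(words, gold, pred)
                # Accumulate the weights after every example for averaging.
                for f, w in self.weights.items():
                    self.sums[f] += w
                self.seen += 1

    def tag(self, words):
        """Tag with the averaged weights (falls back to raw weights if untrained)."""
        if self.seen:
            weights = {f: s / self.seen for f, s in self.sums.items()}
        else:
            weights = self.weights
        return self._search(words, weights)


training_data = [
    (["the", "dog", "barks"], ["DET", "NOUN", "VERB"]),
    (["a", "cat", "sleeps"], ["DET", "NOUN", "VERB"]),
    (["the", "cat", "barks"], ["DET", "NOUN", "VERB"]),
]
tagger = BeamPerceptronTagger(["DET", "NOUN", "VERB"], beam_size=2)
tagger.train(training_data, epochs=5)
```

Summing the weights after every example keeps the averaging easy to follow; a real implementation would use a lazier bookkeeping scheme to avoid touching every weight each time.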

SoMeWeTa achieves state-of-the-art results on the German web and social media texts from the EmpiriST 2015 shared task on automatic linguistic annotation of computer-mediated communication / social media. It is therefore particularly well suited to tagging all kinds of written German discourse, for example chats, forums, wiki talk pages, tweets, blog comments, social networks, SMS and WhatsApp dialogues.

https://github.com/tsproisl/SoMeWeTa

Pareidoscope

The Pareidoscope is a collection of tools for determining the association between arbitrary linguistic structures, e.g. between words (collocations), between words and structures (collostructions) or between larger linguistic structures such as dependency graphs.
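
As an illustration of what "association" means in the simplest case (two words), the following sketch computes two widely used association measures, pointwise mutual information and the log-likelihood ratio, from a 2x2 contingency table. The function names are hypothetical, and nothing here is taken from the Pareidoscope code, which may use different measures and handles far more general structures.

```python
import math


def contingency_table(pair_freq, w1_freq, w2_freq, total):
    """Observed 2x2 table (o11, o12, o21, o22) for a word pair."""
    o11 = pair_freq                               # w1 and w2 together
    o12 = w1_freq - pair_freq                     # w1 without w2
    o21 = w2_freq - pair_freq                     # w2 without w1
    o22 = total - w1_freq - w2_freq + pair_freq   # neither
    return o11, o12, o21, o22


def pmi(o11, o12, o21, o22):
    """Pointwise mutual information (base 2) of the co-occurrence cell."""
    n = o11 + o12 + o21 + o22
    e11 = (o11 + o12) * (o11 + o21) / n  # expected frequency under independence
    return math.log2(o11 / e11) if o11 > 0 else float("-inf")


def log_likelihood(o11, o12, o21, o22):
    """Log-likelihood ratio statistic (G squared) over all four cells."""
    n = o11 + o12 + o21 + o22
    r1, r2 = o11 + o12, o21 + o22
    c1, c2 = o11 + o21, o12 + o22
    observed = (o11, o12, o21, o22)
    expected = (r1 * c1 / n, r1 * c2 / n, r2 * c1 / n, r2 * c2 / n)
    # Cells with an observed frequency of 0 contribute nothing to the sum.
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


# Made-up frequencies: the pair occurs 30 times, each word 100 times,
# in a sample of 1000 co-occurrence opportunities.
table = contingency_table(30, 100, 100, 1000)
```

Both measures are 0 when the observed co-occurrence frequency equals the frequency expected under independence, and grow as the pair co-occurs more often than expected.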

https://github.com/tsproisl/Pareidoscope

Treebank.info

Treebank.info is an intuitive, graphical web interface that allows linguists to search for dependency structures in corpora.

Developed in collaboration with Peter Uhrig.

http://treebank.info

Linguistic and Stylistic Complexity

This project is a collection of measures that assess the linguistic and stylistic complexity of (literary) texts.
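
To give an idea of what such a measure can look like, here is a minimal sketch of two common indicators of lexical variability: the type-token ratio and a windowed variant that is less sensitive to text length. The function names and the window size are assumptions made for this example and are not taken from the project.

```python
def type_token_ratio(tokens):
    """Number of distinct tokens divided by the total number of tokens."""
    if not tokens:
        raise ValueError("need at least one token")
    return len(set(tokens)) / len(tokens)


def mean_segmental_ttr(tokens, window=100):
    """Mean type-token ratio over consecutive, non-overlapping windows.

    A trailing window shorter than `window` is discarded.
    """
    n_windows = len(tokens) // window
    if n_windows == 0:
        raise ValueError("text is shorter than one window")
    ratios = [
        type_token_ratio(tokens[i * window:(i + 1) * window])
        for i in range(n_windows)
    ]
    return sum(ratios) / len(ratios)


sample = "the cat sat on the mat".split()
ttr = type_token_ratio(sample)  # 5 distinct tokens out of 6
```

The plain ratio falls as texts get longer, which is why windowed variants are often preferred when comparing texts of different lengths.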

Developed in collaboration with other people working on the Kallimachos project.

https://github.com/tsproisl/Linguistic_and_Stylistic_Complexity

CWB-treebank

CWB-treebank is efficient graph-matching software for dependency graphs (stemmata). It is the main reason why Treebank.info can answer linguistic queries so quickly.
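
The underlying task, finding all occurrences of a small query graph in a sentence's dependency graph, can be sketched with naive backtracking as below. This is only an illustration of the problem, not the algorithm CWB-treebank uses; the data layout, the function name and the example sentence are assumptions.

```python
def match_dependency_query(query_nodes, query_edges, sent_nodes, sent_edges):
    """Yield every injective mapping {query node index: token index}.

    Nodes are dicts of attributes; edges map (head, dependent) index pairs
    to a relation label. A query relation of None matches any label.
    """
    def node_ok(q, s):
        return all(sent_nodes[s].get(k) == v for k, v in query_nodes[q].items())

    def edges_ok(mapping):
        # Check every query edge whose two endpoints are already mapped.
        for (qh, qd), rel in query_edges.items():
            if qh in mapping and qd in mapping:
                actual = sent_edges.get((mapping[qh], mapping[qd]))
                if actual is None or (rel is not None and actual != rel):
                    return False
        return True

    def extend(mapping):
        q = len(mapping)  # query nodes are assigned in index order
        if q == len(query_nodes):
            yield dict(mapping)
            return
        used = set(mapping.values())
        for s in range(len(sent_nodes)):
            if s in used or not node_ok(q, s):
                continue
            mapping[q] = s
            if edges_ok(mapping):
                yield from extend(mapping)
            del mapping[q]  # backtrack

    yield from extend({})


# "She gave him a book" with Universal Dependencies style labels.
sent_nodes = [
    {"word": "She", "pos": "PRON"}, {"word": "gave", "pos": "VERB"},
    {"word": "him", "pos": "PRON"}, {"word": "a", "pos": "DET"},
    {"word": "book", "pos": "NOUN"},
]
sent_edges = {(1, 0): "nsubj", (1, 2): "iobj", (1, 4): "obj", (4, 3): "det"}

# Query: a verb with a nominal direct object.
verb_obj = list(match_dependency_query(
    [{"pos": "VERB"}, {"pos": "NOUN"}], {(0, 1): "obj"}, sent_nodes, sent_edges))
```

Backtracking of this kind is exponential in the worst case, so a search tool that has to answer queries over whole corpora needs considerably more careful candidate selection and indexing.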

Developed in collaboration with Peter Uhrig.

https://github.com/tsproisl/CWB-treebank

Usurper

Usurper is an implementation of the unsupervised dependency parser described by Søgaard (2012). The parser is language-independent and does not need any training data.

https://pypi.python.org/pypi/Usurper

Erlangen Valency Patternbank

The Erlangen Valency Patternbank by Thomas Herbst and Peter Uhrig is an online inventory of valency patterns based on the Valency Dictionary of English. I was involved in the technical design and the implementation.

http://www.patternbank.uni-erlangen.de