Research
Most of my work is in the areas of language technology, cooccurrence phenomena, and stylometry/Digital Humanities. Recurring key aspects of my research are:
- Methodology and fundamentals: My methodological work includes the analysis and improvement of established methods, e.g. Delta measures for authorship attribution, as well as the development of novel methods such as the generalized cooccurrence model. I am also involved in the creation of fundamental linguistic resources such as a part-of-speech tagset for Albanian.
- Actual implementation: My work has a strong practical component, and I am (co-)developer of several tools and web interfaces, some of which represent the state of the art in their respective fields.
- Evaluation: As part of my research, I assess the performance of methods and tools in realistic settings, both to find out which established methods and tools work best and to evaluate my own work.
Language technology
I have a strong interest in natural language processing (NLP). So far, I have been active in the following areas: tokenization, part-of-speech tagging, unsupervised dependency parsing, semantic similarity, implicit emotion recognition and sentiment analysis. I am (co-)developer of two state-of-the-art tools (tokenizer and part-of-speech tagger) for German web and social media texts.
Software
- SoMaJo: A state-of-the-art tokenizer and sentence splitter for German and English web and social media texts. https://github.com/tsproisl/SoMaJo.
- SoMeWeTa: A part-of-speech tagger with support for domain adaptation and external resources that achieves state-of-the-art results on German web and social media texts. https://github.com/tsproisl/SoMeWeTa.
- Usurper: An unsupervised dependency parser. https://pypi.python.org/pypi/Usurper.
Key publications
- Proisl, Thomas. 2018. “SoMeWeTa: A Part-of-Speech Tagger for German Social Media and Web Texts.” In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), edited by Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis, and Takenobu Tokunaga, 665–70. Miyazaki: European Language Resources Association. https://www.aclweb.org/anthology/L18-1106. [bib, pdf]
- Proisl, Thomas, and Peter Uhrig. 2016. “SoMaJo: State-of-the-Art Tokenization for German Web and Social Media Texts.” In Proceedings of the 10th Web as Corpus Workshop (WAC-X) and the EmpiriST Shared Task, edited by Paul Cook, Stefan Evert, Roland Schäfer, and Egon Stemle, 57–62. Berlin: Association for Computational Linguistics. https://doi.org/10.18653/v1/W16-2607. [bib, pdf]
- Kabashi, Besim, and Thomas Proisl. 2016. “A Proposal for a Part-of-Speech Tagset for the Albanian Language.” In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), edited by Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asunción Moreno, Jan Odijk, and Stelios Piperidis, 4305–10. Portorož: European Language Resources Association. https://www.aclweb.org/anthology/L16-1682. [bib, pdf]
- Proisl, Thomas, Stefan Evert, Paul Greiner, and Besim Kabashi. 2014. “SemantiKLUE: Robust Semantic Similarity at Multiple Levels Using Maximum Weight Matching.” In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), edited by Preslav Nakov and Torsten Zesch, 532–40. Dublin: Association for Computational Linguistics. https://doi.org/10.3115/v1/S14-2093. [bib, pdf]
- Proisl, Thomas, Paul Greiner, Stefan Evert, and Besim Kabashi. 2013. “KLUE: Simple and Robust Methods for Polarity Classification.” In Proceedings of the 7th International Workshop on Semantic Evaluation (SemEval 2013), edited by Mona T. Diab, Timothy Baldwin, and Marco Baroni, 395–401. Atlanta, GA: Association for Computational Linguistics. http://aclweb.org/anthology/S13-2065. [bib, pdf]
Cooccurrence phenomena
I am a member of the Interdisciplinary Centre for Research on Lexicography, Valency and Collocation, and a substantial part of my research is about cooccurrence phenomena such as collocations or collostructions. In my PhD thesis, I developed a formal model for analyzing cooccurrences of arbitrary linguistic structures, e.g. dependency graphs.
Software
- Pareidoscope: A collection of tools for determining the association between arbitrary linguistic structures. https://github.com/tsproisl/Pareidoscope.
- Treebank.info: A web interface that allows linguists to search for dependency structures in corpora. http://treebank.info.
- CWB-treebank: An efficient graph matching software for dependency graphs. https://github.com/tsproisl/CWB-treebank.
Key publications
- Proisl, Thomas. 2022. “Use Words, Not Constructions! A New Perspective on the Unit of Analysis in Collostructional Analysis.” International Journal of Corpus Linguistics 27 (3): 349–79. https://doi.org/10.1075/ijcl.20072.pro. [bib]
- Proisl, Thomas. 2019. The Cooccurrence of Linguistic Structures. Erlangen: FAU University Press. https://doi.org/10.25593/978-3-96147-201-7. [bib]
- Uhrig, Peter, Stefan Evert, and Thomas Proisl. 2018. “Collocation Candidate Extraction from Dependency-Annotated Corpora: Exploring Differences Across Parsers and Dependency Annotation Schemes.” In Lexical Collocation Analysis: Advances and Applications, edited by Pascual Cantos-Gómez and Moisés Almela-Sánchez, 111–40. Cham: Springer. https://doi.org/10.1007/978-3-319-92582-0_6. [bib]
- Evert, Stefan, Peter Uhrig, Sabine Bartsch, and Thomas Proisl. 2017. “E-VIEW-alation – a Large-Scale Evaluation Study of Association Measures for Collocation Identification.” In Electronic Lexicography in the 21st Century. Proceedings of the eLex 2017 Conference, edited by Iztok Kosem, Carole Tiberius, Miloš Jakubíček, Jelena Kallas, Simon Krek, and Vít Baisa, 531–49. Leiden: Lexical Computing. https://elex.link/elex2017/wp-content/uploads/2017/09/paper32.pdf. [bib, video, E-VIEW-alation, pdf]
- Uhrig, Peter, and Thomas Proisl. 2012. “Less Hay, More Needles – Using Dependency-Annotated Corpora to Provide Lexicographers with More Accurate Lists of Collocation Candidates.” Lexicographica 28 (1): 141–80. https://doi.org/10.1515/lexi.2012-0009. [bib, pdf]
- Proisl, Thomas, and Peter Uhrig. 2012. “Efficient Dependency Graph Matching with the IMS Open Corpus Workbench.” In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), edited by Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Ugur Dogan, Bente Maegaard, Joseph Mariani, Jan Odijk, and Stelios Piperidis, 2750–6. Istanbul: European Language Resources Association. http://www.lrec-conf.org/proceedings/lrec2012/pdf/709_Paper.pdf. [bib, pdf]
Stylometry and Digital Humanities
I am a member of the Interdisciplinary Centre for Digital Humanities and a collaborator in the BMBF eHumanities Zentrum Kallimachos. My research areas include authorship attribution and linguistic and stylistic complexity.
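For readers unfamiliar with Delta measures, the following is a minimal sketch of the classic Burrows's Delta as it is commonly described: relative frequencies of the most frequent words are standardized to z-scores across the corpus, and the distance between two texts is the mean absolute difference of their z-scores. It is an illustration only, not the implementation used in the publications below. The function names, the toy texts, and the use of the population standard deviation are assumptions made for this example.

```python
from collections import Counter
from statistics import mean, pstdev


def relative_frequencies(tokens, vocabulary):
    """Relative frequency of each vocabulary word in one text."""
    counts = Counter(tokens)
    return [counts[word] / len(tokens) for word in vocabulary]


def burrows_delta(texts, n_mfw=3):
    """Pairwise Burrows's Delta for a dict mapping text name -> token list.

    n_mfw is the number of most frequent words (hypothetical default,
    real studies typically use hundreds or thousands).
    """
    # Most frequent words over the whole corpus
    corpus_counts = Counter(tok for toks in texts.values() for tok in toks)
    vocabulary = [word for word, _ in corpus_counts.most_common(n_mfw)]

    freqs = {name: relative_frequencies(toks, vocabulary)
             for name, toks in texts.items()}

    # Standardize each word's frequencies across all texts (z-scores).
    # Assumption: population standard deviation; words with zero
    # variance contribute nothing to the distance.
    columns = list(zip(*freqs.values()))
    means = [mean(col) for col in columns]
    sds = [pstdev(col) for col in columns]
    z = {name: [(f - m) / s if s > 0 else 0.0
                for f, m, s in zip(fs, means, sds)]
         for name, fs in freqs.items()}

    # Delta = mean absolute difference of z-scores for each pair of texts
    names = sorted(texts)
    return {(a, b): mean(abs(x - y) for x, y in zip(z[a], z[b]))
            for i, a in enumerate(names) for b in names[i + 1:]}


# Toy corpus (invented for illustration): two texts sharing function
# words and one text with a different function-word profile.
toy_texts = {
    "a1": "the cat and the dog and the bird".split(),
    "a2": "the dog and the cat and the fish".split(),
    "b1": "a cat or a dog or a bird or a fish".split(),
}
deltas = burrows_delta(toy_texts, n_mfw=3)
for pair, delta in sorted(deltas.items()):
    print(pair, round(delta, 3))
```

In this toy corpus the two texts with the same function-word profile end up closer to each other than to the third text, which is the intuition behind using Delta for authorship attribution.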
Software
- Textcomplexity: A collection of measures that assess the linguistic and stylistic complexity of (literary) texts. https://github.com/tsproisl/textcomplexity.
Key publications
- Proisl, Thomas, Stefan Evert, Fotis Jannidis, Christof Schöch, Leonard Konle, and Steffen Pielström. 2018. “Delta vs. N-Gram Tracing: Evaluating the Robustness of Authorship Attribution Methods.” In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), edited by Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis, and Takenobu Tokunaga, 3309–14. Miyazaki: European Language Resources Association. https://www.aclweb.org/anthology/L18-1523. [bib, pdf]
- Büttner, Andreas, Friedrich Michael Dimpel, Stefan Evert, Fotis Jannidis, Steffen Pielström, Thomas Proisl, Isabella Reger, Christof Schöch, and Thorsten Vitt. 2017. “‚Delta‘ in der stilometrischen Autorschaftsattribution.” Zeitschrift für digitale Geisteswissenschaften. https://doi.org/10.17175/2017_006. [bib]
- Evert, Stefan, Thomas Proisl, Fotis Jannidis, Isabella Reger, Steffen Pielström, Christof Schöch, and Thorsten Vitt. 2017. “Understanding and Explaining Delta Measures for Authorship Attribution.” Digital Scholarship in the Humanities 32 (suppl_2): ii4–ii16. https://doi.org/10.1093/llc/fqx023. [bib]
- Evert, Stefan, Thomas Proisl, Thorsten Vitt, Christof Schöch, Fotis Jannidis, and Steffen Pielström. 2015. “Towards a Better Understanding of Burrows’s Delta in Literary Authorship Attribution.” In Proceedings of the Fourth Workshop on Computational Linguistics for Literature (CLfL 2015), edited by Anna Feldman, Anna Kazantseva, Stan Szpakowicz, and Corina Koolen, 79–88. Denver, CO: Association for Computational Linguistics. https://doi.org/10.3115/v1/W15-0709. [bib, pdf]