I am a postdoc at Uppsala University Computational Linguistics, working with Joakim Nivre and funded by the Walter Benjamin Fellowship of the German Research Foundation (DFG). Before that, I was a postdoc at UT Austin Linguistics, working with Kyle Mahowald, funded by the German Research Foundation (DFG) and the German Academic Exchange Service (DAAD).
I completed my PhD in 2024 at the Center for Information and Language Processing at LMU Munich, where my thesis was about Computational Approaches to Construction Grammar and Morphology. My supervisor was Hinrich Schütze.
Previously, I completed my B.Sc. and M.Sc. degrees in Computational Linguistics and Computer Science at LMU, with scholarships from the German Academic Scholarship Foundation and the Max Weber Program. My M.Sc. thesis, supervised by Hinrich Schütze, was on the application of Complementary Learning Systems Theory to NLP. I spent the final year of my bachelor's degree as a visiting student at Homerton College, University of Cambridge, where I wrote my B.Sc. thesis on Character-Level RNNs under the supervision of Anna Korhonen.
Current Research Interests
- Construction Grammar and NLP
- Emergent Structure in Language
- Interactions between Cognitive Linguistics and NLP
- Computational Typology and Morphosyntax
- Evaluation and Interpretability for Low-Resource Languages
Selected Publications
For a full list of publications, please see my Google Scholar profile.
Jaap Jumelet, Leonie Weissweiler, Joakim Nivre, Arianna Bisazza (2025).
MultiBLiMP 1.0: A massively multilingual benchmark of linguistic minimal pairs. Transactions of the Association for Computational Linguistics (TACL), to appear. (TACL)
Abstract:
We introduce MultiBLiMP 1.0, a massively multilingual benchmark of linguistic minimal pairs, covering 101 languages and 2 types of subject-verb agreement, containing more than 128,000 minimal pairs. Our minimal pairs are created using a fully automated pipeline, leveraging the large-scale linguistic resources of Universal Dependencies and UniMorph. MultiBLiMP evaluates abilities of LLMs at an unprecedented multilingual scale, and highlights the shortcomings of the current state-of-the-art in modelling low-resource languages.
Shijia Zhou, Leonie Weissweiler, Taiqi He, Hinrich Schütze, David Mortensen, Lori Levin (2024).
Constructions Are So Difficult That Even Large Language Models Get Them Right for the Wrong Reasons. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation. (LREC-COLING)
Abstract:
In this paper, we make a contribution that can be understood from two perspectives: from an NLP perspective, we introduce a small challenge dataset for NLI with large lexical overlap, which minimises the possibility of models discerning entailment solely based on token distinctions, and show that GPT-4 and Llama 2 fail it with strong bias. We then create further challenging sub-tasks in an effort to explain this failure. From a Computational Linguistics perspective, we identify a group of constructions with three classes of adjectives which cannot be distinguished by surface features. This enables us to probe for LLMs' understanding of these constructions in various ways, and we find that they fail in a variety of ways to distinguish between them, suggesting that they don't adequately represent their meaning or capture the lexical properties of phrasal heads.
Leonie Weissweiler, Abdullatif Köksal, Hinrich Schütze (2024).
Hybrid Human-LLM Corpus Construction and LLM Evaluation for Rare Linguistic Phenomena. arXiv preprint. (arXiv)
Abstract:
Argument Structure Constructions (ASCs) are one of the most well-studied construction groups, providing a unique opportunity to demonstrate the usefulness of Construction Grammar (CxG). For example, the caused-motion construction (CMC, "She sneezed the foam off her cappuccino") demonstrates that constructions must carry meaning, otherwise the fact that "sneeze" in this context causes movement cannot be explained. We form the hypothesis that this remains challenging even for state-of-the-art Large Language Models (LLMs), for which we devise a test based on substituting the verb with a prototypical motion verb. To be able to perform this test at statistically significant scale, in the absence of adequate CxG corpora, we develop a novel pipeline of NLP-assisted collection of linguistically annotated text. We show how dependency parsing and GPT-3.5 can be used to significantly reduce annotation cost and thus enable the annotation of rare phenomena at scale. We then evaluate GPT, Gemini, Llama2 and Mistral models for their understanding of the CMC using the newly collected corpus. We find that all models struggle with understanding the motion component that the CMC adds to a sentence.
Leonie Weissweiler*, Valentin Hofmann*, Anjali Kantharuban, Anna Cai, Ritam Dutt, Amey Hengle, Anubha Kabra, Atharva Kulkarni, Abhishek Vijayakumar, Haofei Yu, Hinrich Schütze, Kemal Oflazer, David Mortensen (2023).
Counting the Bugs in ChatGPT's Wugs: A Multilingual Investigation into the Morphological Capabilities of a Large Language Model. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. (EMNLP)
Abstract:
Large language models (LLMs) have recently reached an impressive level of linguistic capability, prompting comparisons with human language skills. However, there have been relatively few systematic inquiries into the linguistic capabilities of the latest generation of LLMs, and those studies that do exist (i) ignore the remarkable ability of humans to generalize, (ii) focus only on English, and (iii) investigate syntax or semantics and overlook other capabilities that lie at the heart of human language, like morphology. Here, we close these gaps by conducting the first rigorous analysis of the morphological capabilities of ChatGPT in four typologically varied languages (specifically, English, German, Tamil, and Turkish). We apply a version of Berko's (1958) wug test to ChatGPT, using novel, uncontaminated datasets for the four examined languages. We find that ChatGPT massively underperforms purpose-built systems, particularly in English. Overall, our results – through the lens of morphology – cast a new light on the linguistic capabilities of ChatGPT, suggesting that claims of human-like language skills are premature and misleading.
Leonie Weissweiler, Taiqi He, Naoki Otani, David R. Mortensen, Lori Levin, Hinrich Schütze (2023).
Construction Grammar Provides Unique Insight into Neural Language Models. Proceedings of the First International Workshop on Construction Grammars and NLP (CxGs+NLP, GURT/SyntaxFest 2023), pages 85–95, Washington, D.C. Association for Computational Linguistics.
Abstract:
Construction Grammar (CxG) has recently been used as the basis for probing studies that have investigated the performance of large pretrained language models (PLMs) with respect to the structure and meaning of constructions. In this position paper, we make suggestions for the continuation and augmentation of this line of research. We look at probing methodology that was not designed with CxG in mind, as well as probing methodology that was designed for specific constructions. We analyse selected previous work in detail, and provide our view of the most important challenges and research questions that this promising new field faces.
Leonie Weissweiler, Valentin Hofmann, Abdullatif Köksal, Hinrich Schütze (2022).
The Better Your Syntax, the Better Your Semantics? Probing Pretrained Language Models for the English Comparative Correlative. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10859–10882, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics. (EMNLP)
Source code on GitHub
Abstract:
Construction Grammar (CxG) is a paradigm from cognitive linguistics emphasising the connection between syntax and semantics. Rather than rules that operate on lexical items, it posits constructions as the central building blocks of language, i.e., linguistic units of different granularity that combine syntax and semantics. As a first step towards assessing the compatibility of CxG with the syntactic and semantic knowledge demonstrated by state-of-the-art pretrained language models (PLMs), we present an investigation of their capability to classify and understand one of the most commonly studied constructions, the English comparative correlative (CC). We conduct experiments examining the classification accuracy of a syntactic probe on the one hand and the models' behaviour in a semantic application task on the other, with BERT, RoBERTa, and DeBERTa as the example PLMs. Our results show that all three investigated PLMs are able to recognise the structure of the CC but fail to use its meaning. While human-like performance of PLMs on many NLP tasks has been alleged, this indicates that PLMs still suffer from substantial shortcomings in central domains of linguistic knowledge.