Lexical exponents predicting English Wiktionary consultations


About the project

The aim of this project is to establish what lexical factors make dictionary users more likely to look up specific entries. We will use the English Wiktionary because it is the only extensively used general dictionary whose usage data are freely available. Recent findings (De Schryver, Wolfer & Lew 2019) suggest that lexical frequency is a significant predictor of look-up behaviour, with more frequent words being more likely to be consulted. We plan to explore at least three further lexical factors: (1) age of acquisition; (2) lexical prevalence; and (3) degree of polysemy, operationalized as the number of dictionary senses.

To do that, we need to obtain and then combine a number of datasets. First, we plan to acquire extensive logs of user visits to the English Wiktionary, then clean and process them to extract consultation frequency information for all Wiktionary headwords: this will be our explained (= outcome, response) variable. As for the predictors, lexical frequency will be established by building lemma frequency lists from large contemporary corpora of English (enTenTen), cross-checked against existing frequency lists. Age of acquisition and lexical prevalence are both available for English words thanks to recent publications. Finally, the number of senses will be obtained by parsing and counting senses in the English Wiktionary. All these data need to be linked to headword lemmas.

Once this is done, the modelling will begin. We plan to use the more traditional regression models as well as alternative non-regression approaches suggested in the recent literature. Model selection will follow, using criteria that are generally accepted in the relevant literature. We believe it is of theoretical interest to know what makes dictionary users look up words. This knowledge will also be practically useful to lexicographers, telling them which lexical items should be prioritized in lexicographic work.
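The data-linking and modelling steps described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy values are invented, the names `link_by_lemma` and `ols_fit` are hypothetical, and a one-predictor least-squares fit on log-transformed counts merely stands in for whatever regression and non-regression models the project actually uses.

```python
import math

# Toy, invented values for illustration only; these are NOT project data.
lookups = {"set": 1200, "run": 950, "dog": 400, "zyzzyva": 15}   # look-ups per headword
frequency = {"set": 5000, "run": 4200, "dog": 900, "cat": 850}   # corpus lemma counts
n_senses = {"set": 40, "run": 35, "dog": 8, "zyzzyva": 1}        # dictionary senses


def link_by_lemma(outcome, **predictors):
    """Keep only lemmas found in the outcome table and in every predictor table."""
    shared = set(outcome)
    for table in predictors.values():
        shared &= set(table)
    return {
        lemma: {
            "lookups": outcome[lemma],
            **{name: table[lemma] for name, table in predictors.items()},
        }
        for lemma in sorted(shared)
    }


def ols_fit(xs, ys):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


linked = link_by_lemma(lookups, frequency=frequency, n_senses=n_senses)

# Regress log look-ups on log corpus frequency for the linked lemmas.
log_freq = [math.log(row["frequency"]) for row in linked.values()]
log_lookups = [math.log(row["lookups"]) for row in linked.values()]
slope, intercept = ols_fit(log_freq, log_lookups)

print(sorted(linked))
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

In this sketch, lemmas missing from any one table are dropped, which is only one possible way of handling incomplete data; a positive slope on the toy values would correspond to more frequent words being looked up more often.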

Robert Lew (right), poster (middle) and Sascha Wolfer (left) at EURALEX 2022 in Mannheim
Robert Lew at eLex 2023 in Brno

Project team

Principal Investigator: Robert Lew

Co-Investigator: Sascha Wolfer

Brno city centre
Sascha Wolfer at eLex 2023

Presentations

Poster presented at the 20th Euralex International Congress: Dictionaries and Society, Mannheim, Germany, July 12-16, 2022. [book of abstracts]
Talk "The Dark Side of the Wiktionary" presented at the 8th eLex conference on lexicography in the 21st century: invisible lexicography, Brno, Czech Republic, June 27-29, 2023. [slides/book of abstracts]
Talk "CEFR Vocabulary Level as a Predictor of User Interest in English Wiktionary Entries" presented at Linking Lexicographic and Language Learning Resources (4LR), a workshop at the 4th conference on Language, Data and Knowledge (LDK), Vienna, Austria, September 13, 2023. Topic: integrating information on the Common European Framework of Reference for Languages (CEFR) with Wiktionary look-up data.
Talk "Leveraging Dictionary Look-Up Behaviour to Supplement CEFR Vocabulary Lists" presented by Sascha Wolfer at the 21st Euralex International Congress: Lexicography and Semantics, Cavtat, Croatia, October 8-12, 2024 (talk delivered on October 8). The talk was streamed online, and a video recording synchronized with slides is publicly available at: https://videolectures.net/videos/euralex2024_cavtat_wolfer_look_up_behavior

Publications

Lew, R., & Wolfer, S. (2024). What Lexical Factors Drive Look-Ups in the English Wiktionary? SAGE Open, 14(1), 21582440231219101. https://doi.org/10.1177/21582440231219101
Lew, R., & Wolfer, S. (2024). CEFR vocabulary level as a predictor of user interest in English Wiktionary entries. Humanities and Social Sciences Communications, 11, 340. https://doi.org/10.1057/s41599-024-02838-4

Events

Open lecture: Sascha Wolfer, ‘What dictionary look-up statistics can tell us: Predicting the CEFR level of words via Wiktionary look-ups’, 2024-05-21, Universität Hildesheim.
Open lecture: Robert Lew, "Uncovering Patterns in Dictionary Look-Up Behaviour: Machine Learning Meets CEFR Vocabulary Levels", "WA Lunch Talks" series, 2024-12-11, Aula Hrynakowskiego, AMU Faculty of English, Grunwaldzka 6, Poznań.

Funding

Project funded by the National Science Centre (NCN) under contract UMO-2020/39/B/HS2/00923.