Lexical exponents predicting English Wiktionary consultations
About the project
The aim of this project is to establish which lexical factors make dictionary users more likely to look up particular entries. We will use the English Wiktionary because it is the only extensively used general dictionary whose usage data are freely available. Recent findings (De Schryver, Wolfer & Lew 2019) suggest that lexical frequency is a significant predictor of look-up behaviour, with more frequent words being more likely to be consulted. We plan to explore at least three further lexical factors: (1) age of acquisition; (2) lexical prevalence; and (3) degree of polysemy, operationalized as the number of dictionary senses.
To be able to do that, we need to obtain and then combine a number of datasets. First, we plan to acquire extensive logs of user visits to the English Wiktionary, then clean and process them to extract consultation frequency information for all Wiktionary headwords: this will be our explained (= outcome, response) variable. With regard to the potential predictors, lexical frequency will be established by building lemma frequency lists from large contemporary corpora of English (enTenTen), double-checked against existing frequency lists. Age of acquisition and lexical prevalence are both available for English words thanks to recent publications. Finally, the number of senses will be obtained by parsing and counting senses in the English Wiktionary. All these data need to be linked to headword lemmas.
Once this is done, the modelling will begin. We plan to use the more traditional regression models as well as alternative non-regression approaches suggested in the recent literature. Model selection will follow, using criteria that are generally accepted in the relevant literature. We believe it is of theoretical interest to know what makes dictionary users look up words. This knowledge will also be practically useful to lexicographers, telling them which lexical items should be prioritized in lexicographic work.
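The linking step described above can be pictured with a minimal Python sketch. It is an illustration only, not the project's actual pipeline: the function names `link_by_lemma` and `ols_slope_intercept` are hypothetical, every number in the toy tables is invented purely for demonstration, and a single-predictor least-squares fit on log-transformed values stands in for the richer models the project plans to use.

```python
import math


def link_by_lemma(lookups, **predictors):
    """Inner-join per-lemma tables: keep only lemmas present in every source."""
    shared = set(lookups)
    for table in predictors.values():
        shared &= set(table)
    rows = []
    for lemma in sorted(shared):
        row = {"lemma": lemma, "lookups": lookups[lemma]}
        for name, table in predictors.items():
            row[name] = table[lemma]
        rows.append(row)
    return rows


def ols_slope_intercept(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Toy tables keyed by headword lemma. All values are made up for illustration.
lookups = {"set": 900, "run": 700, "serendipity": 400, "the": 300}
frequency = {"set": 5000, "run": 8000, "serendipity": 12, "cat": 3000}
senses = {"set": 40, "run": 35, "serendipity": 1, "the": 5}

rows = link_by_lemma(lookups, frequency=frequency, senses=senses)

# Log-transform both variables before fitting (an assumption of this sketch).
log_freq = [math.log(r["frequency"]) for r in rows]
log_lookups = [math.log(r["lookups"]) for r in rows]
slope, intercept = ols_slope_intercept(log_freq, log_lookups)
print(rows)
print(slope, intercept)
```

The inner join drops any lemma that is missing from one of the sources ("the" and "cat" in the toy data), which is one simple way to guarantee that every modelled headword has a value for each predictor.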
Project team
Principal Investigator: Robert Lew
Co-Investigator: Sascha Wolfer
Presentations
- Poster presented at the 20th Euralex International Congress: Dictionaries and Society, Mannheim, Germany, July 12-16, 2022. [book of abstracts]
- Talk "The Dark Side of the Wiktionary" presented at the 8th eLex conference on lexicography in the 21st century: invisible lexicography, Brno, Czech Republic, June 27-29, 2023. [slides/book of abstracts]
- Talk "CEFR Vocabulary Level as a Predictor of User Interest in English Wiktionary Entries" presented at Linking Lexicographic and Language Learning Resources (4LR), Workshop at the 4th conference on Language, Data and Knowledge (LDK), Vienna, Austria, September 13, 2023 (Integration of information on the Common European Framework of Reference for Languages (CEFR) and Wiktionary look-up data)
Publications
- Lew, R., & Wolfer, S. (2024). What Lexical Factors Drive Look-Ups in the English Wiktionary? SAGE Open, 14(1), 21582440231219101. https://doi.org/10.1177/21582440231219101
- Lew, R., & Wolfer, S. (2024). CEFR vocabulary level as a predictor of user interest in English Wiktionary entries. Humanities and Social Sciences Communications, 11, 340. https://doi.org/10.1057/s41599-024-02838-4
Upcoming events
- Open lecture: Sascha Wolfer, ‘What dictionary look-up statistics can tell us: Predicting the CEFR level of words via Wiktionary look-ups’, 2024-05-21, Universität Hildesheim.
- Euralex 2024
Funding
Project funded by the National Science Centre (NCN) under contract UMO-2020/39/B/HS2/00923.