Lexical exponents predicting English Wiktionary consultations

About the project

The aim of this project is to establish which lexical factors make dictionary users more likely to look up particular entries. We will use the English Wiktionary because it is the only extensively used general dictionary with freely available usage data. Recent findings (De Schryver, Wolfer & Lew 2019) suggest that lexical frequency is a significant predictor of look-up behaviour, with more frequent words being more likely to be consulted. We plan to explore at least three further lexical factors: (1) age of acquisition; (2) lexical prevalence; and (3) degree of polysemy, operationalized as the number of dictionary senses.

To be able to do that, we need to obtain and then combine a number of datasets. First, we plan to acquire extensive logs of user visits to the English Wiktionary, then clean and process them to extract consultation frequencies for all Wiktionary headwords: this will be our explained (= outcome, response) variable. With regard to the potential predictors, lexical frequency will be established by building lemma frequency lists from large contemporary corpora of English (enTenTen), cross-checked against existing frequency lists. Age of acquisition and lexical prevalence are both available for English words thanks to recent publications. Finally, the number of senses will be obtained by parsing and counting senses in the English Wiktionary. All these data need to be linked to headword lemmas.

Once this is done, the modelling will begin. We plan to use the more traditional regression models as well as alternative non-regression approaches suggested in the recent literature. Model selection will follow, using criteria that are generally accepted in the relevant literature.

We believe it is of theoretical interest to know what makes dictionary users look up words. This knowledge will also be practically useful to lexicographers, telling them which lexical items should be prioritized in lexicographic work.
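
The linking step described above can be illustrated with a minimal Python sketch. It is not the project's actual pipeline: all values are invented toy numbers, and the names `link_by_lemma` and `simple_ols` are hypothetical helpers introduced only for this illustration. The sketch assumes each dataset has already been reduced to a table keyed by headword lemma, keeps only lemmas present in every table, and then fits a one-predictor least-squares line of log consultations on log frequency as a stand-in for the fuller regression models mentioned in the text.

```python
import math

# Toy, invented values for illustration only; not real project data.
views = {"run": 5200, "set": 4100, "cat": 1500, "serendipity": 900, "the": 12000}
frequency = {"run": 310.0, "set": 280.0, "cat": 45.0, "serendipity": 0.4, "the": 50000.0}
age_of_acquisition = {"run": 4.1, "set": 5.6, "cat": 3.2, "serendipity": 13.5}
prevalence = {"run": 2.4, "set": 2.3, "cat": 2.5, "serendipity": 1.6, "the": 2.5}
n_senses = {"run": 90, "set": 70, "cat": 20, "serendipity": 3, "the": 8}


def link_by_lemma(outcome, predictors):
    """Inner-join per-lemma tables, keeping only lemmas present in every table."""
    shared = set(outcome)
    for table in predictors.values():
        shared &= set(table)
    rows = []
    for lemma in sorted(shared):
        row = {"lemma": lemma, "views": outcome[lemma]}
        for name, table in predictors.items():
            row[name] = table[lemma]
        rows.append(row)
    return rows


def simple_ols(xs, ys):
    """Return (intercept, slope) of the least-squares line for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


rows = link_by_lemma(
    views,
    {
        "frequency": frequency,
        "aoa": age_of_acquisition,
        "prevalence": prevalence,
        "n_senses": n_senses,
    },
)

# Counts and frequencies are heavily skewed, so both are log-transformed here
# (an assumption of this sketch, not a documented project decision).
log_freq = [math.log10(r["frequency"]) for r in rows]
log_views = [math.log10(r["views"]) for r in rows]
intercept, slope = simple_ols(log_freq, log_views)

print([r["lemma"] for r in rows])
print(f"slope of log10(views) on log10(frequency): {slope:.3f}")
```

In this toy example the lemma "the" drops out because it has no age-of-acquisition value, which shows how an inner join restricts the analysis to headwords covered by every dataset.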

Robert Lew (right), poster (middle) and Sascha Wolfer (left) at EURALEX 2022 in Mannheim

Robert Lew at eLex 2023 in Brno

Project team

Principal Investigator: Robert Lew

Co-Investigator: Sascha Wolfer

Brno city centre

Sascha Wolfer at eLex 2023

Presentations

Publications

Upcoming events

  • Open lecture: Sascha Wolfer, ‘What dictionary look-up statistics can tell us: Predicting the CEFR level of words via Wiktionary look-ups’, 2024-05-21, Universität Hildesheim.
  • EURALEX 2024

Project formals

Project funded by the National Science Centre (NCN) under contract UMO-2020/39/B/HS2/00923.
