M.A. programme “Language, Mind, Technology”: First year of study: Theme seminars — summer term 2025–2026

What is this list?

This is a list of theme seminars we intend to launch in the summer term (October–February) of the academic year 2025–2026 in our full-time M.A. programme Language, Mind, Technology. (For information on subject seminars for English Philology, follow this link.)

How to navigate the list?

The list is sorted by the teacher's name. Each entry has the following format: the title of the course, the name of the teacher, and the description of the course.


Speech-based AI — an introduction

dr Zofia Malisz, KTH Royal Institute of Technology, Stockholm

The Syllabus

  1. Course organization
    1. Course description
    This course is an introduction to artificial intelligence systems and applications that use speech as input. We introduce current topics in speech analysis, conversion and recognition, and take a more in-depth look at text-to-speech (TTS) synthesis, treating it as a model of Speech AI. Additionally, we revise acoustic phonetics, signal processing and machine learning as they are applied in the context of Speech AI.

    2. Course aims
    The course aims to introduce students in the programme to the application, impact and technological implementation of modern Speech AI systems, such as data-driven speech synthesis and analysis. A second aim is to enable students from different backgrounds, including the humanities, to understand and interpret papers in Speech AI and, prospectively, to manage basic Speech AI projects.

    3. Realization of the aims
    The course runs for one semester. We realise the aims through lectures, short diagnostic quizzes (pass/fail), group discussions of selected problems, and a final student presentation on a selected paper in Speech AI.

    4. Course contents
    The course first introduces the field of speech technology (progress and legacy systems, modern data-driven speech synthesisers, speech input-output representations, and current challenges, incl. controllability). It also re-introduces basic statistical concepts relevant to machine learning, situating them in the context of Speech AI, incl. state-of-the-art deep learning architectures (e.g. GANs). Knowledge of phonetics and digital signal processing relevant to understanding input and output representations in Speech AI is revised. The course treats TTS as a model example of Speech AI. It introduces the universal TTS engineering pipeline step by step: text processing, prediction engine, and waveform generation. It then explores this pipeline within each contemporary speech synthesis paradigm, from legacy systems and unit selection, through statistical parametric (SPSS) and hybrid synthesisers, to end-to-end systems.
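As a taster of the pipeline named above (text processing, prediction engine, waveform generation), the three stages can be sketched as function composition. Everything below is an invented toy for illustration, not a real synthesiser or part of the course materials:

```python
# A minimal sketch of the universal TTS pipeline: three stages composed in order.
# All function bodies are illustrative placeholders.

def text_processing(text: str) -> list[str]:
    """Front end: normalise the text and map it to phone-like symbols."""
    # Toy normalisation: lowercase, drop punctuation, one "phone" per letter.
    cleaned = "".join(ch for ch in text.lower() if ch.isalpha() or ch.isspace())
    return [ch for word in cleaned.split() for ch in word]

def prediction_engine(phones: list[str]) -> list[float]:
    """Predict an acoustic feature per symbol (here: a made-up duration in seconds)."""
    return [0.08 + 0.01 * (ord(p) % 5) for p in phones]

def waveform_generation(durations: list[float], sample_rate: int = 16000) -> int:
    """Turn features into audio; here we only compute the resulting sample count."""
    return int(round(sum(durations) * sample_rate))

# The three stages compose into a pipeline:
n_samples = waveform_generation(prediction_engine(text_processing("Hello, world!")))
```

Real systems differ in how much of this pipeline is learned end to end, which is exactly the dimension along which the course compares synthesis paradigms.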

    Week | Topic | Assignment
    1 | Tour of the course and syllabus. What is speech AI? | Diagnostic "zero" quiz
    2 | Legacy synthesis systems, e.g. OVE, KlattStat. Progress in speech technology, e.g. WaveNet: how did we get here? | Preparatory reading: Malisz et al. (2019)
    3 | Unit selection, HMM-based and hybrid TTS systems and their impact | Preparatory reading: Taylor (2009)
    4 | Modern data-driven speech synthesisers: where do the improvements come from? | Preparatory reading: Watts et al. (2016)
    5 | SPSS: a modern system where TTS is treated as a regression problem | Quiz, preparatory reading
    6 | Gentle introduction to digital signal processing | Quiz, preparatory reading
    7 | Speech perception. How does the human ear perceive sound? Physical and perceptual measures of loudness and pitch. The mel scale | Preparatory reading
    8 | Speech output representations in TTS. Speech in the time domain vs. the time-frequency domain; types of spectrograms; generating spectrograms via the SDFT | Quiz, preparatory reading
    9 | Speech input representations in TTS. Text normalisation. Lexical representation. Festival and other front-end tools | Preparatory reading
    10 | Deep neural networks in TTS. Relationship to regression and classification, and other refreshers of the basics; WaveNet; Tacotron | Preparatory reading
    11 | Speech analysis. Clustering and predictive models for speech analysis. Edyson as an example of ML-assisted annotation | Quiz, preparatory reading
    12 | Automatic speech recognition | Preparatory reading
    13 | Selected reading. Preparation for presentations | Quiz, preparatory reading
    14 | Student presentations |
    15 | Student presentations |
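The mel scale from week 7 maps physical frequency onto perceived pitch. A minimal sketch, using the common O'Shaughnessy formula (a standard convention, not taken from the course materials):

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Convert frequency in Hz to mels (O'Shaughnessy formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m: float) -> float:
    """Inverse mapping: mels back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# By construction, 1000 Hz sits at approximately 1000 mel; equal mel steps
# compress high frequencies relative to low ones, mirroring human pitch perception.
```

This mapping underlies the mel spectrograms that reappear in weeks 8 and 10 as input-output representations for neural TTS models.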
  2. Requirements
    1. Attendance
    You are allowed one day of unexcused absence per semester; an excused absence means, e.g., sick leave confirmed by a doctor's note. If you exceed the limit, course credit may be denied.

    2. Grades
    Pass — you have the required attendance, you are active in class (incl. taking the quizzes), you have passed all tests, and you have given a presentation on a scientific paper in Speech AI, chosen by you from a list provided by the teacher, in a manner that shows your independent understanding of the paper.

      Fail — you do not have the required attendance, or you are not active in class at all, or you have not passed all tests, or you have not given a presentation as described above.

  3. Materials and references
    1. Materials
    My syllabus and the articles for discussion will be available on the Moodle platform.

    2. Course references
  • Jurafsky, D., & Martin, J. H. (2009). Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition. Upper Saddle River, NJ: Pearson Prentice Hall.
  • Taylor, P. (2009). Text-to-speech synthesis. Cambridge University Press.
  • Trouvain, J., & Möbius, B. (2020). Speech synthesis: text-to-speech conversion and artificial voices. Handbook of the Changing World Language Map, 3837-3851.
  • Johnson, K. (2004). Acoustic and auditory phonetics. Phonetica, 61(1), 56-58.
  • Black, A., Taylor, P., Caley, R., & Clark, R. (1998). The Festival speech synthesis system
  • Van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., ... & Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
  • Wang, Y., Skerry-Ryan, R. J., Stanton, D., Wu, Y., Weiss, R. J., Jaitly, N., ... & Saurous, R. A. (2017). Tacotron: Towards end-to-end speech synthesis. arXiv preprint arXiv:1703.10135.
  • King, S. (2015, August). A reading list of recent advances in speech synthesis. In Proceedings of the 18th International Congress of Phonetic Sciences, Glasgow, UK (Paper No. 1043).
  • Wagner, P., Beskow, J., Betz, S., Edlund, J., Gustafson, J., Eje Henter, G., ... & Voße, J. (2019). Speech synthesis evaluation—state-of-the-art assessment and suggestion for a novel research program. In Proceedings of the 10th Speech Synthesis Workshop (SSW10).
  • Watts, O., Henter, G. E., Merritt, T., Wu, Z., & King, S. (2016, March). From HMMs to DNNs: where do the improvements come from?. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5505-5509). IEEE.
  • Malisz, Z., Henter, G. E., Valentini-Botinhao, C., Watts, O., Beskow, J., & Gustafson, J. (2019). Modern speech synthesis for phonetic sciences: A discussion and an evaluation. In Proceedings of the International Congress of Phonetic Sciences (ICPhS 2019), 5-9 August 2019, Melbourne, Australia.
  • Fallgren, P., Malisz, Z., & Edlund, J. (2019). How to annotate 100 hours in 45 minutes. In Interspeech 2019, 15-19 September 2019, Graz (pp. 341-345). ISCA.
  • King, S. (2003). Dependence and independence in automatic speech recognition and synthesis. Journal of Phonetics, 31(3-4), 407-411.
  • Office hours
    All consultations regarding the course and student presentations take place during the seminar. I may be available for a few minutes after each seminar for questions; otherwise, by individual appointment.

  • Contact
    With any course-related enquiries, feel free to email me at: malisz at kth dot se


    Speech Representation and Modeling

    dr Zofia Malisz, KTH Royal Institute of Technology, Stockholm

    Course Description

    This course introduces the study of speech from the perspective of phonetics, phonological representation, and formal modeling. It focuses on how speech is structured, represented, and analyzed, with particular attention to acoustic-phonetic detail and linguistic interpretation.

    Speech synthesis is treated as an analytical case study for examining how linguistic structure is formalized and mapped onto acoustic realization. Rather than emphasizing implementation, the course concentrates on representational assumptions, modeling decisions, and the relationship between linguistic categories and phonetic detail.

    Students revisit foundational concepts in acoustic phonetics, speech segmentation, prosody, and variability in spoken language. The course situates speech processing within broader questions about how linguistic knowledge can be formally described and operationalized.

    Course Aims

    The aim of the course is to develop students’ understanding of how speech is represented at different levels of linguistic structure (phonetic, phonological, prosodic) and how these representations can be modeled in formal systems.

    A further aim is to strengthen students’ ability to read and critically interpret research literature dealing with speech representation and modeling, particularly where linguistic and statistical perspectives intersect.

    Course Realization

    The course runs for one semester and consists of lectures, short diagnostic quizzes (pass/fail), and guided discussions of selected research papers. Students prepare a final presentation analyzing a chosen publication related to speech representation or modeling.

    Course Contents

    The course begins with an overview of the linguistic structure of speech, including segmental and suprasegmental organization, acoustic correlates of phonological categories, and variability in spoken language.

    Core topics include:

    • Acoustic phonetics and speech signal properties
    • Speech segmentation and representation
    • Prosody and intonation
    • Statistical modeling as applied to speech data
    • Historical and contemporary approaches to speech synthesis viewed as formal models of linguistic knowledge

    Text-to-speech systems are examined as structured pipelines that require explicit representation of linguistic information, including text normalization, phonological encoding, and mapping to acoustic output. Different paradigms are discussed primarily in terms of how they formalize linguistic structure and manage variability.
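The representational steps named above (text normalisation, phonological encoding, mapping to acoustic output) can be made concrete with a small sketch. The lexicon entries, normalisation rules, and duration values below are invented for illustration and are not part of the course materials:

```python
# Toy illustration of the representational pipeline: each stage makes an
# explicit modeling decision about how linguistic structure is encoded.

LEXICON = {"two": ["t", "uw"], "cats": ["k", "ae", "t", "s"]}  # hypothetical G2P entries

def normalise(text: str) -> list[str]:
    """Text normalisation: expand non-standard words (here: digits) and tokenise."""
    number_names = {"2": "two"}
    tokens = text.lower().replace(",", "").split()
    return [number_names.get(t, t) for t in tokens]

def encode(words: list[str]) -> list[str]:
    """Phonological encoding: look up a phone sequence for each word."""
    return [p for w in words for p in LEXICON.get(w, [])]

def acoustic_targets(phones: list[str]) -> list[tuple[str, float]]:
    """Map each phone to a (made-up) acoustic duration target in seconds."""
    return [(p, 0.09 if len(p) > 1 else 0.05) for p in phones]

targets = acoustic_targets(encode(normalise("2 cats")))
```

Each stage embodies a representational assumption: that "2" and "two" are the same linguistic object, that words decompose into discrete phonological categories, and that those categories map onto continuous phonetic detail. These are precisely the assumptions the course puts under scrutiny.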