Self-Organising Maps and Embodied Conversational Agents for Computer Aided Language Learning

Author: Tom Anderson

Anderson, Tom, 2019, Self-Organising Maps and Embodied Conversational Agents for Computer Aided Language Learning, Flinders University, College of Science and Engineering.

Terms of Use: This electronic version is (or will be) made publicly available by Flinders University in accordance with its open access policy for student theses. Copyright in this thesis remains with the author. You may use this material for uses permitted under the Copyright Act 1968. If you are the owner of any included third-party copyright material, and/or you believe that any material has been made available without the permission of the copyright owner, please contact Flinders University with the details.


Learning a language with a computer can be more affordable, personalisable, and repeatable than learning with a human teacher, but delivering fully on those capabilities remains a challenge. Towards this goal, two problems in computer-aided language learning were addressed, with an emphasis on pronunciation: (1) the evaluation and visualisation of pronunciation using Self-Organising Maps (SOMs), and (2) the use of Embodied Conversational Agents (ECAs) for language learning.

SOMs are a type of neural network, known for their unsupervised learning capabilities, that uses a competitive algorithm for classification, clustering, and visualisation. The Kohonen Neural Phonetic Typewriter (KNPT) is a niche application of the SOM algorithm to the phonemes of speech. A review of the KNPT literature from the 1980s to the present reveals common practices and identifies gaps.
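The competitive algorithm referred to above can be sketched as the classic online SOM update: for each input, find the best-matching unit (BMU) and pull it and its neighbours toward the input, with the learning rate and neighbourhood shrinking over time. This is a minimal illustrative sketch, not the thesis implementation; all hyperparameter defaults are assumed for the example.

```python
import numpy as np

def train_som(data, rows=10, cols=10, epochs=20, lr0=0.5, sigma0=None, seed=0):
    """Train a 2-D self-organising map with the classic online update rule.

    data: (n_samples, n_features) array.  Returns node weights of shape
    (rows, cols, n_features).  Hyperparameters are illustrative defaults.
    """
    rng = np.random.default_rng(seed)
    n, d = data.shape
    weights = rng.standard_normal((rows, cols, d)) * 0.1
    # Grid coordinates, used for neighbourhood distances on the map.
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                indexing="ij"), axis=-1)
    if sigma0 is None:
        sigma0 = max(rows, cols) / 2.0
    t, t_max = 0, epochs * n
    for _ in range(epochs):
        for i in rng.permutation(n):
            x = data[i]
            # 1. Competition: find the best-matching unit (BMU).
            dists = np.linalg.norm(weights - x, axis=-1)
            bmu = np.unravel_index(np.argmin(dists), dists.shape)
            # 2. Cooperation: Gaussian neighbourhood that shrinks over time.
            frac = t / t_max
            sigma = sigma0 * (1.0 - frac) + 1e-9
            lr = lr0 * (1.0 - frac)
            grid_d2 = np.sum((grid - np.array(bmu)) ** 2, axis=-1)
            h = np.exp(-grid_d2 / (2 * sigma ** 2))
            # 3. Adaptation: pull neighbourhood weights toward the input.
            weights += (lr * h)[..., None] * (x - weights)
            t += 1
    return weights
```

Because the neighbourhood update preserves topology, nearby map nodes end up representing acoustically similar inputs, which is what makes the trained map usable as a visualisation.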

Towards a practical evaluation and visualisation of pronunciation, parameters including feature representations, normalisation, and map sizes were explored via grid search and k-fold cross-validation. The use of the KNPT for the speech of a single speaker was explored. Mel-frequency cepstral coefficients (MFCCs) were found to enable significantly better classification of frames than spectrograms. The KNPT was also found to benefit from maps (e.g. 50x50) larger than those reported in the literature (which seldom exceed 25x25).
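The parameter exploration described above can be sketched as a generic grid search with k-fold cross-validation. The `evaluate` callback, the parameter names, and the scoring convention below are all hypothetical placeholders, not the thesis code; in practice `evaluate` would train a KNPT with the given parameters and return a validation score.

```python
import numpy as np
from itertools import product

def kfold_splits(n, k=5, seed=0):
    """Yield (train_idx, val_idx) index pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, val

def grid_search(evaluate, grid, n_samples, k=5):
    """Return (best_params, best_mean_score) over all grid combinations.

    `evaluate(params, train_idx, val_idx)` is user-supplied: it trains a
    model with `params` and returns a validation score (higher = better).
    """
    best = (None, -np.inf)
    keys = sorted(grid)
    for values in product(*(grid[key] for key in keys)):
        params = dict(zip(keys, values))
        scores = [evaluate(params, tr, va)
                  for tr, va in kfold_splits(n_samples, k)]
        mean = float(np.mean(scores))
        if mean > best[1]:
            best = (params, mean)
    return best
```

Averaging each combination's score over the k folds reduces the chance that a parameter setting wins merely because of one lucky train/validation split.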

A multi-layer architecture was implemented, and two methods of subsetting data in the first layer were compared: auxiliary maps for disambiguating easily confused phonemes versus maps for silence, vowels, and consonants. The latter performed better. The multi-layer KNPT was then used to investigate the phonemes of Australian English across different demographic groups in AusTalk, a large corpus. An experiment using the KNPT for voice activity detection (VAD) demonstrated that even small maps (6x6) can perform this task.
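Using a trained map as a classifier, KNPT-style, amounts to a calibration step: each node is labelled with the majority class of the training frames for which it is the best-matching unit, and new frames then inherit the label of their winning node. The sketch below illustrates this for the two-class VAD case; the function names and the feature setup are illustrative assumptions, and for brevity the 6x6 map's weights are initialised by sampling training frames rather than fully trained.

```python
import numpy as np

def label_nodes(weights, frames, labels, n_classes=2):
    """KNPT-style calibration: give each map node the majority label of the
    training frames for which it is the best-matching unit (BMU)."""
    rows, cols, d = weights.shape
    flat = weights.reshape(-1, d)
    votes = np.zeros((rows * cols, n_classes), dtype=int)
    for x, y in zip(frames, labels):
        bmu = int(np.argmin(np.linalg.norm(flat - x, axis=1)))
        votes[bmu, int(y)] += 1
    return votes.argmax(axis=1).reshape(rows, cols)

def classify_frames(weights, node_labels, frames):
    """Classify each frame with the label of its best-matching node."""
    flat = weights.reshape(-1, weights.shape[-1])
    bmus = np.argmin(np.linalg.norm(flat[None, :, :] - frames[:, None, :],
                                    axis=-1), axis=1)
    return node_labels.reshape(-1)[bmus]
```

Nodes that receive no votes simply keep the default class; in a full system one would typically label them from their nearest labelled neighbour on the map.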

Language learning with a computer can be enhanced by creating personalised and grounded experiences through the use of ECAs and virtual worlds. A system was implemented that used an ECA for a video-dubbing language learning task, with the finding that learners expect such a system to provide an evaluation of their pronunciation. A novel framework for using ECAs for storytelling was presented that enables multimodal language learning. Finally, a system based on the KNPT for pronunciation learning through video games with ECAs was introduced.

The experiments and frameworks presented in this thesis demonstrate promising directions for the future of Computer Assisted Pronunciation Teaching (CAPT).

Keywords: Self-organising maps, Kohonen neural phonetic typewriter, mel-frequency cepstral coefficients, voice activity detection, embodied conversational agents, computer aided language learning, pronunciation

Subject: Computer Science thesis

Thesis type: Doctor of Philosophy
Completed: 2019
School: College of Science and Engineering
Supervisor: David Powers