Using Recurrent Neural Networks and Learner Corpus to Detect and Correct Preposition Errors by English Learners

Author: Juan Pablo Garcia Guerrero

Garcia Guerrero, Juan Pablo, 2018. Using Recurrent Neural Networks and Learner Corpus to Detect and Correct Preposition Errors by English Learners, Flinders University, College of Science and Engineering.

Terms of Use: This electronic version is (or will be) made publicly available by Flinders University in accordance with its open access policy for student theses. Copyright in this thesis remains with the author. You may use this material for uses permitted under the Copyright Act 1968. If you are the owner of any included third party copyright material and/or you believe that any material has been made available without permission of the copyright owner please contact copyright@flinders.edu.au with the details.

Abstract

In a globalised world, English has become a lingua franca used for trade, research and general communication, and millions of people from diverse language backgrounds are learning English as a second or foreign language. Second language acquisition is challenging, and on average it takes years before a person gains English proficiency. Research in this field is plentiful but not conclusive because of the large number of factors affecting the second language acquisition process. Many tools have been developed to support this process within the field of Computer-Assisted Language Learning, but traditional proofreading techniques suited to checking native writing are not suitable for checking learner writing. Consequently, recent research on the detection and correction of learner errors has emerged through shared tasks such as Helping Our Own (2011 and 2012) and CoNLL 2013 and 2014, which established baselines on which new research can be carried out. This study undertakes the task of detecting and correcting prepositional errors made by English learners using Recurrent Neural Networks, and in particular the Long Short-Term Memory (LSTM) architecture. Among Natural Language Processing techniques, this approach differs from previous work in that it uses non-linear classifiers rather than traditional linear classifiers such as Naïve Bayes, Maximum Entropy, Averaged Perceptron and N-grams. The data used to train, validate and test the algorithm was the National University of Singapore Corpus of Learner English, one of the largest learner corpora available for research; it is fully error-annotated, so data-driven techniques can be applied. Different elements and techniques were proposed, tested and analysed to determine their impact on the learner prepositional error correction task before being included in the main algorithm.
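To illustrate the classification framing described above (this is an illustrative sketch, not the thesis code, and the preposition set and function names are assumptions), each occurrence of a preposition in a learner sentence can be turned into one training example: the surrounding tokens, padded to a fixed length, are the features, and the correct preposition is the class label.

```python
# Hypothetical sketch: framing preposition correction as a
# fixed-class classification problem over context windows.

PREPOSITIONS = {"in", "on", "at", "of", "to", "for", "with", "by"}  # illustrative subset
PAD = "<pad>"

def extract_examples(tokens, window=3):
    """Yield (context, label) pairs for every preposition in a token list.

    The context is the `window` tokens on each side of the preposition,
    padded so every example has the same sequence length (2 * window).
    """
    examples = []
    for i, tok in enumerate(tokens):
        if tok not in PREPOSITIONS:
            continue
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = ([PAD] * (window - len(left)) + left
                   + right + [PAD] * (window - len(right)))
        examples.append((context, tok))
    return examples

# A learner sentence containing the preposition "in" (target class: "at").
examples = extract_examples("she is good in maths".split(), window=3)
```

A classifier such as an LSTM would then read each padded context and predict the most likely preposition; when the prediction differs from the one the learner wrote, a correction is proposed.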
On the Recurrent Neural Network side, these elements were the attention mechanism, regularisation through dropout, the bidirectional model, and deep recurrent neural networks. On the features and examples side, lexical features, part-of-speech tags, dependency parse indexes, sequence length and the number of classes (prepositions) were tested with different values. Moreover, embedding layers, following the skip-gram architecture, were created from the learner corpus to assess their influence on the task at hand by varying parameters such as the window size. After applying some pre-processing and post-processing techniques, the final algorithm yields promising results when compared with those of the CoNLL 2013 shared task.
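The effect of the skip-gram window size mentioned above can be made concrete with a small sketch (illustrative only; the function name is an assumption): skip-gram training pairs consist of each target word paired with every word within `window` positions of it, so varying `window` changes which co-occurrences the resulting embeddings capture.

```python
# Hypothetical sketch of skip-gram training-pair generation.
# Each word is paired with its neighbours within `window` positions;
# a larger window captures broader, more topical co-occurrences.

def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for skip-gram training."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

pairs = skipgram_pairs("errors in writing".split(), window=1)
```

In practice these pairs feed a shallow network whose learned input weights become the embedding layer supplied to the error-correction model.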

Keywords: Machine Learning, Recurrent Neural Networks, Learner Corpus, Preposition, Error Detection and Correction

Subject: Computer Science thesis

Thesis type: Masters
Completed: 2018
School: College of Science and Engineering
Supervisor: Professor David Powers