Part-of-Speech Bootstrapping Using Lexically-Specific Frames

Author: Richard Eduard Leibbrandt

Leibbrandt, Richard Eduard, 2009, Part-of-Speech Bootstrapping Using Lexically-Specific Frames, Flinders University, School of Computer Science, Engineering and Mathematics

Terms of Use: This electronic version is (or will be) made publicly available by Flinders University in accordance with its open access policy for student theses. Copyright in this thesis remains with the author. You may use this material for uses permitted under the Copyright Act 1968. If you are the owner of any included third party copyright material and/or you believe that any material has been made available without permission of the copyright owner please contact copyright@flinders.edu.au with the details.

Abstract

The work in this thesis presents and evaluates a number of strategies by which English-learning children might discover the major open-class parts-of-speech in English (nouns, verbs and adjectives) on the basis of purely distributional information. Previous work has shown that parts-of-speech can be readily induced from the distributional patterns in which words occur. The research reported in this thesis extends and improves on this previous work in two major ways, related to the constructional status of the utterance contexts used for distributional analysis, and to the way in which previous studies have dealt with categorial ambiguity.

Previous studies that have induced parts-of-speech from word distributions have done so on the basis of fixed “windows” of words that occur before and after the word in focus. These contexts are often not constructions of the language in question, and hence have dubious status as elements of linguistic knowledge. A great deal of recent evidence (e.g. Lieven, Pine & Baldwin, 1997; Tomasello, 1992) has suggested that children’s early language may be organized around a number of lexically-specific constructional frames with slots, such as “a X”, “you X it”, “draw X on X”. The work presented here investigates the possibility that constructions such as these may be a more appropriate domain for the distributional induction of parts-of-speech. This would open up the possibility of a treatment of part-of-speech induction that is more closely integrated with the acquisition of syntax.

Three strategies to discover lexically-specific frames in the speech input to children are presented. Two of these strategies are based on the interplay between more and less frequent words in English utterances: the more frequent words, which are typically function words or light verbs, are taken to provide the schematic “backbone” of an utterance. The third strategy is based around pairs of words in which the occurrence of one word is highly predictable from that of the other, but not vice versa; from these basic slot-filler relationships, larger frames are assembled. These techniques were implemented computationally and applied to a corpus of child-directed speech. Each technique yielded a large set of lexically-specific frames, many of which could plausibly be regarded as constructions. In a comparison with a manual analysis of the same corpus by Cameron-Faulkner, Lieven and Tomasello (2003), it is shown that most of the constructional frames identified in the manual analysis were also produced by the automatic techniques.

After the identification of potential constructional frames, parts-of-speech were formed from the patterns of co-occurrence of words in particular constructions, by means of hierarchical clustering. The resulting clusters are shown to be quite similar to the major English parts-of-speech of nouns, verbs and adjectives. Each individual word token was assigned a part-of-speech on the basis of its constructional context. This categorization was evaluated empirically against the part-of-speech assigned to the word in question in the original corpus. The resulting categorization is shown to be, to a great extent, in agreement with the manual categorization.

These strategies deal with the categorial ambiguity of words by allowing the frame context to determine part-of-speech. However, many of the frames produced were themselves ambiguous cues to part-of-speech. For this reason, strategies are presented to deal with both word and context ambiguity. Three such strategies are proposed. One considers membership of a part-of-speech to be a matter of degree for both word and contextual frame. A second strategy attempts to discretely assign multiple parts-of-speech to words and constructions in a way that imposes internal consistency in the corpus. The third strategy attempts to assign only the minimally-required multiple categories to words and constructions so as to provide a parsimonious description of the data. Each of these techniques was implemented and applied to the frames produced by each of the three frame discovery techniques, thereby providing category information about both the frame and the word. The subsequent assignment of parts-of-speech was done by combining word and frame information, and is shown to be far more accurate than the categorization based on frames alone.

This approach can be regarded as addressing certain objections against the distributional method that have been raised by Pinker (1979, 1984, 1987). Lastly, a framework for extending this research is outlined that allows semantic information to be incorporated into the process of category induction.
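
The frequency-based idea described in the abstract, in which frequent words form the schematic backbone of an utterance and less frequent words fill its slots, can be illustrated with a minimal sketch. This is not the procedure used in the thesis: the function names `discover_frames` and `word_profiles`, the parameters `top_k` and `min_frame_count`, the whitespace tokenization and the toy corpus are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

SLOT = "X"  # placeholder for an open slot, following the "a X" notation


def discover_frames(utterances, top_k=4, min_frame_count=2):
    """Illustrative sketch (assumed, not the thesis algorithm).

    The top_k most frequent words are kept as the frame "backbone";
    every other word is replaced by a slot. Returns a dict mapping
    each frame string to a Counter of the words that filled its slots.
    """
    tokenized = [u.lower().split() for u in utterances]
    freq = Counter(w for toks in tokenized for w in toks)
    frequent = {w for w, _ in freq.most_common(top_k)}

    frame_counts = Counter()
    fillers = defaultdict(Counter)
    for toks in tokenized:
        frame = tuple(w if w in frequent else SLOT for w in toks)
        # Skip utterances with no slot or with no lexical backbone.
        if SLOT not in frame or all(w == SLOT for w in frame):
            continue
        frame_counts[frame] += 1
        for w in toks:
            if w not in frequent:
                fillers[frame][w] += 1

    # Keep only frames that recur often enough to be worth considering.
    return {
        " ".join(frame): fillers[frame]
        for frame, count in frame_counts.items()
        if count >= min_frame_count
    }


def word_profiles(frames):
    """Map each slot-filling word to the set of frames it occurred in."""
    profiles = defaultdict(set)
    for frame, words in frames.items():
        for w in words:
            profiles[w].add(frame)
    return dict(profiles)


# Toy child-directed utterances (invented for this example).
corpus = [
    "you eat it", "you want it", "you draw it",
    "a dog", "a cat", "a ball",
    "the dog", "the ball", "the cat",
]
frames = discover_frames(corpus, top_k=4)
profiles = word_profiles(frames)
```

In this toy corpus, words that share frame profiles ("dog", "cat" and "ball" all fill "a X" and "the X") are the kind of candidates that a clustering step over word-by-frame co-occurrence could group together, while "eat", "want" and "draw" pattern separately in "you X it".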

Keywords: part-of-speech bootstrapping, word classes, language learning, lexically-specific frames, construction grammar, usage-based linguistics, POS tagging, developmental psychology, psycholinguistics

Subject: Computer Science thesis

Thesis type: Doctor of Philosophy
Completed: 2009
School: School of Computer Science, Engineering and Mathematics
Supervisor: Prof David MW Powers