Dissertation Details
Morphological Inference from Bitext for Resource-Poor Languages.
Keywords: Computational Linguistics; Morphological Inference; Resource-poor Languages; Social Sciences (General); Social Sciences; Linguistics
Szymanski, Terrence D.; Keshet, Ezra Russell
University of Michigan
Others: https://deepblue.lib.umich.edu/bitstream/handle/2027.42/93843/tdszyman_1.pdf?sequence=1&isAllowed=y
Switzerland | English
Source: The Illinois Digital Environment for Access to Learning and Scholarship
【 Abstract 】

The development of rich, multi-lingual corpora is essential for enabling new types of large-scale inquiry into the nature of language (Abney and Bird, 2010; Lewis and Xia, 2010). However, significant digital resources currently exist for only a handful of the world's languages. The present dissertation addresses this issue by introducing new techniques for creating rich corpora by enriching existing resources via automated processing.
As a way of leveraging existing resources, this dissertation describes an automated method for extracting bitext (text accompanied by a translation) from bilingual documents. Digitized copies of printed books are mined for foreign-language material, using statistical methods for language identification and word alignment to identify instances of English-foreign bitext. After parsing the English text and transferring this analysis via the word alignments, the foreign word tokens are tagged with English glosses and morphosyntactic features.
Tagged tokens such as these constitute the input to a new algorithm, presented in this dissertation, for performing morphology induction. Drawing on previous work on unsupervised morphology induction which uses the principle of minimum description length to drive the analysis (Goldsmith, 2001), the present algorithm uses a greedy hill-climbing search to minimize the size of a paradigm-based morphological description of the language. The algorithm simultaneously segments wordforms into their component morphemes and organizes stems and affixes into a paradigmatic structure. Because tagged tokens are used as input, the morphemes produced by this induction method are paired with meaningful morphosyntactic features, an improvement over algorithms for unsupervised morphology based on monolingual text, which treat morphemes purely as strings of letters.
Combined, these methods for collecting and analyzing bitext data offer a pathway for the automatic creation of richly-annotated corpora for resource-poor languages, requiring minimal amounts of data and minimal manual analysis.
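The greedy, description-length-driven search mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the dissertation's algorithm. The cost function (characters needed to list each distinct stem and suffix once), the single split point per word, the function names `description_length` and `greedy_segment`, and the toy English-like input with feature labels are all assumptions made for illustration only.

```python
from collections import defaultdict

def description_length(words, splits):
    """Toy cost (an assumption): characters needed to list each distinct
    stem and suffix once, plus one delimiter character per entry."""
    stems = {w[:k] for w, k in zip(words, splits)}
    suffixes = {w[k:] for w, k in zip(words, splits)}
    return sum(len(s) + 1 for s in stems) + sum(len(s) + 1 for s in suffixes)

def greedy_segment(tagged_words):
    """Hill-climb over one split point per word, keeping only moves
    that strictly reduce the toy description length."""
    words = [w for w, _ in tagged_words]
    splits = [len(w) for w in words]  # start with every word unsegmented
    best = description_length(words, splits)
    improved = True
    while improved:
        improved = False
        for i, w in enumerate(words):
            for k in range(1, len(w) + 1):
                trial = splits[:i] + [k] + splits[i + 1:]
                cost = description_length(words, trial)
                if cost < best:
                    splits, best, improved = trial, cost, True
    # Pair each induced suffix with the features of the tokens it came from.
    suffix_features = defaultdict(set)
    for (w, feat), k in zip(tagged_words, splits):
        suffix_features[w[k:]].add(feat)
    segments = [(w[:k], w[k:]) for w, k in zip(words, splits)]
    return segments, dict(suffix_features), best

# Hypothetical tagged tokens: (wordform, morphosyntactic feature).
toy = [("walk", "BASE"), ("walks", "3SG"), ("walked", "PAST"),
       ("talk", "BASE"), ("talks", "3SG"), ("talked", "PAST")]
segments, suffix_features, cost = greedy_segment(toy)
print(segments)
print(suffix_features)
```

Because the input tokens carry feature labels, each induced suffix ends up associated with a feature rather than being an uninterpreted string, which is the contrast with monolingual unsupervised induction that the abstract draws. A paradigm-based description as described in the abstract would presumably also group stems by the suffix sets they share, which this sketch omits.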

【 Preview 】
Attachment list
Files Size Format View
Morphological Inference from Bitext for Resource-Poor Languages. 1269KB PDF download
  Document metrics
  Downloads: 3  Views: 9