Thesis details
Non-native text analysis with Syntactic Diff, a general comparative text mining framework
Massung, Sean Alexander
Keywords: text mining; natural language processing; comparative text mining; non-native text analysis; non-native text mining; second language education; non-native English speakers
Others  :  https://www.ideals.illinois.edu/bitstream/handle/2142/78606/MASSUNG-THESIS-2015.pdf?sequence=1&isAllowed=y
United States|English
Source: The Illinois Digital Environment for Access to Learning and Scholarship
【 Abstract 】

Non-native speakers of English far outnumber native speakers; English is the main language of books, newspapers, airports, air-traffic control, international business, academic conferences, science, technology, diplomacy, sports, international competitions, pop music, and advertising [1]. Online education in the form of MOOCs (massive open online courses) is also primarily in English, even when the subject being taught is English. This creates enormous amounts of text written by non-native speakers, which in turn generates a need for grammar correction and analysis. Even aside from MOOCs, the number of English learners in Asia alone is in the tens of millions.

In response to this powerful motivation, we describe SYNTACTIC DIFF, a novel edit-based method for transforming sequences of words given a reference corpus. These transformations can be used directly or can be employed as features to represent text data in a wide variety of text mining scenarios. As case studies, we apply SYNTACTIC DIFF to four quite different tasks in non-native text analysis and show its benefit in each case.

In the first task, we use weighted word edits with likelihood scoring for grammatical error correction. Our method is compared against systems in a grammar correction shared task, and we find that SYNTACTIC DIFF edits perform comparably while being much more general than the other methods. The second task is native language identification: a classification problem predicting the native language of a student writer based on English essays. We represent documents as vectors of edits, and show that a combination of unigram words and SYNTACTIC DIFF edits outperforms each representation individually. The third task is fluency scoring, in which we see if the manually categorized fluency levels of English students can be modeled by SYNTACTIC DIFF features. In the fourth task, we create clusters of student essays with similar errors via topic modeling, and find that the interpretability is significantly higher than an n-gram words approach.

SYNTACTIC DIFF is highly customizable and able to capture syntactic differences from a reference corpus at the sentence, document, and subcorpus levels. This enables both a rich translation method and feature representation for many text mining tasks that deal with word usage and syntax beyond bag-of-words. In particular, this thesis focuses on non-native text analysis applications, though SYNTACTIC DIFF is not at all limited to that domain.
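
The idea of representing a document as a vector of edits can be illustrated with a minimal sketch. This is not the SYNTACTIC DIFF algorithm from the thesis: it assumes each learner sentence comes paired with a hand-supplied reference sentence (rather than a reference corpus), and it swaps in the generic sequence alignment from Python's standard `difflib` module. The function names `word_edits` and `edit_features` are hypothetical and chosen only for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher


def word_edits(source, reference):
    """Return the word-level edits that turn `source` into `reference`.

    Assumption: whitespace tokenization and a generic difflib alignment,
    standing in for the edit model described in the thesis.
    """
    src, ref = source.split(), reference.split()
    matcher = SequenceMatcher(a=src, b=ref, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged words produce no edit
        removed = " ".join(src[i1:i2])
        added = " ".join(ref[j1:j2])
        if tag == "delete":
            edits.append(f"remove({removed})")
        elif tag == "insert":
            edits.append(f"insert({added})")
        else:  # tag == "replace"
            edits.append(f"substitute({removed} -> {added})")
    return edits


def edit_features(pairs):
    """Count edits over (source, reference) pairs: a sparse edit vector."""
    counts = Counter()
    for source, reference in pairs:
        counts.update(word_edits(source, reference))
    return counts


if __name__ == "__main__":
    essay = [
        ("I have went to school", "I have gone to school"),
        ("She go to the school", "She goes to school"),
    ]
    for edit, count in sorted(edit_features(essay).items()):
        print(count, edit)
```

Counts like these could be concatenated with unigram word counts to form a combined document representation, in the spirit of the second task in the abstract; how the thesis actually weights and selects edits is not stated here.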

【 Preview 】
Attachment list
Files Size Format View
Non-native text analysis with Syntactic Diff, a general comparative text mining framework 266KB PDF download