期刊论文

【摘要】

Part Of Speech (POS) tagging forms the important preprocessing step in many of the natural language processing applications such as text summarization, question answering and information retrieval system. It is the process of classifying every word in a given context to its appropriate part of speech. Different POS tagging techniques in the literature have been developed and experimented. Currently, it is well known that some POS tagging models are not performing well on the Quranic Arabic due to the complexity of the Quranic Arabic text. This complexity presents several challenges for POS tagging such as high ambiguity, data sparseness and large existence of unknown words. With this in mind, the main problem here is to find out how existing and efficient methods perform in Arabic and how can Quranic corpus be utilized to produce an efficient framework for Arabic POS tagging. We propose a classifiers combination experimental framework for Arabic POS tagger, by selecting two best diverse probabilistic classifiers used in numerous works in non-Arabic language; namely K-Nearest Neighbour (KNN) and Naive Bayes (NB). The Majority voting is used here as the combination strategy to exploit classifiers advantages. In addition, an in-depth study has been conducted on a large list of features for exploiting effective features and investigating their role in enhancing the performance of POS taggers for the Quranic Arabic. Hence, this study aims to efficiently integrate different feature sets and tagging algorithms to synthesize more accurate POS tagging procedure. The data used in this study is the Arabic Quranic Corpus, an annotated linguistic resource consisting of 77,430 words with Arabic grammar, syntax and morphology for each word in the Holy Quran. The highest accuracy in the results achieved is 98.32%, which can be a significant enhancement for the state-of-the-art for Arabic Quranic text. The most effective features that yield this accuracy are a combination of w₀ (the current word), p₀ (POS of the current word), p_-3 (POS of three words before), p_-2 (POS of two words before) and p_-1 (POS of the word before).

【授权许可】

Unknown

【预览】

附件列表
Files	Size	Format	View
RO201911300783272ZK.pdf	113KB	PDF	download

Journal of Computer Science
ARABIC PART OF SPEECH TAGGING USING K-NEAREST NEIGHBOUR AND NAIVE BAYES CLASSIFIERS COMBINATION \| Science Publications

Omaia Al-Omari¹ Rund Mahafdah¹ Nazlia Omar¹
关键词: Part of Speech; Natural Language Processing; Classification;
DOI : 10.3844/jcssp.2014.1865.1873
学科分类：计算机科学（综合）
来源: Science Publications
PDF


	文献评价指标
	下载次数：1次	浏览次数：1次

【 摘 要 】

【 授权许可】

【 预 览 】

【摘要】

【授权许可】

【预览】