| Malaysian Journal of Computer Science | |
| Comparative Study of Feature Selection Approaches for Urdu Text Categorization | |
| Tehseen Zia, Muhammad Pervez Akhter, Qaiser Abbas | |
| Keywords: Text Categorization; Feature Selection; Urdu; Performance Evaluation; Test Collection | |
| DOI : | |
| Subject classification: Social Sciences, Humanities and Arts (General) | |
| Source: University of Malaya * Faculty of Computer Science and Information Technology | |
【 Abstract 】
This paper presents a comparative study of feature selection methods for Urdu text categorization. Five well-known feature selection methods were analyzed by means of six recognized classification algorithms: support vector machines (with linear, polynomial and radial basis kernels), naive Bayes, k-nearest neighbour (KNN), and decision tree (i.e. J48). Experiments were performed on two test collections: the standard EMILLE collection and a naive collection. We found that the information gain, chi-square statistic, and symmetrical uncertainty feature selection methods performed uniformly in most cases. We also found that no single feature selection technique is best for every classifier. That is, naive Bayes and J48 benefit more from gain ratio than from the other feature selection methods. Similarly, support vector machines (SVM) and KNN classifiers showed their top performance with information gain. Generally, linear SVM with any of the feature selection methods outperformed the other classifiers on the moderate-size naive collection. Conversely, naive Bayes with any of the feature selection techniques has an advantage over the other classifiers on the small-size EMILLE corpus.
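As a rough illustration of two of the feature selection criteria named in the abstract, the sketch below scores terms by information gain and gain ratio and keeps the top k. It is not the authors' implementation: the function names, the binary term-presence representation, and the toy English corpus are all assumptions made for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(docs, labels, term):
    """Reduction in class entropy from knowing whether `term` occurs in a document."""
    present = [y for d, y in zip(docs, labels) if term in d]
    absent = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (len(present) / n) * entropy(present) + (len(absent) / n) * entropy(absent)
    return entropy(labels) - conditional


def gain_ratio(docs, labels, term):
    """Information gain normalised by the entropy of the term's presence/absence split."""
    split = entropy([term in d for d in docs])
    return information_gain(docs, labels, term) / split if split > 0 else 0.0


def select_top_k(docs, labels, k, score=information_gain):
    """Rank every vocabulary term by `score` (ties broken alphabetically) and keep the k best."""
    vocabulary = set().union(*docs)
    ranked = sorted(vocabulary, key=lambda t: (-score(docs, labels, t), t))
    return ranked[:k]


# Hypothetical toy corpus (not from the paper): each document is a set of tokens.
docs = [
    {"cricket", "match", "team"},
    {"cricket", "score", "team"},
    {"election", "vote", "team"},
    {"election", "party", "vote"},
]
labels = ["sports", "sports", "politics", "politics"]

top_terms = select_top_k(docs, labels, k=2)
top_terms_gr = select_top_k(docs, labels, k=2, score=gain_ratio)
```

In a setup like the one the abstract describes, the selected terms would then define the feature vectors passed to each classifier, so the same ranking function can be swapped while the classifier is held fixed.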
【 License 】
Unknown