Journal article details
International Arab Journal of Information Technology (IAJIT)
Experimenting N-Grams in Text Categorization
Maryam Madani [1], Shadpour Mallakpour [2]
[1] Department of Chemistry, Isfahan University of Technology, Isfahan 84156-83111, I. R. Iran; Nanotechnology and Advanced Materials Institute, Isfahan University of Technology, Isfahan 84156-83111, I. R. Iran
Keywords: text categorization; n-grams; multivariate chi-square; cosine measure; Reuters-21578; 20 Newsgroups
Subject classification: Computer science (general)
Source: Zarqa University
【 Abstract 】

This paper addresses the automatic supervised classification of documents. The proposed approach represents each document as a vector built not from words but from character n-grams, for varying n. Its effects are examined in several experiments that use the multivariate chi-square statistic to reduce dimensionality, the cosine and Kullback-Leibler distances, and two benchmark corpora for evaluation: the Reuters-21578 newswire articles and the 20 Newsgroups data. Performance is measured with the macro-averaged F1 function. The results show the effectiveness of this approach compared with the bag-of-words and stem representations.

Received April 5, 2006; accepted June 1, 2006
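
To make the pipeline named in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes one summed n-gram count profile per category and assigns a document to the category whose profile has the highest cosine similarity. The chi-square feature selection step and the Kullback-Leibler distance are omitted for brevity, and all function names and the toy training texts are hypothetical.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n_values=(2, 3)):
    """Count overlapping character n-grams for each n in n_values."""
    text = " ".join(text.lower().split())  # normalise case and whitespace
    counts = Counter()
    for n in n_values:
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def train_profiles(labelled_docs, n_values=(2, 3)):
    """Sum the n-gram counts of all training documents per category
    (an assumed, simplified stand-in for the paper's training step)."""
    profiles = {}
    for text, label in labelled_docs:
        profiles.setdefault(label, Counter()).update(char_ngrams(text, n_values))
    return profiles


def classify(text, profiles, n_values=(2, 3)):
    """Return the category whose profile is most similar to the document."""
    vec = char_ngrams(text, n_values)
    return max(profiles, key=lambda label: cosine(vec, profiles[label]))


def macro_f1(gold, predicted):
    """Unweighted mean of the per-category F1 scores."""
    labels = set(gold) | set(predicted)
    scores = []
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data, for illustration only.
profiles = train_profiles([
    ("stock market shares rise", "finance"),
    ("football match goal score", "sport"),
])
prediction = classify("shares in the market", profiles)
```

Because the features are character sequences and not words, the representation needs no tokenizer, stop list, or stemmer, which is the property the abstract contrasts with the bag-of-words and stem representations.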

【 License 】

Unknown   
