Conference Paper Details
The 11th International Rivers Symposium
The Class Imbalance Problem in Author Identification
Ecological and Environmental Science
Efstathios Stamatatos
Others  :  http://CEUR-WS.org/Vol-276/paper1.pdf
PID  :  1102
Subject classification: Environmental Science (General)
Source: CEUR
【 Abstract 】

Author identification can be viewed as a single-label, multi-class text categorization problem. Very often, extremely few training texts are available for at least some of the candidate authors, or the length of the available training texts varies significantly across candidate authors. Moreover, in this task the distribution of training texts over the classes usually bears no similarity to that of the test texts; that is, a basic assumption of inductive learning does not hold. Previous work [3] provided solutions to this problem for instance-based author identification approaches (i.e., each training text is treated as a separate training instance). This work [4] deals with the class imbalance problem in profile-based author identification approaches (i.e., a single profile is extracted from all the training texts of each author). In particular, it presents a variation of the Common N-Grams (CNG) method, a language-independent profile-based approach [2] that has performed well in many author identification experiments so far [1], based on new distance measures that are quite stable for large profile lengths. Special emphasis is given to the degree to which the effectiveness of the method is affected by the number of training text samples available per author. Experiments on same-topic text samples from the Reuters Corpus Volume 1 are presented using both balanced and imbalanced training corpora. The results show that CNG with the proposed distance measures is more accurate when only limited training text samples are available for at least some of the candidate authors, a realistic condition in author identification problems.
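
To make the profile-based setting concrete, the following is a minimal sketch of a CNG-style classifier. It is an illustration under stated assumptions, not the method of the paper: the abstract does not define the proposed distance measures, so the code uses the relative-difference dissimilarity commonly associated with the original CNG method, summed over the union of the two profiles. The function names (`ngram_profile`, `cng_distance`, `attribute`), the character trigram setting, and the profile size of 1000 are all hypothetical choices made for this example.

```python
from collections import Counter


def ngram_profile(text, n=3, size=1000):
    """Return the `size` most frequent character n-grams of `text`,
    mapped to their normalized frequencies (assumed profile definition)."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.most_common(size)}


def cng_distance(p1, p2):
    """Relative-difference dissimilarity over the union of two profiles.

    An n-gram missing from one profile counts with frequency 0 there,
    so it contributes the maximum value of 4 to the sum.
    """
    dist = 0.0
    for g in set(p1) | set(p2):
        f1, f2 = p1.get(g, 0.0), p2.get(g, 0.0)
        dist += (2.0 * (f1 - f2) / (f1 + f2)) ** 2
    return dist


def attribute(test_text, training_texts, n=3, size=1000):
    """Assign `test_text` to the author with the least dissimilar profile.

    `training_texts` maps each author to a list of texts; all texts of an
    author are concatenated and a single profile is built from the result.
    """
    test_profile = ngram_profile(test_text, n, size)
    profiles = {
        author: ngram_profile(" ".join(texts), n, size)
        for author, texts in training_texts.items()
    }
    return min(profiles, key=lambda a: cng_distance(test_profile, profiles[a]))


# Toy usage with invented data (not from the Reuters corpus).
training = {
    "alice": ["the cat sat on the mat", "the cat ate the rat"],
    "bob": ["stock prices rose sharply today", "markets fell as prices dropped"],
}
predicted = attribute("the cat sat on the rat", training)
```

Because each author is represented by one profile built from all of that author's texts, the amount of training text per author directly shapes how many n-grams the profile contains, which is where imbalance between authors can enter the comparison.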

【 Preview 】
Attachment list
Files Size Format View
The Class Imbalance Problem in Author Identification 13KB PDF download