Technical Report Details
Extremely Fast Text Feature Extraction for Classification and Indexing
Forman, George ; Kirshenbaum, Evan
HP Development Company
Keywords: text mining; text indexing; bag-of-words; feature engineering; feature extraction; document categorization; text tokenization
RP-ID: HPL-2008-91R1
Subject classification: Computer Science (General)
United States | English
Source: HP Labs
Abstract
Most research in speeding up text mining involves algorithmic improvements to induction algorithms, and yet for many large scale applications, such as classifying or indexing large document repositories, the time spent extracting word features from texts can itself greatly exceed the initial training time. This paper describes a fast method for text feature extraction that folds together Unicode conversion, forced lowercasing, word boundary detection, and string hash computation. We show empirically that our integer hash features result in classifiers with equivalent statistical performance to those built using string word features, but require far less computation and less memory.
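To illustrate the idea of folding lowercasing, word boundary detection, and hashing into one pass, here is a minimal Python sketch. It is not the authors' implementation: the function name `hash_features`, the multiply-by-31 rolling hash, the 32-bit mask, and the `num_buckets` parameter are all assumptions made for illustration, and the sketch starts from an already decoded string, whereas the report also folds the Unicode conversion step into the same pass.

```python
def hash_features(text, num_buckets=2**20):
    """Return {bucket: count} for the words in text, using one scan.

    Illustrative assumptions (not taken from the report): a word is a
    maximal run of alphanumeric characters, and the hash is a simple
    multiply-by-31 rolling hash kept to 32 bits.
    """
    features = {}
    h = 0
    in_word = False
    for ch in text:
        if ch.isalnum():
            # Lowercase while hashing; no word substring is ever built.
            # lower() can yield more than one character for some code points.
            for c in ch.lower():
                h = (h * 31 + ord(c)) & 0xFFFFFFFF
            in_word = True
        elif in_word:
            # A non-word character ends the current word: emit its feature.
            bucket = h % num_buckets
            features[bucket] = features.get(bucket, 0) + 1
            h = 0
            in_word = False
    if in_word:
        # Emit the last word if the text ends without a delimiter.
        bucket = h % num_buckets
        features[bucket] = features.get(bucket, 0) + 1
    return features


if __name__ == "__main__":
    counts = hash_features("Hello, hello WORLD")
    print(sorted(counts.values()))
```

The result is a bag of integer features keyed by hash bucket rather than by word string, which is the kind of representation the abstract compares against string word features. Distinct words can collide in the same bucket; the table size chosen here is arbitrary.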