Technical Report Details
Towards Combining Web Classification and Web Information Extraction: a Case
Luo, Ping ; Lin, Fen ; Xiong, Yuhong ; Zhao, Yong ; Shi, Zhongzhi
HP Development Company
Keywords: Classification; Information extraction; Graphical model
RP-ID: HPL-2009-86
Subject classification: Computer science (general)
United States | English
Source: HP Labs
【 Abstract 】

Web content analysis often proceeds in two sequential, separate steps: Web classification, which identifies the target Web pages, and Web information extraction, which extracts the metadata contained in those pages. This decoupled strategy is highly ineffective because errors in Web classification propagate to Web information extraction and eventually accumulate to a high level. In this paper we study the mutual dependencies between these two steps and propose combining them in a model based on Conditional Random Fields (CRFs). This model can simultaneously recognize the target Web pages and extract the corresponding metadata. Systematic experiments in OfCourse, our project for online course search, show that this model significantly improves the F1 value for both steps. We believe that our model can be easily generalized to many Web applications.
