Technical Report Details
Detection and Analysis of Table of Contents Based on Content Association
Lin, Xiaofan ; Xiong, Yan
HP Development Company
Keywords: table of contents; document structure analysis; table recognition; optical character recognition; algorithm combination
RP-ID  :  HPL-2005-105
Subject classification: Computer Science (General)
United States | English
Source: HP Labs
【 Abstract 】

As a special type of table understanding, the detection and analysis of tables of contents (TOCs) play an important role in the digitization of multi-page documents. Most previous TOC analysis methods only concentrate on the TOC itself without taking into account the other pages in the same document. Besides, they often require manual coding or at least machine learning of document-specific models. This paper introduces a new method to detect and analyze TOCs based on content association. It fully leverages the text information throughout the whole multi-page document and can be directly applied to a wide range of documents without the need to build or learn the models for individual documents. In addition, the associations of general text and page numbers are combined to make the TOC analysis more accurate. Natural language processing and layout analysis are integrated to improve the TOC functional tagging. The applications of the proposed method in a large-scale digital library project are also discussed.

Notes: To be published in the International Journal on Document Analysis and Recognition, 2005. 21 pages.
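
The content-association idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the function names (`parse_toc_line`, `associate`), the token-overlap measure, and the weights `text_weight` and `number_weight` are all assumptions chosen for illustration, and the sample pages are toy data. The sketch scores each candidate TOC line against every page by combining a text association (do the title words appear near the top of the page?) with a page-number association (does the number on the TOC line match the page?).

```python
import re


def _tokens(text):
    """Lowercase alphanumeric tokens; a crude stand-in for real text normalization."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def parse_toc_line(line):
    """Split 'Title ..... 12' into (title, page number); number is None if absent."""
    m = re.match(r"^(.*?)[\s.]*(\d+)\s*$", line)
    if m and m.group(1).strip():
        return m.group(1).strip(), int(m.group(2))
    return line.strip(), None


def associate(toc_lines, pages, text_weight=0.7, number_weight=0.3):
    """Link each candidate TOC line to its best-matching page.

    pages maps a page number to that page's text. The weights are
    hypothetical, not values taken from the report.
    Returns a list of (title, best_page, score) tuples.
    """
    results = []
    for line in toc_lines:
        title, number = parse_toc_line(line)
        title_tokens = _tokens(title)
        best_page, best_score = None, 0.0
        for page_no, text in pages.items():
            # Text association: share of title tokens found in the page's first lines.
            head = " ".join(text.splitlines()[:3])
            if title_tokens:
                overlap = len(title_tokens & _tokens(head)) / len(title_tokens)
            else:
                overlap = 0.0
            # Page-number association: the TOC entry's number agrees with this page.
            number_hit = 1.0 if number == page_no else 0.0
            score = text_weight * overlap + number_weight * number_hit
            if score > best_score:
                best_page, best_score = page_no, score
        results.append((title, best_page, best_score))
    return results


# Toy data, invented for this example only.
toc = ["Introduction ..... 1", "Related Work ..... 3", "Experiments ..... 7"]
pages = {
    1: "Introduction\nDigitizing books requires structure.",
    3: "Related Work\nEarlier methods looked only at the TOC page.",
    7: "Experiments\nWe evaluate on scanned books.",
    8: "The experiments continue here.",
}
for title, page, score in associate(toc, pages):
    print(title, page, round(score, 2))
```

In this toy setup, page 8 also mentions "experiments", so the text association alone cannot separate it from page 7; the page-number term breaks the tie. A page whose lines mostly receive high association scores would be a plausible TOC candidate, which is one way the same scores could support detection as well as analysis.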
