International Conference on Computing and Applied Informatics 2016 | |
Table Extraction from Web Pages Using Conditional Random Fields to Extract Toponym Related Data | |
Physics; Computer Science | |
Luthfi Hanifah, Hayyu'^1 ; Akbar, Saiful^1 | |
School of Electrical Engineering and Informatics, Bandung Institute of Technology, Indonesia^1 | |
Keywords: Conditional random field; gazetteer; Geographic information retrieval (GIR); Information retrieval research; Rule-based approach; toponym; Web tables | |
Others : https://iopscience.iop.org/article/10.1088/1742-6596/801/1/012064/pdf DOI : 10.1088/1742-6596/801/1/012064 |
Subject classification: Computer Science (General) | |
Source: IOP | |
【 Abstract 】
Tables are one of the ways to present information on web pages. The abundance of web pages on the World Wide Web has motivated research in information extraction and information retrieval, including table extraction. There is also a need for systems designed specifically to handle location-related information. Against this background, this research provides a way to extract location-related data from web tables so that the data can be used in developing a Geographic Information Retrieval (GIR) system. Location-related data are identified by toponyms (location names). A rule-based approach with a gazetteer is used to recognize toponyms in web tables, while a combination of rule-based and statistical approaches is used to extract data from a table. In the statistical approach, a Conditional Random Fields (CRF) model is used to understand the schema of the table. The table extraction result is presented in JSON format. If a web table contains toponyms, a field is added to the JSON document to store the toponym values. This field can be used to index the table data by toponym, which in turn supports the development of a GIR system.
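As a rough illustration of the output stage described in the abstract, the sketch below matches table cells against a gazetteer and adds a toponym field to a JSON document. It is not the authors' implementation: the three-entry gazetteer, the function names, the `records` and `toponyms` field names, and the sample visitor counts are all assumptions made for this example, and the header row is simply given rather than inferred by a CRF model.

```python
import json

# Hypothetical miniature gazetteer; a real one would hold far more place names.
GAZETTEER = {"bandung", "jakarta", "surabaya"}


def find_toponyms(rows):
    """Return cell values that match the gazetteer, in first-seen order."""
    found = []
    for row in rows:
        for cell in row:
            name = cell.strip()
            if name.lower() in GAZETTEER and name not in found:
                found.append(name)
    return found


def table_to_document(header, rows):
    """Build a JSON-ready dict; the header row is assumed to be known already."""
    document = {"records": [dict(zip(header, row)) for row in rows]}
    toponyms = find_toponyms(rows)
    if toponyms:
        # The field name "toponyms" is an assumption made for this sketch.
        document["toponyms"] = toponyms
    return document


# Made-up sample table; the visitor counts are placeholders, not real data.
header = ["City", "Visitors"]
rows = [["Bandung", "120"], ["Jakarta", "340"]]
document = table_to_document(header, rows)
print(json.dumps(document, indent=2))
```

A table with no gazetteer matches yields a document without the `toponyms` field, so an indexer could use the presence of that field to decide whether a table is location-related.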