For the TREC 2009 Chemical IR Track, we explore the development of a distributed information retrieval system based on a dimensional data model. The indexing model supports named entity identification and aggregation of term statistics at multiple levels of patent structure, including individual words, sentences, claims, descriptions, abstracts, and titles. The system was deployed across 15 Amazon Web Services (AWS) Elastic Compute Cloud (EC2) instances and 15 Elastic Block Store (EBS) database shards to support efficient indexing and query processing of the relatively large index generated from indexing each individual word (sans stop words) in the 100 GB+ collection of chemical patent documents. The query processing algorithm for technology survey search and prior art search uses information extraction techniques and locally aggregated term statistics to help disambiguate candidate entities and terms in context. Query processing for prior art search automatically generates a structured query based on the relative distinctiveness of individual terms and candidate entity phrases from the query patent's claims, abstract, and title sections. For both the technology survey and prior art search, we evaluated several probabilistic retrieval functions for integrating statistics of retrieved named entities with term statistics at multiple levels of document structure to identify relevant
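The prior art query generation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that "relative distinctiveness" can be approximated by a smoothed inverse document frequency, and the stop-word list, the section weights, and the names `tokenize`, `idf`, and `build_query` are all hypothetical choices made for the example.

```python
import math
from collections import Counter

# Hypothetical stop-word list; the abstract only says stop words were removed.
STOP_WORDS = {"a", "an", "the", "of", "and", "or", "in", "to", "for", "is", "with"}

# Hypothetical per-section weights; these values are not taken from the paper.
SECTION_WEIGHTS = {"title": 3.0, "abstract": 2.0, "claims": 1.0}


def tokenize(text):
    """Lowercase, split on whitespace, strip simple punctuation, drop stop words."""
    tokens = (t.strip(".,;:()") for t in text.lower().split())
    return [t for t in tokens if t and t not in STOP_WORDS]


def idf(term, doc_freq, num_docs):
    """Smoothed inverse document frequency, an assumed stand-in for distinctiveness."""
    return math.log((num_docs + 1) / (doc_freq.get(term, 0) + 1)) + 1.0


def build_query(patent, doc_freq, num_docs, top_k=5):
    """Return the top_k (term, weight) pairs drawn from the query patent's sections."""
    scores = Counter()
    for section, weight in SECTION_WEIGHTS.items():
        term_counts = Counter(tokenize(patent.get(section, "")))
        for term, tf in term_counts.items():
            scores[term] += weight * tf * idf(term, doc_freq, num_docs)
    # Sort by descending weight, breaking ties alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]
```

Given a query patent as a dictionary with `title`, `abstract`, and `claims` text plus collection-wide document frequencies, `build_query` returns weighted terms that could be passed to a retrieval function as a structured query. Rare terms that recur across sections rank highest, which is the intuition behind weighting by distinctiveness.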
TREC Chemical IR Track 2009: A Distributed Dimensional Indexing Model for Chemical Patent Search