Integrating database (DB) and information retrieval (IR) technologies has long been an important problem. Recent growth in tagged textual data, which is a combination of structured knowledge bases and unstructured text data, has made principled index design for IR applications even more desirable. In this work, we propose a framework for designing IR applications using DB concepts. We model an IR application as a specialized DB system that is required to support only a fixed predefined query workload rather than general SQL queries. The physical design of an information retrieval system is optimized for a prespecified target query workload. We then identify IR indexes as basic building blocks of the design of an IR application. An IR index is formalized as a look-up function built over a materialized view. The overall ”index design problem” is then to find an index design with lowest cost from a space of candidate designs. Since the space of candidate designs can be doubly exponential in the size of query workload, we give polynomial time heuristic algorithm with provable guarantees for a weighted set cover based relaxation of the original problem. This is then combined with an overarching branch-and-bound based optimality preserving state-space search algorithm that efficiently prunes the state-space by using the above heuristic algorithm. Finally, we show via experiments how the proposed framework can be used to obtain index designs for different kinds IR applications in practice.
【 预 览 】
附件列表
Files
Size
Format
View
Index design for information retrieval applications using database concepts