学位论文

【摘要】

Searching pairs of similar data records is an operation required for many data mining techniques like clustering and collaborative filtering. With emergence of the Web, scale of the data has increased to several millions or billions of records. Business and scientific applications like search engines, digital libraries, and systems biology often deal with massive data sets in a high dimensional space.The overarching goal of this thesis is to enable fast and incremental similarity search over large high dimensional data sets throughimproved indexing, systematic heuristic optimizations, and scalable parallelization.In Task 1, we design a sequential algorithm for all pairs similarity search (APSS) that involves finding all pairs of records having similarity above a specified threshold. Our proposed fast matching technique speeds-up APSS computation by using novel tighter bounds for similarity computation and indexing data structure. It offers the fastest solution known to-date with up to 6X speed-up over the state-of-the-art existing APSS algorithm.In Task 2, we address the incremental formulation of APSS problem, where APSS is performed multiple times over a given data set while varying the similarity threshold. Our goal is to avoid redundant computations across multiple invocations of APSS by storing history of computation during each APSS. Depending on the similarity threshold variation, our proposed history binning and index splitting techniques achieve speed-ups from 2X to over 100000X over the state-of-the-art APSS algorithm. To the best of our knowledge, this is the first work that addresses this problem.In Task 3, we design scalable parallel algorithms for APSS that take advantage of modern multi-processor, multi-core architectures to further scale-up the APSS computation. Our proposed index sharing technique divides the APSS computation into independent tasks and achieves ideal strong scaling behavior on shared memory architectures.We also propose a complementary incremental index sharing technique, which provides a memory-efficient parallel APSS solution while maintaining almost linear speed-up. Performance of our parallel APSS algorithms remains consistent for datasets of various sizes. To the best of our knowledge, this is the first work that explores parallelization for APSS.We demonstrate the effectiveness of our techniques using four real-world million record data sets.

【预览】

附件列表
Files	Size	Format	View
Fast, Incremental, and Scalable All Pairs Similarity Search	2019KB	PDF	download


Fast, Incremental, and Scalable All Pairs Similarity Search
similarity search;parallel algorithms;data mining;inverted index
Awekar, Amit Chintamani ; Anatoli V. Melechko, Committee Co-Chair,Christopher G. Healey, Committee Member,Kemafor Anyanwu, Committee Member,Nagiza F. Samatova, Committee Chair,Awekar, Amit Chintamani ; Anatoli V. Melechko ; Committee Co-Chair ; Christopher G. Healey ; Committee Member ; Kemafor Anyanwu ; Committee Member ; Nagiza F. Samatova ; Committee Chair
University:North Carolina State University
关键词: similarity search; parallel algorithms; data mining; inverted index;
Others : https://repository.lib.ncsu.edu/bitstream/handle/1840.16/4205/etd.pdf?sequence=1&isAllowed=y
美国\|英语
来源: null
PDF


	文献评价指标
	下载次数：42次	浏览次数：31次

【 摘 要 】

【 预 览 】

【摘要】

【预览】