Thesis Details
Performance of IR Models on Duplicate Bug Report Detection: A Comparative Study
Kaushik, Nilam
University of Waterloo
Keywords: duplicate; bug; Electrical and Computer Engineering
Others  :  https://uwspace.uwaterloo.ca/bitstream/10012/6439/1/Kaushik_Nilam.pdf
Switzerland | English
Source: UWSPACE Waterloo Institutional Repository
【 Abstract 】

Open source projects incorporate bug triagers to help with the task of bug report assignment to developers. One of the tasks of a triager is to identify whether an incoming bug report is a duplicate of a pre-existing report. In order to detect duplicate bug reports, a triager relies either on memory and experience or on the search capabilities of the bug repository. Both these approaches can be time consuming for the triager and may also lead to the misidentification of duplicates. It has also been suggested that duplicate bug reports are not necessarily harmful; instead, they can complement each other to provide additional information for developers to investigate the defect at hand. This motivates the need for automated or semi-automated techniques for duplicate bug detection.

In the literature, two main approaches have been proposed to solve this problem. The first approach is to prevent duplicate reports from reaching developers by automatically filtering them, while the second approach provides the triager with a list of top-N similar bug reports, allowing the triager to compare the incoming bug report with the ones provided in the list. Previous works have tried to enhance the quality of the suggested lists, but the approaches either suffered from a poor Recall Rate or incurred additional runtime overhead, making the deployment of a retrieval system impractical. To the best of our knowledge, there has been little work on exhaustively comparing the performance of different Information Retrieval Models (especially using more recent techniques such as topic modeling) on this problem and on understanding the effectiveness of different heuristics across various application domains.

In this thesis, we compare the performance of word based models (derivatives of the Vector Space Model) such as TF-IDF and Log-Entropy with that of topic based models such as Latent Semantic Indexing (LSI), Latent Dirichlet Allocation (LDA) and Random Indexing (RI). We leverage heuristics that incorporate exception stack frames, surface features, summary and long description from the free-form text in the bug report. We perform experiments on subsets of bug reports from Eclipse and Firefox and achieve a recall rate of 60% and 58% respectively. We find that word based models, in particular a Log-Entropy based weighting scheme, outperform topic based ones such as LSI and LDA.

Using historical bug data from Eclipse and NetBeans, we determine the optimal time frame for a desired level of duplicate bug report coverage. We realize an Online Duplicate Detection Framework that uses a sliding window of a constant time frame as a first step towards simulating incoming bug reports and recommending duplicates to the end user.
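
As a rough illustration of the word based approach the abstract describes, the sketch below ranks stored bug reports against an incoming one using Log-Entropy weighting and cosine similarity, then computes a recall rate over top-N lists. It is not the thesis's implementation: the function names, the whitespace tokenizer, and the recall-rate definition (fraction of labelled duplicates whose master report appears in the top-N list) are assumptions made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer (assumption); real pipelines usually stem and drop stop words.
    return text.lower().split()


def fit_global_weights(docs):
    # Log-Entropy global weight: g_t = 1 + sum_j p_tj * log(p_tj) / log(n),
    # where p_tj is the share of term t's total occurrences that fall in document j.
    n = len(docs)
    counts = [Counter(tokenize(d)) for d in docs]
    totals = Counter()
    for c in counts:
        totals.update(c)
    weights = {}
    for term, total in totals.items():
        entropy = sum((c[term] / total) * math.log(c[term] / total)
                      for c in counts if c[term])
        weights[term] = (1.0 + entropy / math.log(n)) if n > 1 else 1.0
    return weights


def vectorize(text, weights):
    # Local weight log(1 + tf) times the global weight; unseen terms are ignored.
    tf = Counter(tokenize(text))
    return {t: math.log(1 + f) * weights[t] for t, f in tf.items() if t in weights}


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_n(query, docs, weights, n):
    # Indices of the n stored reports most similar to the incoming report.
    q = vectorize(query, weights)
    scores = [cosine(q, vectorize(d, weights)) for d in docs]
    return sorted(range(len(docs)), key=lambda i: -scores[i])[:n]


def recall_rate(labelled, docs, weights, n):
    # labelled: list of (incoming_report_text, index_of_master_report) pairs.
    hits = sum(1 for query, master in labelled
               if master in top_n(query, docs, weights, n))
    return hits / len(labelled)


reports = [
    "null pointer exception when opening editor",
    "toolbar icon blurry on high dpi display",
    "build fails with missing plugin dependency",
]
weights = fit_global_weights(reports)
suggestions = top_n("editor throws null pointer exception", reports, weights, n=1)
```

A term spread evenly across every report receives a global weight of zero under this scheme, so it contributes nothing to the ranking, whereas a term concentrated in a single report keeps a weight of one.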

【 Preview 】
Attachments
Files | Size | Format | View
Performance of IR Models on Duplicate Bug Report Detection: A Comparative Study | 5378 KB | PDF | download
  Document metrics
  Downloads: 16   Views: 25