Thesis details
News vertical search using user-generated content
QA75 Electronic computers. Computer science
McCreadie, Richard ; Ounis, Iadh
University: University of Glasgow
Department: School of Computing Science
Keywords: News Vertical Search, Real-time Search, Web Search, Social Media, User-generated content, Event Identification, Query Classification, Federated Search, Crowdsourcing
Others  :  http://theses.gla.ac.uk/3813/1/2012mccreadiephd.pdf
Source: University of Glasgow
【 Abstract 】
The thesis investigates how content produced by end-users on the World Wide Web, referred to as user-generated content, can enhance the news vertical aspect of a universal Web search engine, such that news-related queries can be satisfied more accurately, comprehensively and in a more timely manner. We propose a news search framework to describe the news vertical aspect of a universal Web search engine. This framework comprises four components, each providing a different piece of functionality. The Top Events Identification component identifies the most important events that are happening at any given moment using discussion in user-generated content streams. The News Query Classification component classifies incoming queries as news-related or not in real-time. The Ranking News-Related Content component finds and ranks relevant content for news-related user queries from multiple streams of news and user-generated content. Finally, the News-Related Content Integration component merges the previously ranked content for the user query into the Web search ranking. In this thesis, we argue that user-generated content can be leveraged in one or more of these components to better satisfy news-related user queries. Potential enhancements include the faster identification of news queries relating to breaking news events, more accurate classification of news-related queries, increased coverage of the events searched for by the user or increased freshness in the results returned.

Approaches to tackle each of the four components of the news search framework are proposed, which aim to leverage user-generated content. Together, these approaches form the news vertical component of a universal Web search engine. Each approach proposed for a component is thoroughly evaluated using one or more datasets developed for that component. Conclusions are derived concerning whether the use of user-generated content enhances the component in question using an appropriate measure, namely: effectiveness when ranking events by their current importance/newsworthiness for the Top Events Identification component; classification accuracy over different types of query for the News Query Classification component; relevance of the documents returned for the Ranking News-Related Content component; and end-user preference for rankings integrating user-generated content in comparison to the unaltered Web search ranking for the News-Related Content Integration component. Analysis of the proposed approaches themselves, the effective settings for the deployment of those approaches and insights into their behaviour are also discussed.

In particular, the evaluation of the Top Events Identification component examines how effectively events, represented by newswire articles, can be ranked by their importance using two different streams of user-generated content, namely blog posts and Twitter tweets. Evaluation of the proposed approaches for this component indicates that blog posts are an effective source of evidence to use when ranking events and that these approaches achieve state-of-the-art effectiveness. The same approaches, when driven instead by a stream of tweets, provide a story ranking performance that is significantly more effective than random, but that is not consistent across all of the datasets and approaches tested. Insights are provided into the reasons for this with regard to the transient nature of discussion in Twitter.

Through the evaluation of the News Query Classification component, we show that the use of timely features extracted from different news and user-generated content sources can increase the accuracy of news query classification over relying upon newswire provider streams alone. Evidence also suggests that the usefulness of the user-generated content sources varies as news events mature, with some sources becoming more influential over time as new content is published, leading to an upward trend in classification accuracy.

The Ranking News-Related Content component evaluation investigates how to effectively rank content from the blogosphere and Twitter for news-related user queries. Of the approaches tested, we show that learning to rank approaches using features specific to blog posts/tweets lead to state-of-the-art ranking effectiveness under real-time constraints.

Finally, this thesis demonstrates that the majority of end-users prefer rankings integrated with user-generated content for news-related queries to rankings containing only Web search results or integrated with only newswire articles. Of the user-generated content sources tested, the most popular source is shown to be Twitter, particularly for queries relating to breaking events.

The central contributions of this thesis are the introduction of a news search framework, the approaches to tackle each of the four components of the framework that integrate user-generated content and their subsequent evaluation in a simulated real-time setting. This thesis draws insights from a broad range of experiments spanning the entire search process for news-related queries. The experiments reported in this thesis demonstrate the potential and scope for enhancements that can be brought about by the leverage of user-generated content for real-time news search and related applications.
Attachments
Files Size Format View
News vertical search using user-generated content 6025KB PDF download