Technical Report Details
Enriching the Twitter Stream: Increasing Data Mining Yield and Quality Using Machine Learning
Albayrak, Arif ; Teng, William ; Corcoran, John ; Wang, Sky C ; Maksumov, Daniel ; Loeser, Carlee ; Pham, Long
Keywords: ALGORITHMS; CLASSIFICATIONS; DATA FLOW ANALYSIS; EXTRACTION; IMAGE CLASSIFICATION; INFORMATION SYSTEMS; MACHINE LEARNING; METEOROLOGICAL RADAR; PRECIPITATION (METEOROLOGY); BAYES THEOREM; COST ANALYSIS; DATA MINING; EARTH SCIENCES; MACHINE TRANSLATION; MATHEMATICAL MODELS; PRECIPITATION MEASUREMENT; REAL TIME OPERATION; STORMS (METEOROLOGY)
RP-ID: NH43B-2988, GSFC-E-DAA-TN63898
Subject classification: Earth Sciences (General)
United States | English
Source: NASA Technical Reports Server
【 Abstract 】

Social media data streams are important sources of real-time and historical global information for science applications. At the NASA Goddard Earth Sciences Data and Information Services Center (GES DISC), we are exploring the Twitter data stream for its potential in augmenting the validation program of NASA Earth science missions, specifically the Global Precipitation Measurement (GPM) mission. We have implemented a tweet processing infrastructure that outputs classified precipitation tweets. Inputs are "passive" tweets, along with a smaller number of tweets from "active" participants, i.e., those knowingly contributing to our effort. The "active" tweets, presumably of higher quality, enrich the Twitter stream. "Active" sources include data scraped from other social media (e.g., public Facebook posts) and data from existing crowdsourcing programs (e.g., mPING reports). In addition, there is likely relevant precipitation information in images and documents that are the end points of links often included in tweets. Information derived from these "active" sources could then be tweeted into the Twitter stream, thus enriching its quality. The objective of our current work is to mine these tweet-linked images and documents, using neural networks, to increase the information content and quality related to precipitation. For images, we classified them as either precipitation-related or not. For training and validation, we used images obtained via the Google custom search API. We created two models: (1) by training a simple Convolutional Neural Network and (2) by using transfer learning principles to adapt a pre-trained object recognition model. For documents, both those linked to tweets and the tweet contents, we trained Hierarchical Attention Networks to determine precipitation occurrence, type, and intensity.
For training and validation, we used a keyword-filtered tweet data set labelled with ground truth data from Dark Sky (an API to retrieve weather-related labels) and the National Severe Storms Laboratory's Multi-Radar/Multi-Sensor (MRMS) system. Our results demonstrated the efficacy of our machine learning approaches for enriching the Twitter stream, to derive information potentially useful for validation of Earth science satellite data.
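As a rough illustration of the "keyword-filtered tweet data set labelled with ground truth data" step, the sketch below filters tweets by precipitation keywords and attaches a label from a ground-truth lookup. It is not the authors' implementation: the keyword list, the `Tweet` fields, and the `ground_truth` callable (a stand-in for a Dark Sky or MRMS query) are all assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Optional, Tuple

# Hypothetical keyword list; the abstract does not give the actual filter terms.
PRECIP_KEYWORDS = ("rain", "raining", "snow", "snowing",
                   "hail", "sleet", "drizzle", "downpour")

# Word-boundary match so that "rain" does not fire on "training" or "brain".
_KEYWORD_RE = re.compile(
    r"\b(" + "|".join(map(re.escape, PRECIP_KEYWORDS)) + r")\b",
    re.IGNORECASE,
)


@dataclass
class Tweet:
    text: str
    lat: float
    lon: float
    timestamp: int  # seconds since epoch (assumed representation)


@dataclass
class LabelledTweet:
    tweet: Tweet
    matched_keyword: str
    is_precipitating: bool


def keyword_filter(tweets: Iterable[Tweet]) -> Iterator[Tuple[Tweet, str]]:
    """Yield (tweet, keyword) for tweets that mention a precipitation keyword."""
    for tweet in tweets:
        match = _KEYWORD_RE.search(tweet.text)
        if match:
            yield tweet, match.group(1).lower()


def label_tweets(
    tweets: Iterable[Tweet],
    ground_truth: Callable[[float, float, int], Optional[bool]],
) -> List[LabelledTweet]:
    """Attach a ground-truth label to each keyword-matched tweet.

    `ground_truth` is a placeholder for a weather lookup; it returns
    None when no observation covers the tweet's place and time, in
    which case the tweet is left out of the labelled set.
    """
    labelled = []
    for tweet, keyword in keyword_filter(tweets):
        observed = ground_truth(tweet.lat, tweet.lon, tweet.timestamp)
        if observed is None:
            continue
        labelled.append(LabelledTweet(tweet, keyword, observed))
    return labelled
```

Tweets that match a keyword but receive a negative label (for example, figurative uses of "rain") are kept on purpose, since a classifier needs such negative examples to learn from.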

【 Preview 】
Attachments
File: 20180008558.pdf (PDF, 1979 KB)