Dissertation Details
Leveraging mid-level representations for complex activity recognition
Author: Ahsan, Unaiza
Advisor: Essa, Irfan
Committee: Hays, James; De Choudhury, Munmun; Kira, Zsolt; Parikh, Devi; Sun, Chen
University:Georgia Institute of Technology
Department:Interactive Computing
Keywords: Activity recognition; Self-supervised learning; Event recognition
Others: https://smartech.gatech.edu/bitstream/1853/61199/1/AHSAN-DISSERTATION-2019.pdf
United States | English
Source: SMARTech Repository
【 Abstract 】

Dynamic scene understanding requires learning representations of the components of a scene, including objects, environments, actions, and events. Complex activity recognition from images and videos requires annotating large datasets with action labels, which is a tedious and expensive task. Thus, there is a need for a mid-level or intermediate feature representation that does not require millions of labels yet generalizes to semantic-level recognition of activities in visual data. This thesis makes three contributions in this regard. First, we propose an event concept-based intermediate representation that learns concepts via the Web and uses this representation to identify events even with a single labeled example. To demonstrate the strength of the proposed approaches, we contribute two diverse social event datasets to the community. We then present a use case of event concepts as a mid-level representation that generalizes to sentiment recognition in diverse social event images. Second, we propose to train Generative Adversarial Networks (GANs) on video frames (which requires no labels), use the trained GAN discriminator as an intermediate representation, and fine-tune it on a smaller labeled video activity dataset to recognize actions in videos. This unsupervised pre-training step avoids manual feature engineering, video frame encoding, and searching for the best video frame sampling technique. Our third contribution is a self-supervised learning approach for videos that exploits both spatial and temporal coherency to learn feature representations on video data without any supervision. We demonstrate the transfer learning capability of this model on smaller labeled datasets. We present a comprehensive experimental analysis of the self-supervised model to provide insights into the unsupervised pretraining paradigm and how it can help with activity recognition on target datasets that the model never saw during training.
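The abstract does not spell out the pretext task used for the self-supervised contribution; as one minimal illustration of how temporal coherency can supply labels for free (the function name and sampling details below are my own, not the thesis's actual method), a temporal-order verification task labels a short clip by whether its frames appear in their natural order:

```python
import random


def make_order_examples(num_frames, n_pos=2, n_neg=2, clip_len=4, seed=0):
    """Generate pretext-task examples for temporal-order verification.

    Positive examples keep the video's natural frame order; negative
    examples shuffle it. The labels come from the video's own temporal
    structure, so no human annotation is required -- the core idea of
    self-supervised pretraining on video.
    """
    rng = random.Random(seed)
    frames = list(range(num_frames))  # stand-in for real frame indices
    examples = []
    for _ in range(n_pos):
        start = rng.randrange(num_frames - clip_len + 1)
        clip = frames[start:start + clip_len]
        examples.append((clip, 1))  # in order -> label 1
    for _ in range(n_neg):
        start = rng.randrange(num_frames - clip_len + 1)
        clip = frames[start:start + clip_len]
        shuffled = clip[:]
        while shuffled == clip:  # ensure the order is actually broken
            rng.shuffle(shuffled)
        examples.append((shuffled, 0))  # out of order -> label 0
    return examples
```

A network trained to predict these binary labels must learn features sensitive to motion and temporal structure, which can then be fine-tuned on a smaller labeled activity dataset, matching the transfer-learning setup the abstract describes.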

【 Preview 】
Attachment list
Files | Size | Format | View
Leveraging mid-level representations for complex activity recognition | 11836 KB | PDF | download
Document metrics
Downloads: 9; Views: 8