Over the past decade, the popularity of the Internet and digital cameras has led to a proliferation of images and videos, and the volume of surveillance video is growing explosively as ever more surveillance cameras are deployed. Compared with traditional computer vision datasets, which contain only thousands of images, these Internet-era datasets have grown far larger and pose a serious challenge for visual recognition and detection. To handle visual recognition in such complicated scenarios, we believe that a single feature is not enough to distinguish web-scale visual concepts. Accordingly, this dissertation proposes to combine heterogeneous features for different visual recognition tasks. We first develop a framework called Heterogeneous Feature Machines to effectively fuse multiple types of visual features. In addition, we observe that specific applications, such as consumer photo annotation or surveillance action detection, offer application-specific cues that are helpful for visual recognition. We consider three scenarios: (1) consumer photo recognition, where we explore the use of metadata such as time and GPS; (2) Web image search and annotation, where we combine user tags and network information; and (3) action detection in videos, where spatial-temporal coherence is combined with multiple visual features. We believe heterogeneous feature fusion is useful in a wide range of applications and merits further research.
Heterogeneous Feature Fusion for Visual Recognition