{"title":"Multi-scale exposure fusion via gradient domain guided image filtering","authors":"F. Kou, Zhengguo Li, C. Wen, Weihai Chen","doi":"10.1109/ICME.2017.8019529","DOIUrl":"https://doi.org/10.1109/ICME.2017.8019529","url":null,"abstract":"Multi-scale exposure fusion is an efficient way to fuse differently exposed low dynamic range (LDR) images of a high dynamic range (HDR) scene into a high quality LDR image directly. It can produce images with higher quality than single-scale exposure fusion, but has a risk of producing halo artifacts and cannot preserve details in brightest or darkest regions well in the fused image. In this paper, an edge-preserving smoothing pyramid is introduced for the multi-scale exposure fusion. Benefiting from the edge-preserving property of the filter used in the algorithm, the details in the brightest/darkest regions are preserved well and no halo artifacts are produced in the fused image. The experimental results prove that the proposed algorithm produces better fused images than the state-of-the-art algorithms both qualitatively and quantitatively.","PeriodicalId":330977,"journal":{"name":"2017 IEEE International Conference on Multimedia and Expo (ICME)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134539607","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A closer look: Small object detection in faster R-CNN","authors":"C. Eggert, Stephan Brehm, Anton Winschel, D. Zecha, R. Lienhart","doi":"10.1109/ICME.2017.8019550","DOIUrl":"https://doi.org/10.1109/ICME.2017.8019550","url":null,"abstract":"Faster R-CNN is a well-known approach for object detection which combines the generation of region proposals and their classification into a single pipeline. In this paper we apply Faster R-CNN to the task of company logo detection. Motivated by the weak performance of Faster R-CNN on small object instances, we perform a detailed examination of both the proposal and the classification stage, examining their behavior for a wide range of object sizes. Additionally, we look at the influence of feature map resolution on the performance of those stages. We introduce an improved scheme for generating anchor proposals and propose a modification to Faster R-CNN which leverages higher-resolution feature maps for small objects. We evaluate our approach on the Flicker data set improving the detection performance on small object instances.","PeriodicalId":330977,"journal":{"name":"2017 IEEE International Conference on Multimedia and Expo (ICME)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133509759","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Visual relationship detection with object spatial distribution","authors":"Yaohui Zhu, Shuqiang Jiang, Xiangyang Li","doi":"10.1109/ICME.2017.8019448","DOIUrl":"https://doi.org/10.1109/ICME.2017.8019448","url":null,"abstract":"Recently, object recognition techniques have been rapidly developed. Most of existing object recognition focused on recognizing several independent concepts. The relationship of objects is also an important problem, which shows in-depth semantic information of images. In this work, toward general visual relationship detection, we propose a method to integrate spatial distribution of object to facilitate visual relation detection. Spatial distribution can not only reflect positional relation of object but also describe structural information between objects. Spatial distributions are described with different features such as positional relation, size relation, shape relation, and so on. By combing spatial distribution features with visual and concept features, we establish a modeling method to make these three aspects working together to facilitate visual relationship detection. To evaluate the proposed method, we conduct experiments on two datasets, which are the Stanford VRD dataset, and a newly proposed larger new dataset which contains 15k images. Experimental results demonstrate that our approach is effective.","PeriodicalId":330977,"journal":{"name":"2017 IEEE International Conference on Multimedia and Expo (ICME)","volume":"142 2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134034429","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Robust human detection with super-pixel segmentation and random ferns classification using RGB-D camera","authors":"Luchao Tian, Mingchen Li, Guyue Zhang, Jingwen Zhao, Y. Chen","doi":"10.1109/ICME.2017.8019303","DOIUrl":"https://doi.org/10.1109/ICME.2017.8019303","url":null,"abstract":"Efficient and robust detection of humans has received great attention during the past few decades. This paper presents a two-staged approach for human detection in RGB-D images. As the traditional sliding window-based methods for target localization are often time-consuming, we propose to use the super-pixel method in depth data to efficiently locate the plausible head-top locations in the first stage. In the second stage, we propose to use Random Ferns to seek the features by combining information from different image spaces, which can select the most discriminative features and compute simple and fast Local Binary Features (LBFs) allowing for real-time applications. We evaluate our method on three publicly available challenging datasets taken by a Kinect camera. Experimental results demonstrate that the proposed approach can robustly detect humans in complicated environments.","PeriodicalId":330977,"journal":{"name":"2017 IEEE International Conference on Multimedia and Expo (ICME)","volume":"31 8","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113931671","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep networks for compressed image sensing","authors":"Wuzhen Shi, F. Jiang, Shengping Zhang, Debin Zhao","doi":"10.1109/ICME.2017.8019428","DOIUrl":"https://doi.org/10.1109/ICME.2017.8019428","url":null,"abstract":"The compressed sensing (CS) theory has been successfully applied to image compression in the past few years as most image signals are sparse in a certain domain. Several CS reconstruction models have been recently proposed and obtained superior performance. However, there still exist two important challenges within the CS theory. The first one is how to design a sampling mechanism to achieve an optimal sampling efficiency, and the second one is how to perform the reconstruction to get the highest quality to achieve an optimal signal recovery. In this paper, we try to deal with these two problems with a deep network. First of all, we train a sampling matrix via the network training instead of using a traditional manually designed one, which is much appropriate for our deep network based reconstruct process. Then, we propose a deep network to recover the image, which imitates traditional compressed sensing reconstruction processes. Experimental results demonstrate that our deep networks based CS reconstruction method offers a very significant quality improvement compared against state-of-the-art ones.","PeriodicalId":330977,"journal":{"name":"2017 IEEE International Conference on Multimedia and Expo (ICME)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128819747","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Classifying derivative works with search, text, audio and video features","authors":"Jordan B. L. Smith, Masahiro Hamasaki, Masataka Goto","doi":"10.1109/ICME.2017.8019444","DOIUrl":"https://doi.org/10.1109/ICME.2017.8019444","url":null,"abstract":"Users of video-sharing sites often search for derivative works of music, such as live versions, covers, and remixes. Audio and video content are both important for retrieval: “karaoke” specifies audio content (instrumental version) and video content (animated lyrics). Although YouTube's text search is fairly reliable, many search results do not match the exact query. We introduce an algorithm to classify YouTube videos by category of derivative work. Based on a standard pipeline for video-based genre classification, it combines search, text, and video features with a novel set of audio features derived from audio fingerprints. A baseline approach is outperformed by the search and text features alone, and combining these with video and audio features performs best of all, reducing the audio content error rate from 25% to 15%.","PeriodicalId":330977,"journal":{"name":"2017 IEEE International Conference on Multimedia and Expo (ICME)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116709028","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A joint model for action localization and classification in untrimmed video with visual attention","authors":"Weimian Li, Wenmin Wang, Xiongtao Chen, Jinzhuo Wang, Ge Li","doi":"10.1109/ICME.2017.8019335","DOIUrl":"https://doi.org/10.1109/ICME.2017.8019335","url":null,"abstract":"In this paper, we introduce a joint model that learns to directly localize the temporal bounds of actions in untrimmed videos as well as precisely classify what actions occur. Most existing approaches tend to scan the whole video to generate action instances, which are really inefficient. Instead, inspired by human perception, our model is formulated based on a recurrent neural network to observe different locations within a video over time. And, it is capable of producing temporal localizations by only observing a fixed number of fragments, and the amount of computation it performs is independent of input video size. The decision policy for determining where to look next is learned by REINFORCE which is powerful in non-differentiable settings. In addition, different from relevant ways, our model runs localization and classification serially, and possesses a strategy for extracting appropriate features to classify. We evaluate our model on ActivityNet dataset, and it greatly outperforms the baseline. Moreover, compared with a recent approach, we show that our serial design can bring about 9% increase in detection performance.","PeriodicalId":330977,"journal":{"name":"2017 IEEE International Conference on Multimedia and Expo (ICME)","volume":"138 3","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120843102","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A study on lidar data forensics","authors":"K. Bahirat, B. Prabhakaran","doi":"10.1109/ICME.2017.8019395","DOIUrl":"https://doi.org/10.1109/ICME.2017.8019395","url":null,"abstract":"3D LiDAR (Light Imaging Detection and Ranging) data has recently been used in a wide range of applications such as vehicle automation and crime scene reconstruction. Decision making in such applications is highly dependent on LiDAR data. Thus, it becomes crucial to authenticate the data before using it. Though authentication of 2D digital images and video has been widely studied, the area of 3D data forensic is relatively unexplored. In this paper, we investigate and identify three possible attacks on the LiDAR data. We also propose two novel forensic approaches as a countermeasure for such attacks and study their effectiveness. The first forensic approach utilises the density consistency check while the second method leverages the occlusion effect for revealing the forgery. Experimental results demonstrate the effectiveness of the proposed forgery attacks and raise the awareness against unauthenticated use of LiDAR data. The performance analyses of the proposed forensic approaches indicate that the proposed methods are very efficient and provide the detection accuracy of more than 95% for certain kinds of forgery attacks. While the forensic approach is unable to handle all forgery attacks, the study motivates to explore more sophisticated forensic methods for LiDAR data.","PeriodicalId":330977,"journal":{"name":"2017 IEEE International Conference on Multimedia and Expo (ICME)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127344153","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning deep semantic attributes for user video summarization","authors":"Ke Sun, Jiasong Zhu, Zhuo Lei, Xianxu Hou, Qian Zhang, Jiang Duan, G. Qiu","doi":"10.1109/ICME.2017.8019411","DOIUrl":"https://doi.org/10.1109/ICME.2017.8019411","url":null,"abstract":"This paper presents a Semantic Attribute assisted video SUMmarization framework (SASUM). Compared with traditional methods, SASUM has several innovative features. Firstly, we use a natural language processing tool to discover a set of keywords from an image and text corpora to form the semantic attributes of visual contents. Secondly, we train a deep convolution neural network to extract visual features as well as predict the semantic attributes of video segments which enables us to represent video contents with visual and semantic features simultaneously. Thirdly, we construct a temporally constrained video segment affinity matrix and use a partially near duplicate image discovery technique to cluster visually and semantically consistent video frames together. These frame clusters can then be condensed to form an informative and compact summary of the video. We will present experimental results to show the effectiveness of the semantic attributes in assisting the visual features in video summarization and our new technique achieves state-of-the-art performance.","PeriodicalId":330977,"journal":{"name":"2017 IEEE International Conference on Multimedia and Expo (ICME)","volume":"497 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123565153","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Image restoration via multi-scale non-local total variation regularization","authors":"Jing Mu, Ruiqin Xiong, Xiaopeng Fan, Siwei Ma","doi":"10.1109/ICME.2017.8019463","DOIUrl":"https://doi.org/10.1109/ICME.2017.8019463","url":null,"abstract":"Total-variation (TV) regularization is widely adopted in image restoration problems to exploit the local smoothness of image. However, traditional TV regularization only models the sparsity of image gradient at the original scale. This paper introduces a multi-scale TV regularization method which models the image gradient sparsity at different scales, and constrains the gradient magnitude of different scales jointly. As different scales extract different frequency of image, our proposed multi-scale regularization method provides constraints for different frequency components. And for each scale, we adaptively estimate the gradient distribution at a particular pixel from a group of nonlocally searched similar patches. Finally, experimental results demonstrate that the proposed method outperforms the conventional TV regularization methods for image restoration.","PeriodicalId":330977,"journal":{"name":"2017 IEEE International Conference on Multimedia and Expo (ICME)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125497281","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}