{"title":"The OUC-vision large-scale underwater image database","authors":"Muwei Jian, Qiang Qi, Junyu Dong, Yinlong Yin, Wenyin Zhang, K. Lam","doi":"10.1109/ICME.2017.8019324","DOIUrl":"https://doi.org/10.1109/ICME.2017.8019324","url":null,"abstract":"In this paper, a large-scale underwater image database for underwater salient object detection or saliency detection is presented in detail. This database is called the OUC-VISION underwater image database, which contains 4400 underwater images of 220 individual objects. Each object is captured with four pose variations (the frontal-, the opposite-, the left-, and the right-views of each underwater object) and five spatial locations (the underwater object is located at the top-left corner, the top-right corner, the center, the bottom-left corner, and the bottom-right corner) to obtain 20 images. Meanwhile, this publicly available OUC-VISION database also provides relevant industrial fields, and academic researchers with underwater images under different sources of variations, especially pose, spatial location, illumination, turbidity of water, etc. Ground-truth information is also manually labelled for this database. The OUC-VISION database can not only be widely used to assess and evaluate the performance of the state-of-the-art salient-object detection and saliency-detection algorithms for general images, but also will particularly benefit the development of underwater vision technology in the future.","PeriodicalId":330977,"journal":{"name":"2017 IEEE International Conference on Multimedia and Expo (ICME)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125264697","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast 3D point cloud segmentation using supervoxels with geometry and color for 3D scene understanding","authors":"Francesco Verdoja, D. Thomas, A. Sugimoto","doi":"10.1109/ICME.2017.8019382","DOIUrl":"https://doi.org/10.1109/ICME.2017.8019382","url":null,"abstract":"Segmentation of 3D colored point clouds is a research field with renewed interest thanks to recent availability of inexpensive consumer RGB-D cameras and its importance as an unavoidable low-level step in many robotic applications. However, 3D data's nature makes the task challenging and, thus, many different techniques are being proposed, all of which require expensive computational costs. This paper presents a novel fast method for 3D colored point cloud segmentation. It starts with supervoxel partitioning of the cloud, i.e., an oversegmentation of the points in the cloud. Then it leverages on a novel metric exploiting both geometry and color to iteratively merge the supervoxels to obtain a 3D segmentation where the hierarchical structure of partitions is maintained. The algorithm also presents computational complexity linear to the size of the input. Experimental results over two publicly available datasets demonstrate that our proposed method outperforms state-of-the-art techniques.","PeriodicalId":330977,"journal":{"name":"2017 IEEE International Conference on Multimedia and Expo (ICME)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114845153","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Toward the realization of six degrees-of-freedom with compressed light fields","authors":"A. Hinds, D. Doyen, P. Carballeira","doi":"10.1109/ICME.2017.8019543","DOIUrl":"https://doi.org/10.1109/ICME.2017.8019543","url":null,"abstract":"360° video, supporting the ability to present views consistent with the rotation of the viewer's head along three axes (roll, pitch, yaw) is the current approach for creation of immersive video experiences. Nevertheless, a more fully natural, photorealistic experience — with support of visual cues that facilitate coherent psycho-visual sensory fusion without the side-effect of cyber-sickness — is desired. 360° video applications that additionally enable the user to translate in x, y, and z directions are clearly a subsequent frontier to be realized toward the goal of sensory fusion without cyber-sickness. Such support of full Six Degrees-of-Freedom (6 DoF) for next generation immersive video is a natural application for light fields. However, a significant obstacle to the adoption of light field technologies is the large data necessary to ensure that the light rays corresponding to the viewer's position relative to 6-DoF are properly delivered, either from captured light information or synthesized from available views. Experiments to improve known methods for view synthesis and depth estimation are therefore a fundamental next step to establish a reference framework within which compression technologies can be evaluated. This paper describes a testbed and experiments to enable smooth and artefact-free view transitions that can later be used in a framework to study how best to compress the data.","PeriodicalId":330977,"journal":{"name":"2017 IEEE International Conference on Multimedia and Expo (ICME)","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123136048","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep convolutional recurrent neural network with attention mechanism for robust speech emotion recognition","authors":"Che-Wei Huang, Shrikanth S. Narayanan","doi":"10.1109/ICME.2017.8019296","DOIUrl":"https://doi.org/10.1109/ICME.2017.8019296","url":null,"abstract":"We present a deep convolutional recurrent neural network for speech emotion recognition based on the log-Mel filterbank energies, where the convolutional layers are responsible for the discriminative feature learning. Based on the hypothesis that a better understanding of the internal configuration within an utterance would help reduce misclassification, we further propose a convolutional attention mechanism to learn the utterance structure relevant to the task. In addition, we quantitatively measure the performance gain contributed by each module in our model in order to characterize the nature of emotion expressed in speech. The experimental results on the eNTERFACE'05 emotion database validate our hypothesis and also demonstrate an absolute improvement by 4.62% compared to the state-of-the-art approach.","PeriodicalId":330977,"journal":{"name":"2017 IEEE International Conference on Multimedia and Expo (ICME)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122448941","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Perceptual quality assessment of 3D synthesized images","authors":"M. S. Farid, M. Lucenteforte, Marco Grangetto","doi":"10.1109/ICME.2017.8019307","DOIUrl":"https://doi.org/10.1109/ICME.2017.8019307","url":null,"abstract":"Multiview video plus depth (MVD) is the most popular 3D video format where the texture images contain the color information and the depth maps represent the geometry of the scene. The depth maps are exploited to obtain intermediate views to enable 3D-TV and free-viewpoint applications using the depth image based rendering (DIBR) techniques. DIBR is used to get an estimate of the intermediate views but has to cope with depth errors, occlusions, imprecise camera parameters, re-interpolation, to mention a few issues. Therefore, being able to evaluate the true perceptual quality of synthesized images is of paramount importance for a high quality 3D experience. In this paper, we present a novel algorithm to assess the quality of the synthesized images in the absence of the corresponding references. The algorithm uses the original views from which the virtual image is generated to estimate the distortion induced by the DIBR process. In particular, a block-based perceptual feature matching based on signal phase congruency metric is devised to estimate the synthesis distortion. The experiments worked out on standard DIBR synthesized database show that the proposed algorithm achieves high correlation with the subjective ratings and outperforms the existing 3D quality assessment algorithms.","PeriodicalId":330977,"journal":{"name":"2017 IEEE International Conference on Multimedia and Expo (ICME)","volume":"24 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131171582","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Large-scale person re-identification as retrieval","authors":"Hantao Yao, Shiliang Zhang, Dongming Zhang, Yongdong Zhang, Jintao Li, Yu Wang, Q. Tian","doi":"10.1109/ICME.2017.8019485","DOIUrl":"https://doi.org/10.1109/ICME.2017.8019485","url":null,"abstract":"This paper targets to bring together the research efforts on two fields that are growing actively in the past few years: multicamera person Re-Identification (ReID) and large-scale image retrieval. We demonstrate that the essentials of image retrieval and person ReID are the same, i.e., measuring the similarity between images. However, person ReID requires more discriminative and robust features to identify the subtle differences of different persons and overcome the large variance among images of the same person. Specifically, we propose a coarse-to-fine (C2F) framework and a Convolutional Neural Network structure named as Conv-Net to tackle the large-scale person ReID as an image retrieval task. Given a query person image, the C2F firstly employ Conv-Net to extract a compact descriptor and perform the coarse-level search. A robust descriptor conveying more spatial cues is hence extracted to perform the fine-level search. Extensive experimental results show that the proposed method outperforms existing methods on two public datasets. Further, the evaluation on a large-scale Person-520K dataset demonstrates that our work is significantly more efficient than existing works, e.g., only needs 180ms to identify a query person from 520K images.","PeriodicalId":330977,"journal":{"name":"2017 IEEE International Conference on Multimedia and Expo (ICME)","volume":"17 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131352880","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Decoder-side HEVC quality enhancement with scalable convolutional neural network","authors":"Ren Yang, Mai Xu, Zulin Wang","doi":"10.1109/ICME.2017.8019299","DOIUrl":"https://doi.org/10.1109/ICME.2017.8019299","url":null,"abstract":"The latest High Efficiency Video Coding (HEVC) has been increasingly used to generate video streams over Internet. However, the decoded HEVC video streams may incur severe quality degradation, especially at low bit-rates. Thus, it is necessary to enhance visual quality of HEVC videos at the decoder side. To this end, we propose in this paper a Decoder-side Scalable Convolutional Neural Network (DS-CNN) approach to achieve quality enhancement for HEVC, which does not require any modification of the encoder. In particular, our DS-CNN approach learns a model of Convo-lutional Neural Network (CNN) to reduce distortion of both I and B/P frames in HEVC. It is different from the existing CNN-based quality enhancement approaches, which only handle intra coding distortion, thus not suitable for B/P frames. Furthermore, a scalable structure is included in our DS-CNN, suchthat the computational complexity of our DS-CNN approach is adjustable to the changing computational resources. Finally, the experimental results show the effectiveness of our DS-CNN approach in enhancing quality for both I and B/P frames of HEVC.","PeriodicalId":330977,"journal":{"name":"2017 IEEE International Conference on Multimedia and Expo (ICME)","volume":"210 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115759284","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Blind proposal quality assessment via deep objectness representation and local linear regression","authors":"Q. Wu, Hongliang Li, Fanman Meng, K. Ngan, Linfeng Xu","doi":"10.1109/ICME.2017.8019305","DOIUrl":"https://doi.org/10.1109/ICME.2017.8019305","url":null,"abstract":"The quality of object proposal plays an important role in boosting the performance of many computer vision tasks, such as, object detection and recognition. Due to the absence of manually annotated bounding-box in practice, the quality metric towards blind assessment of object proposal is highly desirable for singling out the optimal proposals. In this paper, we propose a blind proposal quality assessment algorithm based on the Deep Objectness Representation and Local Linear Regression (DORLLR). Inspired by the hierarchy model of the human vision system, a deep convolutional neural network is developed to extract the objectness-aware image feature. Then, the local linear regression method is utilized to map the image feature to a quality score, which tries to evaluate each individual test window based on its k-nearest-neighbors. Experimental results on a large-scale IoU labeled dataset verify that the proposed method significantly outperforms the state-of-the-art blind proposal evaluation metrics.","PeriodicalId":330977,"journal":{"name":"2017 IEEE International Conference on Multimedia and Expo (ICME)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116975124","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Partially occluded facial action recognition and interaction in virtual reality applications","authors":"U. Ciftci, Xing Zhang, Lijun Tin","doi":"10.1109/ICME.2017.8019545","DOIUrl":"https://doi.org/10.1109/ICME.2017.8019545","url":null,"abstract":"The proliferation of affordable virtual reality (VR) head mounted displays (HMD) provides users with realistic immersive visual experiences. However, HMDs occlude upper half of a user's face and prevent the facial action recognition from the entire face. Therefore, entire face cannot be used as a source of feedback for more interactive virtual reality applications. To tackle this problem, we propose a new depth based recognition framework that recognizes mouth gestures and uses those recognized mouth gestures as a medium of interaction within virtual reality in real-time. Our system uses a new 3D edge map approach to describe mouth features, and further classifies those features into seven different gesture classes. The accuracy of the proposed mouth gesture framework is evaluated in user independent tests and achieved high correct recognition rates. The system has also been demonstrated and validated through a real-time virtual reality application.","PeriodicalId":330977,"journal":{"name":"2017 IEEE International Conference on Multimedia and Expo (ICME)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129338782","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving acoustic modeling using audio-visual speech","authors":"A. H. Abdelaziz","doi":"10.1109/ICME.2017.8019294","DOIUrl":"https://doi.org/10.1109/ICME.2017.8019294","url":null,"abstract":"Reliable visual features that encode the articulator movements of speakers can dramatically improve the decoding accuracy of automatic speech recognition systems when combined with the corresponding acoustic signals. In this paper, a novel framework is proposed to utilize audio-visual speech not only during decoding but also for training better acoustic models. In this framework, a multi-stream hidden Markov model is iteratively deployed to fuse audio and video likelihoods. The fused likelihoods are used to estimate enhanced frame-state alignments, which are finally used as better training targets. The proposed framework is so flexible that it can be partially used to train acoustic models with the available audio-visual data while a conventional training strategy can be followed with the remaining acoustic data. The experimental results show that the acoustic models trained using the proposed audio-visual framework perform significantly better than those trained conventionally with solely acoustic data in clean and noisy conditions.","PeriodicalId":330977,"journal":{"name":"2017 IEEE International Conference on Multimedia and Expo (ICME)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133412849","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}