{"title":"Learning Semantic Structure-preserved Embeddings for Cross-modal Retrieval","authors":"Yiling Wu, Shuhui Wang, Qingming Huang","doi":"10.1145/3240508.3240521","DOIUrl":"https://doi.org/10.1145/3240508.3240521","url":null,"abstract":"This paper learns semantic embeddings for multi-label cross-modal retrieval. Our method exploits the structure in semantics represented by label vectors to guide the learning of embeddings. First, we construct a semantic graph based on label vectors that incorporates data from both modalities, and enforce the embeddings to preserve the local structure of this semantic graph. Second, we enforce the embeddings to reconstruct the labels well, i.e., the global semantic structure. In addition, we encourage the embeddings to preserve the local geometric structure of each modality. Accordingly, local and global semantic structure consistency as well as local geometric structure consistency are enforced simultaneously. The mappings between inputs and embeddings are designed as nonlinear neural networks, which offer larger capacity and more flexibility. The overall objective function is optimized by stochastic gradient descent for scalability to large datasets. Experiments conducted on three real-world datasets clearly demonstrate the superiority of our proposed approach over state-of-the-art methods.","PeriodicalId":339857,"journal":{"name":"Proceedings of the 26th ACM international conference on Multimedia","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125565229","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Unsupervised Learning of 3D Model Reconstruction from Hand-Drawn Sketches","authors":"Lingjing Wang, Cheng Qian, Jifei Wang, Yi Fang","doi":"10.1145/3240508.3240699","DOIUrl":"https://doi.org/10.1145/3240508.3240699","url":null,"abstract":"3D object modeling has gained considerable attention in the visual computing community. We propose a low-cost unsupervised learning model for 3D object reconstruction from hand-drawn sketches. Recent advancements in deep learning have opened new opportunities to learn high-quality 3D objects from 2D sketches via supervised networks. However, the limited availability of labeled 2D hand-drawn sketch data (i.e., sketches and their corresponding 3D ground-truth models) hinders the training process of supervised methods. In this paper, driven by a novel design that combines retrieval and reconstruction, we develop a learning paradigm to reconstruct 3D objects from hand-drawn sketches without the use of well-labeled hand-drawn sketch data during the entire training process. Specifically, the paradigm begins by training an adaptation network, an autoencoder with an adversarial loss, that embeds the unpaired 2D rendered-image domain and the hand-drawn sketch domain into a shared latent vector space. Then, from this latent space, for each test sketch we retrieve a few (e.g., five) nearest neighbors from the training 3D dataset as prior knowledge for a 3D Generative Adversarial Network. Our experiments verify our network's robust and superior performance in generating 3D volumetric objects from a single hand-drawn sketch without requiring any 3D ground-truth labels.","PeriodicalId":339857,"journal":{"name":"Proceedings of the 26th ACM international conference on Multimedia","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126644567","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Attentive Interactive Convolutional Matching for Community Question Answering in Social Multimedia","authors":"Jun Hu, Shengsheng Qian, Quan Fang, Changsheng Xu","doi":"10.1145/3240508.3240626","DOIUrl":"https://doi.org/10.1145/3240508.3240626","url":null,"abstract":"Nowadays, community-based question answering (CQA) services have accumulated millions of users who share valuable knowledge. An essential function in CQA tasks is the accurate matching of answers w.r.t. given questions. Existing methods usually ignore the redundant, heterogeneous, and multi-modal properties of CQA systems. In this paper, we propose a multi-modal attentive interactive convolutional matching method (MMAICM) that models the multi-modal content and social context jointly for questions and answers in a unified framework for CQA retrieval, exploring the redundant, heterogeneous, and multi-modal properties of CQA systems jointly. A well-designed attention mechanism is proposed to focus on useful word-pair interactions and to neglect meaningless or noisy ones. Moreover, a multi-modal interaction matrix method and a novel meta-path based network representation approach are proposed to consider the multi-modal content and social context, respectively. The attentive interactive convolutional matching network is proposed to infer the relevance between questions and answers, capturing both the lexical and the sequential information of the contents. Experimental results on two real-world datasets demonstrate the superior performance of MMAICM compared with other state-of-the-art algorithms.","PeriodicalId":339857,"journal":{"name":"Proceedings of the 26th ACM international conference on Multimedia","volume":"115 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123339003","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Attention and Language Ensemble for Scene Text Recognition with Convolutional Sequence Modeling","authors":"Shancheng Fang, Hongtao Xie, Zhengjun Zha, Nannan Sun, Jianlong Tan, Yongdong Zhang","doi":"10.1145/3240508.3240571","DOIUrl":"https://doi.org/10.1145/3240508.3240571","url":null,"abstract":"Recent dominant approaches for scene text recognition are mainly based on convolutional neural networks (CNN) and recurrent neural networks (RNN), where the CNN processes images and the RNN generates character sequences. Different from these methods, we propose an attention-based architecture that is built entirely on CNNs. The distinctive characteristics of our method include: (1) The method follows an encoder-decoder architecture, in which the encoder is a two-dimensional residual CNN and the decoder is a deep one-dimensional CNN. (2) An attention module that captures visual cues and a language module that models linguistic rules are given equal roles in the decoder, so the two can be viewed as an ensemble that boosts predictions jointly. (3) Instead of using a single loss from the language aspect, multiple losses from the attention and language modules are accumulated to train the networks in an end-to-end way. We conduct experiments on standard datasets for scene text recognition, including Street View Text, IIIT5K and ICDAR datasets. The experimental results show that our CNN-based method achieves state-of-the-art performance on several benchmark datasets, even without the use of RNNs.","PeriodicalId":339857,"journal":{"name":"Proceedings of the 26th ACM international conference on Multimedia","volume":"136 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125813784","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ALERT","authors":"K. Bahirat, Umang Shah, A. Cárdenas, B. Prabhakaran","doi":"10.1145/3240508.3241912","DOIUrl":"https://doi.org/10.1145/3240508.3241912","url":null,"abstract":"","PeriodicalId":339857,"journal":{"name":"Proceedings of the 26th ACM international conference on Multimedia","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114214107","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DASH for 3D Networked Virtual Environment","authors":"Thomas Forgione, A. Carlier, Géraldine Morin, Wei Tsang Ooi, V. Charvillat, P. Yadav","doi":"10.1145/3240508.3240701","DOIUrl":"https://doi.org/10.1145/3240508.3240701","url":null,"abstract":"DASH is now a widely deployed standard for streaming video content due to its simplicity, scalability, and ease of deployment. In this paper, we explore the use of DASH for a different type of media content -- a networked virtual environment (NVE), with different properties and requirements. We organize a polygon soup with textures into a structure that is compatible with the DASH MPD (Media Presentation Description), with a minimal set of view-independent metadata that lets the client make intelligent decisions about what data to download at which resolution. We also present a DASH-based NVE client that uses a view-dependent and network-dependent utility metric to decide what to download, based only on the information in the MPD file. We show that DASH can be used for streaming 3D content in NVEs. Our work opens up the possibility of using DASH for highly interactive applications, beyond its current use in video streaming.","PeriodicalId":339857,"journal":{"name":"Proceedings of the 26th ACM international conference on Multimedia","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116154909","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Partial Multi-view Subspace Clustering","authors":"Nan Xu, Yanqing Guo, Xin Zheng, Qianyu Wang, Xiangyang Luo","doi":"10.1145/3240508.3240679","DOIUrl":"https://doi.org/10.1145/3240508.3240679","url":null,"abstract":"For many real-world multimedia applications, data are often described by multiple views, so multi-view learning research is of great significance. Traditional multi-view clustering methods assume that each view has complete data. However, missing or partial data are common in real tasks, which gives rise to partial multi-view learning. To address the partial multi-view problem, we propose a novel multi-view clustering method called Partial Multi-view Subspace Clustering (PMSC). Unlike most existing partial multi-view clustering methods that only learn a new representation of the original data, our method seeks the latent space and performs data reconstruction simultaneously to learn the subspace representation. The learned subspace representation can reveal the underlying subspace structure embedded in the original data, leading to a more comprehensive data description. In addition, we enforce the subspace representation to be non-negative, yielding an intuitive interpretation of the weights among different data points. The proposed method can be optimized by the Augmented Lagrange Multiplier (ALM) algorithm. Experiments on one synthetic dataset and four benchmark datasets validate the effectiveness of PMSC under the partial multi-view scenario.","PeriodicalId":339857,"journal":{"name":"Proceedings of the 26th ACM international conference on Multimedia","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121187751","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: FF-1","authors":"C. Changwen","doi":"10.1145/3286915","DOIUrl":"https://doi.org/10.1145/3286915","url":null,"abstract":"","PeriodicalId":339857,"journal":{"name":"Proceedings of the 26th ACM international conference on Multimedia","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116598278","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ChipGAN","authors":"Bin He, Feng Gao, Daiqian Ma, Boxin Shi, Ling-yu Duan","doi":"10.1145/3240508.3240655","DOIUrl":"https://doi.org/10.1145/3240508.3240655","url":null,"abstract":"Style transfer has been successfully applied to photos to generate realistic western paintings. However, because of the inherently different painting techniques adopted by Chinese and western paintings, directly applying existing methods does not generate satisfactory results for Chinese ink wash painting style transfer. This paper proposes ChipGAN, an end-to-end Generative Adversarial Network based architecture for photo to Chinese ink wash painting style transfer. The core modules of ChipGAN enforce three constraints -- voids, brush strokes, and ink wash tone and diffusion -- to address three key techniques commonly adopted in Chinese ink wash painting. We conduct a stylization perceptual study on the newly built Chinese ink wash photo and image dataset, consulting professional artists to score the similarity of generated paintings to real paintings. The advantage in visual quality over state-of-the-art networks, together with high stylization perceptual study scores, demonstrates the effectiveness of the proposed method.","PeriodicalId":339857,"journal":{"name":"Proceedings of the 26th ACM international conference on Multimedia","volume":"92 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121678958","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Online Action Tube Detection via Resolving the Spatio-temporal Context Pattern","authors":"Jingjia Huang, Nannan Li, Jia-Xing Zhong, Thomas H. Li, Ge Li","doi":"10.1145/3240508.3240659","DOIUrl":"https://doi.org/10.1145/3240508.3240659","url":null,"abstract":"At present, spatio-temporal action detection in video remains a challenging problem, given the complexity of backgrounds, the variety of actions, and changes of viewpoint in unconstrained environments. Most current approaches solve the problem in two steps: first detecting actions at each frame and then linking them, which neglects the continuity of the action and operates in an offline, batch-processing manner. In this paper, we attempt to build an online action detection model that exploits the spatio-temporal coherence existing among action regions when performing action category inference and position localization. Specifically, we seek to represent the spatio-temporal context pattern by establishing an encoder-decoder model based on a convolutional recurrent network. The model accepts a video snippet as input and encodes the dynamic information of the action in the forward pass. During the backward pass, it resolves such information at each time instant for action detection by fusing the current static or motion cue. Additionally, we propose an incremental action tube generation algorithm, which accomplishes action bounding-box association, action label determination and temporal trimming in a single pass. Our model takes appearance, motion or fused signals as input and is tested on two prevailing datasets, UCF-Sports and UCF-101. The experimental results demonstrate the effectiveness of our method, which achieves performance superior or comparable to existing approaches.","PeriodicalId":339857,"journal":{"name":"Proceedings of the 26th ACM international conference on Multimedia","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125253743","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}