{"title":"Align and Adapt: A Two-stage Adaptation Framework for Unsupervised Domain Adaptation","authors":"Yanting Yu, Yuchen Zhai, Yin Zhang","doi":"10.1145/3503161.3547973","DOIUrl":"https://doi.org/10.1145/3503161.3547973","url":null,"abstract":"Unsupervised domain adaptation aims to transfer knowledge from a labeled but heterogeneous source domain to an unlabeled target domain, alleviating the labeling efforts. Early advances in domain adaptation focus on invariant representations learning (IRL) methods to align domain distributions. Recent studies further utilize semi-supervised learning (SSL) methods to regularize domain-invariant representations based on the cluster assumption, making the category boundary more clear. However, the misalignment in the IRL methods might be intensified by SSL methods if the target instances are more proximate to the wrong source centroid, resulting in incompatibility between these techniques. In this paper, we hypothesize this phenomenon derives from the distraction of the source domain, and further give a novel two-stage adaptation framework to adapt the model toward the target domain. In addition, we propose DCAN to reduce the misalignment in IRL methods in the first stage, and we propose PCST to encode the semantic structure of unlabeled target data in the second stage. Extensive experiments demonstrate that our method outperforms current state-of-the-art methods on four benchmarks (Office-31, ImageCLEF-DA, Office-Home, and VisDA-2017).","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122200967","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Relative Pose Estimation for Multi-Camera Systems from Point Correspondences with Scale Ratio","authors":"Banglei Guan, Ji Zhao","doi":"10.1145/3503161.3547788","DOIUrl":"https://doi.org/10.1145/3503161.3547788","url":null,"abstract":"The use of multi-camera systems is becoming more common in self-driving cars, micro aerial vehicles or augmented reality headsets. In order to perform 3D geometric tasks, the accuracy and efficiency of relative pose estimation algorithms are very important for the multi-camera systems, and is catching significant research attention these days. The point coordinates of point correspondences (PCs) obtained from feature matching strategies have been widely used for relative pose estimation. This paper exploits known scale ratios besides the point coordinates, which are also intrinsically provided by scale invariant feature detectors (e.g., SIFT). Two-view geometry of scale ratio associated with the extracted features is derived for multi-camera systems. Thanks to the constraints provided by the scale ratio across two views, the number of PCs needed for relative pose estimation is reduced from 6 to 3. Requiring fewer PCs makes RANSAC-like randomized robust estimation significantly faster. For different point correspondence layouts, four minimal solvers are proposed for typical two-camera rigs. Extensive experiments demonstrate that our solvers have better accuracy than the state-of-the-art ones and outperform them in terms of processing time.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116143177","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Two-Stream Transformer for Multi-Label Image Classification","authors":"Xueling Zhu, Jiuxin Cao, Jiawei Ge, Weijia Liu, Bo Liu","doi":"10.1145/3503161.3548343","DOIUrl":"https://doi.org/10.1145/3503161.3548343","url":null,"abstract":"Multi-label image classification is a fundamental yet challenging task in computer vision that aims to identify multiple objects from a given image. Recent studies on this task mainly focus on learning cross-modal interactions between label semantics and high-level visual representations via an attention operation. However, these one-shot attention based approaches generally perform poorly in establishing accurate and robust alignments between vision and text due to the acknowledged semantic gap. In this paper, we propose a two-stream transformer (TSFormer) learning framework, in which the spatial stream focuses on extracting patch features with a global perception, while the semantic stream aims to learn vision-aware label semantics as well as their correlations via a multi-shot attention mechanism. Specifically, in each layer of TSFormer, a cross-modal attention module is developed to aggregate visual features from spatial stream into semantic stream and update label semantics via a residual connection. In this way, the semantic gap between two streams gradually narrows as the procedure progresses layer by layer, allowing the semantic stream to produce sophisticated visual representations for each label towards accurate label recognition. Extensive experiments on three visual benchmarks, including Pascal VOC 2007, Microsoft COCO and NUS-WIDE, consistently demonstrate that our proposed TSFormer achieves state-of-the-art performance on the multi-label image classification task.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"5 3","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120984151","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A3GAN: Attribute-Aware Anonymization Networks for Face De-identification","authors":"Liming Zhai, Qing Guo, Xiaofei Xie, L. Ma, Yi (Estelle) Wang, Yang Liu","doi":"10.1145/3503161.3547757","DOIUrl":"https://doi.org/10.1145/3503161.3547757","url":null,"abstract":"Face de-identification (De-ID) removes face identity information in face images to avoid personal privacy leakage. Existing face De-ID breaks the raw identity by cutting out the face regions and recovering the corrupted regions via deep generators, which inevitably affect the generation quality and cannot control generation results according to subsequent intelligent tasks (eg., facial expression recognition). In this work, for the first attempt, we think the face De-ID from the perspective of attribute editing and propose an attribute-aware anonymization network (A3GAN) by formulating face De-ID as a joint task of semantic suppression and controllable attribute injection. Intuitively, the semantic suppression removes the identity-sensitive information in embeddings while the controllable attribute injection automatically edits the raw face along the attributes that benefit De-ID. To this end, we first design a multi-scale semantic suppression network with a novel suppressive convolution unit (SCU), which can remove the face identity along multi-level deep features progressively. Then, we propose an attribute-aware injective network (AINet) that can generate De-ID-sensitive attributes in a controllable way (i.e., specifying which attributes can be changed and which cannot) and inject them into the latent code of the raw face. Moreover, to enable effective training, we design a new anonymization loss to let the injected attributes shift far away from the original ones. We perform comprehensive experiments on four datasets covering four different intelligent tasks including face verification, face detection, facial expression recognition, and fatigue detection, all of which demonstrate the superiority of our face De-ID over state-of-the-art methods.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"79 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124848286","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CAPTCHA the Flag: Interactive Plotter Livestream","authors":"Tiago Rorke","doi":"10.1145/3503161.3549961","DOIUrl":"https://doi.org/10.1145/3503161.3549961","url":null,"abstract":"A live diagram of locations identified from the IP addresses of participants. By solving a CAPTCHA participants can contribute to the drawing, holding the 'flag' until it is taken by somebody else. A collaborative plotter drawing, where participants contribute simply by being human and present on the network.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121452906","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"From Token to Word: OCR Token Evolution via Contrastive Learning and Semantic Matching for Text-VQA","authors":"Zanxia Jin, Mike Zheng Shou, Fang Zhou, Satoshi Tsutsui, Jingyan Qin, Xu-Cheng Yin","doi":"10.1145/3503161.3547977","DOIUrl":"https://doi.org/10.1145/3503161.3547977","url":null,"abstract":"Text-based Visual Question Answering (Text-VQA) is a question-answering task to understand scene text, where the text is usually recognized by Optical Character Recognition (OCR) systems. However, the text from OCR systems often includes spelling errors, such as \"pepsi\" being recognized as \"peosi\". These OCR errors are one of the major challenges for Text-VQA systems. To address this, we propose a novel Text-VQA method to alleviate OCR errors via OCR token evolution. First, we artificially create the misspelled OCR tokens in the training time, and make the system more robust to the OCR errors. To be specific, we propose an OCR Token-Word Contrastive (TWC) learning task, which pre-trains word representation by augmenting OCR tokens via the Levenshtein distance between the OCR tokens and words in a dictionary. Second, by assuming that the majority of characters in misspelled OCR tokens are still correct, a multimodal transformer is proposed and fine-tuned to predict the answer using character-based word embedding. Specifically, we introduce a vocabulary predictor with character-level semantic matching, which enables the model to recover the correct word from the vocabulary even with misspelled OCR tokens. A variety of experimental evaluations show that our method outperforms the state-of-the-art methods on both TextVQA and ST-VQA datasets. The code will be released at https://github.com/xiaojino/TWA.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125121484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive Anti-Bottleneck Multi-Modal Graph Learning Network for Personalized Micro-video Recommendation","authors":"Desheng Cai, Shengsheng Qian, Quan Fang, Jun Hu, Changsheng Xu","doi":"10.1145/3503161.3548420","DOIUrl":"https://doi.org/10.1145/3503161.3548420","url":null,"abstract":"Micro-video recommendation has attracted extensive research attention with the increasing popularity of micro-video sharing platforms. There exists a substantial amount of excellent efforts made to the micro-video recommendation task. Recently, homogeneous (or heterogeneous) GNN-based approaches utilize graph convolutional operators (or meta-path based similarity measures) to learn meaningful representations for users and micro-videos and show promising performance for the micro-video recommendation task. However, these methods may suffer from the following problems: (1) fail to aggregate information from distant or long-range nodes; (2) ignore the varying intensity of users' preferences for different items in micro-video recommendations; (3) neglect the similarities of multi-modal contents of micro-videos for recommendation tasks. In this paper, we propose a novel Adaptive Anti-Bottleneck Multi-Modal Graph Learning Network for personalized micro-video recommendation. Specifically, we design a collaborative representation learning module and a semantic representation learning module to fully exploit user-video interaction information and the similarities of micro-videos, respectively. Furthermore, we utilize an anti-bottleneck module to automatically learn the importance weights of short-range and long-range neighboring nodes to obtain more expressive representations of users and micro-videos. Finally, to consider the varying intensity of users' preferences for different micro-videos, we design and optimize an adaptive recommendation loss to train our model in an end-to-end manner. We evaluate our method on three real-world datasets and the results demonstrate that the proposed model outperforms the baselines.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"11 11","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113941702","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MADiMa'22: 7th International Workshop on Multimedia Assisted Dietary Management","authors":"S. Mougiakakou, G. Farinella, K. Yanai, D. Allegra","doi":"10.1145/3503161.3554771","DOIUrl":"https://doi.org/10.1145/3503161.3554771","url":null,"abstract":"This abstract provides a summary and overview of the 7th International Workshop on Multimedia Assisted Dietary Management. Related Workshop Proceedings are available in the ACM DL at: https://dl.acm.org/doi/proceedings/10.1145/3552484","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122812723","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SoftSkip","authors":"Dulanga Weerakoon, Vigneshwaran Subbaraju, Tuan Tran, Archan Misra","doi":"10.1145/3503161.3548432","DOIUrl":"https://doi.org/10.1145/3503161.3548432","url":null,"abstract":"Supporting real-time referring expression comprehension (REC) on pervasive devices is an important capability for human-AI collaborative tasks. Model pruning techniques, applied to DNN models, can enable real-time execution even on resource-constrained devices. However, existing pruning strategies are designed principally for uni-modal applications, and suffer a significant loss of accuracy when applied to REC tasks that require fusion of textual and visual inputs. We thus present a multi-modal pruning model, LGMDP, which uses language as a pivot to dynamically and judiciously select the relevant computational blocks that need to be executed. LGMDP also introduces a new SoftSkip mechanism, whereby 'skipped' visual scales are not completely eliminated but approximated with minimal additional computation. Experimental evaluation, using 3 benchmark REC datasets and an embedded device implementation, shows that LGMDP can achieve 33% latency savings, with an accuracy loss 0.5% - 2%.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122821444","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CLOP: Video-and-Language Pre-Training with Knowledge Regularizations","authors":"Guohao Li, Hu Yang, Feng He, Zhifan Feng, Yajuan Lyu, Hua Wu, Haifeng Wang","doi":"10.1145/3503161.3548346","DOIUrl":"https://doi.org/10.1145/3503161.3548346","url":null,"abstract":"Video-and-language pre-training has shown promising results for learning generalizable representations. Most existing approaches usually model video and text in an implicit manner, without considering explicit structural representations of the multi-modal content. We denote such form of representations as structural knowledge, which express rich semantics of multiple granularities. There are related works that propose object-aware approaches to inject similar knowledge as inputs. However, the existing methods usually fail to effectively utilize such knowledge as regularizations to shape a superior cross-modal representation space. To this end, we propose a Cross-modaL knOwledge-enhanced Pre-training (CLOP) method with Knowledge Regularizations. There are two key designs of ours: 1) a simple yet effective Structural Knowledge Prediction (SKP) task to pull together the latent representations of similar videos; and 2) a novel Knowledge-guided sampling approach for Contrastive Learning (KCL) to push apart cross-modal hard negative samples. We evaluate our method on four text-video retrieval tasks and one multi-choice QA task. The experiments show clear improvements, outperforming prior works by a substantial margin. Besides, we provide ablations and insights of how our methods affect the latent representation space, demonstrating the value of incorporating knowledge regularizations into video-and-language pre-training.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122862673","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}