{"title":"Align and Adapt: A Two-stage Adaptation Framework for Unsupervised Domain Adaptation","authors":"Yanting Yu, Yuchen Zhai, Yin Zhang","doi":"10.1145/3503161.3547973","DOIUrl":"https://doi.org/10.1145/3503161.3547973","url":null,"abstract":"Unsupervised domain adaptation aims to transfer knowledge from a labeled but heterogeneous source domain to an unlabeled target domain, alleviating the labeling efforts. Early advances in domain adaptation focus on invariant representations learning (IRL) methods to align domain distributions. Recent studies further utilize semi-supervised learning (SSL) methods to regularize domain-invariant representations based on the cluster assumption, making the category boundary more clear. However, the misalignment in the IRL methods might be intensified by SSL methods if the target instances are more proximate to the wrong source centroid, resulting in incompatibility between these techniques. In this paper, we hypothesize this phenomenon derives from the distraction of the source domain, and further give a novel two-stage adaptation framework to adapt the model toward the target domain. In addition, we propose DCAN to reduce the misalignment in IRL methods in the first stage, and we propose PCST to encode the semantic structure of unlabeled target data in the second stage. Extensive experiments demonstrate that our method outperforms current state-of-the-art methods on four benchmarks (Office-31, ImageCLEF-DA, Office-Home, and VisDA-2017).","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122200967","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Relative Pose Estimation for Multi-Camera Systems from Point Correspondences with Scale Ratio","authors":"Banglei Guan, Ji Zhao","doi":"10.1145/3503161.3547788","DOIUrl":"https://doi.org/10.1145/3503161.3547788","url":null,"abstract":"The use of multi-camera systems is becoming more common in self-driving cars, micro aerial vehicles or augmented reality headsets. In order to perform 3D geometric tasks, the accuracy and efficiency of relative pose estimation algorithms are very important for the multi-camera systems, and is catching significant research attention these days. The point coordinates of point correspondences (PCs) obtained from feature matching strategies have been widely used for relative pose estimation. This paper exploits known scale ratios besides the point coordinates, which are also intrinsically provided by scale invariant feature detectors (e.g., SIFT). Two-view geometry of scale ratio associated with the extracted features is derived for multi-camera systems. Thanks to the constraints provided by the scale ratio across two views, the number of PCs needed for relative pose estimation is reduced from 6 to 3. Requiring fewer PCs makes RANSAC-like randomized robust estimation significantly faster. For different point correspondence layouts, four minimal solvers are proposed for typical two-camera rigs. Extensive experiments demonstrate that our solvers have better accuracy than the state-of-the-art ones and outperform them in terms of processing time.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116143177","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Two-Stream Transformer for Multi-Label Image Classification","authors":"Xueling Zhu, Jiuxin Cao, Jiawei Ge, Weijia Liu, Bo Liu","doi":"10.1145/3503161.3548343","DOIUrl":"https://doi.org/10.1145/3503161.3548343","url":null,"abstract":"Multi-label image classification is a fundamental yet challenging task in computer vision that aims to identify multiple objects from a given image. Recent studies on this task mainly focus on learning cross-modal interactions between label semantics and high-level visual representations via an attention operation. However, these one-shot attention based approaches generally perform poorly in establishing accurate and robust alignments between vision and text due to the acknowledged semantic gap. In this paper, we propose a two-stream transformer (TSFormer) learning framework, in which the spatial stream focuses on extracting patch features with a global perception, while the semantic stream aims to learn vision-aware label semantics as well as their correlations via a multi-shot attention mechanism. Specifically, in each layer of TSFormer, a cross-modal attention module is developed to aggregate visual features from spatial stream into semantic stream and update label semantics via a residual connection. In this way, the semantic gap between two streams gradually narrows as the procedure progresses layer by layer, allowing the semantic stream to produce sophisticated visual representations for each label towards accurate label recognition. Extensive experiments on three visual benchmarks, including Pascal VOC 2007, Microsoft COCO and NUS-WIDE, consistently demonstrate that our proposed TSFormer achieves state-of-the-art performance on the multi-label image classification task.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"5 3","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120984151","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A3GAN: Attribute-Aware Anonymization Networks for Face De-identification","authors":"Liming Zhai, Qing Guo, Xiaofei Xie, L. Ma, Yi (Estelle) Wang, Yang Liu","doi":"10.1145/3503161.3547757","DOIUrl":"https://doi.org/10.1145/3503161.3547757","url":null,"abstract":"Face de-identification (De-ID) removes face identity information in face images to avoid personal privacy leakage. Existing face De-ID breaks the raw identity by cutting out the face regions and recovering the corrupted regions via deep generators, which inevitably affect the generation quality and cannot control generation results according to subsequent intelligent tasks (eg., facial expression recognition). In this work, for the first attempt, we think the face De-ID from the perspective of attribute editing and propose an attribute-aware anonymization network (A3GAN) by formulating face De-ID as a joint task of semantic suppression and controllable attribute injection. Intuitively, the semantic suppression removes the identity-sensitive information in embeddings while the controllable attribute injection automatically edits the raw face along the attributes that benefit De-ID. To this end, we first design a multi-scale semantic suppression network with a novel suppressive convolution unit (SCU), which can remove the face identity along multi-level deep features progressively. Then, we propose an attribute-aware injective network (AINet) that can generate De-ID-sensitive attributes in a controllable way (i.e., specifying which attributes can be changed and which cannot) and inject them into the latent code of the raw face. Moreover, to enable effective training, we design a new anonymization loss to let the injected attributes shift far away from the original ones. We perform comprehensive experiments on four datasets covering four different intelligent tasks including face verification, face detection, facial expression recognition, and fatigue detection, all of which demonstrate the superiority of our face De-ID over state-of-the-art methods.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"79 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124848286","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CAPTCHA the Flag: Interactive Plotter Livestream","authors":"Tiago Rorke","doi":"10.1145/3503161.3549961","DOIUrl":"https://doi.org/10.1145/3503161.3549961","url":null,"abstract":"A live diagram of locations identified from the IP addresses of participants. By solving a CAPTCHA participants can contribute to the drawing, holding the 'flag' until it is taken by somebody else. A collaborative plotter drawing, where participants contribute simply by being human and present on the network.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121452906","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"From Token to Word: OCR Token Evolution via Contrastive Learning and Semantic Matching for Text-VQA","authors":"Zanxia Jin, Mike Zheng Shou, Fang Zhou, Satoshi Tsutsui, Jingyan Qin, Xu-Cheng Yin","doi":"10.1145/3503161.3547977","DOIUrl":"https://doi.org/10.1145/3503161.3547977","url":null,"abstract":"Text-based Visual Question Answering (Text-VQA) is a question-answering task to understand scene text, where the text is usually recognized by Optical Character Recognition (OCR) systems. However, the text from OCR systems often includes spelling errors, such as \"pepsi\" being recognized as \"peosi\". These OCR errors are one of the major challenges for Text-VQA systems. To address this, we propose a novel Text-VQA method to alleviate OCR errors via OCR token evolution. First, we artificially create the misspelled OCR tokens in the training time, and make the system more robust to the OCR errors. To be specific, we propose an OCR Token-Word Contrastive (TWC) learning task, which pre-trains word representation by augmenting OCR tokens via the Levenshtein distance between the OCR tokens and words in a dictionary. Second, by assuming that the majority of characters in misspelled OCR tokens are still correct, a multimodal transformer is proposed and fine-tuned to predict the answer using character-based word embedding. Specifically, we introduce a vocabulary predictor with character-level semantic matching, which enables the model to recover the correct word from the vocabulary even with misspelled OCR tokens. A variety of experimental evaluations show that our method outperforms the state-of-the-art methods on both TextVQA and ST-VQA datasets. The code will be released at https://github.com/xiaojino/TWA.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125121484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive Anti-Bottleneck Multi-Modal Graph Learning Network for Personalized Micro-video Recommendation","authors":"Desheng Cai, Shengsheng Qian, Quan Fang, Jun Hu, Changsheng Xu","doi":"10.1145/3503161.3548420","DOIUrl":"https://doi.org/10.1145/3503161.3548420","url":null,"abstract":"Micro-video recommendation has attracted extensive research attention with the increasing popularity of micro-video sharing platforms. There exists a substantial amount of excellent efforts made to the micro-video recommendation task. Recently, homogeneous (or heterogeneous) GNN-based approaches utilize graph convolutional operators (or meta-path based similarity measures) to learn meaningful representations for users and micro-videos and show promising performance for the micro-video recommendation task. However, these methods may suffer from the following problems: (1) fail to aggregate information from distant or long-range nodes; (2) ignore the varying intensity of users' preferences for different items in micro-video recommendations; (3) neglect the similarities of multi-modal contents of micro-videos for recommendation tasks. In this paper, we propose a novel Adaptive Anti-Bottleneck Multi-Modal Graph Learning Network for personalized micro-video recommendation. Specifically, we design a collaborative representation learning module and a semantic representation learning module to fully exploit user-video interaction information and the similarities of micro-videos, respectively. Furthermore, we utilize an anti-bottleneck module to automatically learn the importance weights of short-range and long-range neighboring nodes to obtain more expressive representations of users and micro-videos. Finally, to consider the varying intensity of users' preferences for different micro-videos, we design and optimize an adaptive recommendation loss to train our model in an end-to-end manner. We evaluate our method on three real-world datasets and the results demonstrate that the proposed model outperforms the baselines.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"11 11","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113941702","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MADiMa'22: 7th International Workshop on Multimedia Assisted Dietary Management","authors":"S. Mougiakakou, G. Farinella, K. Yanai, D. Allegra","doi":"10.1145/3503161.3554771","DOIUrl":"https://doi.org/10.1145/3503161.3554771","url":null,"abstract":"This abstract provides a summary and overview of the 7th International Workshop on Multimedia Assisted Dietary Management. Related Workshop Proceedings are available in the ACM DL at: https://dl.acm.org/doi/proceedings/10.1145/3552484","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122812723","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SoftSkip","authors":"Dulanga Weerakoon, Vigneshwaran Subbaraju, Tuan Tran, Archan Misra","doi":"10.1145/3503161.3548432","DOIUrl":"https://doi.org/10.1145/3503161.3548432","url":null,"abstract":"Supporting real-time referring expression comprehension (REC) on pervasive devices is an important capability for human-AI collaborative tasks. Model pruning techniques, applied to DNN models, can enable real-time execution even on resource-constrained devices. However, existing pruning strategies are designed principally for uni-modal applications, and suffer a significant loss of accuracy when applied to REC tasks that require fusion of textual and visual inputs. We thus present a multi-modal pruning model, LGMDP, which uses language as a pivot to dynamically and judiciously select the relevant computational blocks that need to be executed. LGMDP also introduces a new SoftSkip mechanism, whereby 'skipped' visual scales are not completely eliminated but approximated with minimal additional computation. Experimental evaluation, using 3 benchmark REC datasets and an embedded device implementation, shows that LGMDP can achieve 33% latency savings, with an accuracy loss 0.5% - 2%.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122821444","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CLOP: Video-and-Language Pre-Training with Knowledge Regularizations","authors":"Guohao Li, Hu Yang, Feng He, Zhifan Feng, Yajuan Lyu, Hua Wu, Haifeng Wang","doi":"10.1145/3503161.3548346","DOIUrl":"https://doi.org/10.1145/3503161.3548346","url":null,"abstract":"Video-and-language pre-training has shown promising results for learning generalizable representations. Most existing approaches usually model video and text in an implicit manner, without considering explicit structural representations of the multi-modal content. We denote such form of representations as structural knowledge, which express rich semantics of multiple granularities. There are related works that propose object-aware approaches to inject similar knowledge as inputs. However, the existing methods usually fail to effectively utilize such knowledge as regularizations to shape a superior cross-modal representation space. To this end, we propose a Cross-modaL knOwledge-enhanced Pre-training (CLOP) method with Knowledge Regularizations. There are two key designs of ours: 1) a simple yet effective Structural Knowledge Prediction (SKP) task to pull together the latent representations of similar videos; and 2) a novel Knowledge-guided sampling approach for Contrastive Learning (KCL) to push apart cross-modal hard negative samples. We evaluate our method on four text-video retrieval tasks and one multi-choice QA task. The experiments show clear improvements, outperforming prior works by a substantial margin. Besides, we provide ablations and insights of how our methods affect the latent representation space, demonstrating the value of incorporating knowledge regularizations into video-and-language pre-training.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122862673","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}