Title: MAD '22 Workshop: Multimedia AI against Disinformation
Authors: B. Ionescu, Giorgos Kordopatis-Zilos, Adrian Daniel Popescu, L. Cuccovillo, S. Papadopoulos
DOI: https://doi.org/10.1145/3512527.3531440
Abstract: The verification of multimedia content posted online becomes increasingly challenging due to recent advancements in synthetic media manipulation and generation. Moreover, malicious actors can easily exploit AI technologies to spread disinformation across social media at a rapid pace, which poses very high risks for society and democracy. There is, therefore, an urgent need for AI-powered tools that facilitate the media verification process. The objective of the MAD '22 workshop is to bring together those who work on the broader topic of disinformation detection in multimedia in order to share their experiences and discuss their novel ideas, reaching out to people with different backgrounds and expertise. The research domains of interest range from the detection of manipulated and synthetic content in multimedia to the analysis of the spread of disinformation and its impact on society. The MAD '22 workshop proceedings are available at: https://dl.acm.org/citation.cfm?id=3512732.
Title: VAC-Net: Visual Attention Consistency Network for Person Re-identification
Authors: W. Shi, Yunzhou Zhang, Shangdong Zhu, Yixiu Liu, Sonya A. Coleman, D. Kerr
DOI: https://doi.org/10.1145/3512527.3531409
Abstract: Person re-identification (ReID) is a crucial aspect of recognising pedestrians across multiple surveillance cameras. Even though significant progress has been made in recent years, viewpoint changes and scale variations still affect model performance. In this paper, we observe that the model handles these issues better when its ability to extract consistent features across different transforms (e.g., flipping and scaling) of the same image is boosted. To this end, we propose a visual attention consistency network (VAC-Net). Specifically, we propose an Embedding Spatial Consistency (ESC) architecture that takes the flipped, scaled, and original forms of the same image as inputs to learn a consistent embedding space. Furthermore, we design an input-wise visual attention consistency loss (IW-loss) that aligns the class activation maps (CAMs) of the three transforms with each other, so that their high-level semantic information remains consistent. Finally, we propose a layer-wise visual attention consistency loss (LW-loss) that further enforces consistency between the semantic information at different stages and the CAMs within each branch. These two losses effectively help the model address viewpoint and scale variations. Experiments on the challenging Market-1501, DukeMTMC-reID, and MSMT17 datasets demonstrate the effectiveness of the proposed VAC-Net.
Title: Reproducibility Companion Paper: Human Object Interaction Detection via Multi-level Conditioned Network
Authors: Yunqing He, Xu Sun, Hui Jiang, Tongwei Ren, Gangshan Wu, M. Astefanoaei, Andreas Leibetseder
DOI: https://doi.org/10.1145/3512527.3531438
Abstract: To support the replication of "Human Object Interaction Detection via Multi-level Conditioned Network", which was presented at ICMR '20, this companion paper provides the details of the artifacts. Human Object Interaction Detection (HOID) aims to recognize fine-grained object-specific human actions, which demands the capabilities of both visual perception and reasoning. In this paper, we explain the file structure of the source code and publish the details of our experiment settings. We also provide a program for component analysis to assist other researchers with experiments on alternative models that are not included in our experiments. Moreover, we provide a demo program to facilitate the use of our model.
{"title":"SA-NAS-BFNR: Spatiotemporal Attention Neural Architecture Search for Task-based Brain Functional Network Representation","authors":"Fenxia Duan, Chunhong Cao, Xieping Gao","doi":"10.1145/3512527.3531421","DOIUrl":"https://doi.org/10.1145/3512527.3531421","url":null,"abstract":"The spatiotemporal representation of task-based brain functional networks is a key topic in functional magnetic resonance image (fMRI) research. At present, deep learning has been more powerful and flexible in brain functional network research than traditional methods. However, the dominant deep learning models failed in capturing the long-distance dependency (LDD) in task-based fMRI images (tfMRI) due to the time correlation among different task stimuli, the nature between temporal and spatial dimensions, which resulting in inaccurate brain pattern extraction. To address this issue, this paper proposes a spatiotemporal attention neural architecture search (NAS) model for task-based brain functional networks representation (SA-NAS-BFNR), where attention mechanism and gate recurrent unit (GRU) are integrated into a novel framework and GRU structure is searched by the differentiable neural architecture search. This model can not only achieve meaningful brain functional networks (BFNs) by addressing the LDD, but also simplify the existing recurrent structure models in tfMRI. Experiments show that the proposed model is capable of improving the fitting ability between time series and task stimulus sequence, and extracting the BFNs effectively as well.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121537548","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Self-Lifting: A Novel Framework for Unsupervised Voice-Face Association Learning","authors":"Guang Chen, Deyuan Zhang, Tao Liu, Xiaoyong Du","doi":"10.1145/3512527.3531364","DOIUrl":"https://doi.org/10.1145/3512527.3531364","url":null,"abstract":"Voice-face association learning (VFAL) aims to tap into the potential connections between voices and faces. Most studies currently address this problem in a supervised manner, which cannot exploit the wealth of unlabeled video data. To solve this problem, we propose an unsupervised learning framework: Self-Lifting (SL), which can use unlabeled video data for learning. This framework includes two iterative steps of \"clustering\" and \"metric learning\". In the first step, unlabeled video data is mapped into the feature space by a coarse model. Then unsupervised clustering is leveraged to allocate pseudo-label to each video. In the second step, the pseudo-label is used as supervisory information to guide the metric learning process, which produces the refined model. These two steps are performed alternately to lift the model's performance. Experiments show that our framework can effectively use unlabeled video data for learning. On the VoxCeleb dataset, our approach achieves SOTA results among the unsupervised methods and has competitive performance compared with the supervised competitors. Our code is released on Github.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"55 5","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113969478","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FedNKD: A Dependable Federated Learning Using Fine-tuned Random Noise and Knowledge Distillation","authors":"Shaoxiong Zhu, Q. Qi, Zirui Zhuang, Jingyu Wang, Haifeng Sun, J. Liao","doi":"10.1145/3512527.3531372","DOIUrl":"https://doi.org/10.1145/3512527.3531372","url":null,"abstract":"Multimedia retrieval models need the ability to extract useful information from large-scale data for clients. As an important part of multimedia retrieval, image classification model directly affects the efficiency and effect of multimedia retrieval. We need a lot of data to train a image classification model applied to multimedia retrieval task. However, with the protection of data privacy, the data used to train the model often needs to be kept on the client side. Federated learning is proposed to use data from all clients to train one model while protecting privacy. When federated learning is applied, the distribution of data across different clients varies greatly. Disregarding this problem yields a final model with unstable performance. To enable federated learning to work dependably in the real world with complex data environments, we propose FedNKD, which utilizes knowledge distillation and random noise. The superior knowledge of each client is distilled into a central server to mitigate the instablity caused by Non-IID data. Importantly, a synthetic dataset is created by some random noise through back propagation of neural networks. The synthetic dataset will contain the abstract features of the real data. Then we will use this synthetic dataset to realize the knowledge distillation while protecting users' privacy. In our experimental scenarios, FedNKD outperforms existing representative algorithms by about 1.5% in accuracy.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128271311","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Review of Deep Learning Models for Spine Segmentation
Authors: Neng Zhou, Hairu Wen, Yi Wang, Yang Liu, Longfei Zhou
DOI: https://doi.org/10.1145/3512527.3531356
Abstract: Medical image segmentation has been a long-standing challenge due to the limited availability of labeled datasets and the presence of noise and artifacts. In recent years, deep learning has made steady progress in this field, with automatic segmentation gradually catching up with manual segmentation. In this paper, we select twelve state-of-the-art models and compare their performance on a spine MRI segmentation task. We divide them into two categories. The first is the U-Net family, including U-Net, Attention U-Net, ResUNet++, TransUNet, and MiniSeg; these models are built on the encoder-decoder structure, and their innovations generally lie in how low-level and high-level information is fused. Models in the second category, which we refer to as models using a backbone, typically use ResNet, Res2Net, or other models pre-trained on ImageNet as the backbone for feature extraction, and pay more attention to capturing multi-scale and rich contextual information. All models are trained and tested on an open-source spine MRI dataset with 20 labels, without pre-training. In our comparison, the models using a backbone outperform the U-Net family, and DeepLabv3+ works best. We therefore suggest that extracting multi-scale information is also necessary in multi-label medical segmentation tasks.
{"title":"Video2Subtitle: Matching Weakly-Synchronized Sequences via Dynamic Temporal Alignment","authors":"Ben Xue, Chenchen Liu, Yadong Mu","doi":"10.1145/3512527.3531371","DOIUrl":"https://doi.org/10.1145/3512527.3531371","url":null,"abstract":"This paper investigates a new research task in multimedia analysis, dubbed as Video2Subtitle. The goal of this task is to finding the most plausible subtitle from a large pool for a querying video clip. We assume that the temporal duration of each sentence in a subtitle is unknown. Compared with existing cross-modal matching tasks, the proposed Video2Subtitle confronts several new challenges. In particular, video frames / subtitle sentences are temporally ordered, respectively, yet no precise synchronization is available. This casts Video2Subtitle into a problem of matching weakly-synchronized sequences. In this work, our technical contributions are two-fold. First, we construct a large-scale benchmark for the Video2Subtitle task. It consists of about 100K video clip / subtitle pairs with a full duration of 759 hours. All data are automatically trimmed from conversational sub-parts of movies and youtube videos. Secondly, an ideal algorithm for tackling Video2Subtitle requires both temporal synchronization of the visual / textual sequences, but also strong semantic consistency between two modalities. To this end, we propose a novel algorithm with the key traits of heterogeneous multi-cue fusion and dynamic temporal alignment. The proposed method demonstrates excellent performances in comparison with several state-of-the-art cross-modal matching methods. Additionally, we also depict a few interesting applications of Video2Subtitle, such as re-generating subtitle for given videos.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134259419","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Multi-Modal Contrastive Pre-training for Recommendation
Authors: Zhuang Liu, Yunpu Ma, Matthias Schubert, Y. Ouyang, Zhang Xiong
DOI: https://doi.org/10.1145/3512527.3531378
Abstract: Personalized recommendation plays a central role in various online applications. To provide a quality recommendation service, it is crucial to consider the multi-modal information associated with users and items, e.g., review text, description text, and images. However, many existing approaches do not fully explore and fuse multiple modalities. To address this problem, we propose a multi-modal contrastive pre-training model for recommendation. We first construct a homogeneous item graph and a user graph based on co-interaction relationships. For users, we propose intra-modal aggregation and inter-modal aggregation to fuse review texts with the structural information of the user graph. For items, we consider three modalities: description text, images, and the item graph. The description text and image of the same item complement each other, so each can serve as promising supervision for the other. To capture this signal and better exploit the potential correlation between modalities, we propose a self-supervised contrastive inter-modal alignment task that makes the textual and visual modalities as similar as possible. We then apply inter-modal aggregation to obtain the multi-modal representation of items. Next, we employ a binary cross-entropy loss function to capture the potential correlation between users and items. Finally, we fine-tune the pre-trained multi-modal representations using an existing recommendation model. Extensive experiments on three real-world datasets verify the rationality and effectiveness of the proposed method.
{"title":"Weakly-supervised Cerebrovascular Segmentation Network with Shape Prior and Model Indicator","authors":"Qianrun Wu, Yufei Chen, Ning Huang, Xiaodong Yue","doi":"10.1145/3512527.3531377","DOIUrl":"https://doi.org/10.1145/3512527.3531377","url":null,"abstract":"Labeling cerebral vessels requires domain knowledge in neurology and could be extremely laborious, and there is a scarcity of public annotated cerebrovascular datasets. Traditional machine learning or statistical models could yield decent results on thick vessels with high contrast while having poor performance on those regions of low contrast. In our work, we employ a statistic model as noisy labels and propose a Transformer-based architecture which utilizes Hessian shape prior as soft supervision. It enhances the learning ability of the network to tubular structures, so that the model can make more accurate predictions on refined cerebrovascular segmentation. Furthermore, to combat the overfitting towards noisy labels as model training, we introduce an effective label extension strategy that only calls for a few manual strokes on one sample. These supplementary labels are not used for supervision but only as an indicator to tell where the model keeps the most generalization capability, so as to further guide the model selection in validation. Our experiments are carried out on a public TOF-MRA dataset from MIDAS data platform, and the results demonstrate that our method shows superior performance on cerebrovascular segmentation which achieves Dice of 0.831±0.040 in the dataset.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121043427","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}