Title: MAD '22 Workshop: Multimedia AI against Disinformation
Authors: B. Ionescu, Giorgos Kordopatis-Zilos, Adrian Daniel Popescu, L. Cuccovillo, S. Papadopoulos
DOI: https://doi.org/10.1145/3512527.3531440
Abstract: The verification of multimedia content posted online becomes increasingly challenging due to recent advancements in synthetic media manipulation and generation. Moreover, malicious actors can easily exploit AI technologies to spread disinformation across social media at a rapid pace, which poses very high risks for society and democracy. There is, therefore, an urgent need for AI-powered tools that facilitate the media verification process. The objective of the MAD '22 workshop is to bring together those who work on the broader topic of disinformation detection in multimedia in order to share their experiences and discuss their novel ideas, reaching out to people with different backgrounds and expertise. The research domains of interest range from the detection of manipulated and synthetic content in multimedia to the analysis of the spread of disinformation and its impact on society. The MAD '22 workshop proceedings are available at: https://dl.acm.org/citation.cfm?id=3512732.
Title: VAC-Net: Visual Attention Consistency Network for Person Re-identification
Authors: W. Shi, Yunzhou Zhang, Shangdong Zhu, Yixiu Liu, Sonya A. Coleman, D. Kerr
DOI: https://doi.org/10.1145/3512527.3531409
Abstract: Person re-identification (ReID) is a crucial aspect of recognising pedestrians across multiple surveillance cameras. Even though significant progress has been made in recent years, viewpoint changes and scale variations still affect model performance. In this paper, we observe that the model handles these issues better when its ability to extract consistent features across different transforms (e.g., flipping and scaling) of the same image is boosted. To this end, we propose a visual attention consistency network (VAC-Net). Specifically, we propose an Embedding Spatial Consistency (ESC) architecture that takes the flipped, scaled, and original forms of the same image as inputs to learn a consistent embedding space. Furthermore, we design an input-wise visual attention consistency loss (IW-loss) that aligns the class activation maps (CAMs) of the three transforms with each other, so that their high-level semantic information remains consistent. Finally, we propose a layer-wise visual attention consistency loss (LW-loss) that further enforces consistency between the semantic information at different stages and the CAMs within each branch. These two losses effectively help the model address viewpoint and scale variations. Experiments on the challenging Market-1501, DukeMTMC-reID, and MSMT17 datasets demonstrate the effectiveness of the proposed VAC-Net.
Title: Reproducibility Companion Paper: Human Object Interaction Detection via Multi-level Conditioned Network
Authors: Yunqing He, Xu Sun, Hui Jiang, Tongwei Ren, Gangshan Wu, M. Astefanoaei, Andreas Leibetseder
DOI: https://doi.org/10.1145/3512527.3531438
Abstract: To support the replication of "Human Object Interaction Detection via Multi-level Conditioned Network", which was presented at ICMR '20, this companion paper provides the details of the artifacts. Human Object Interaction Detection (HOID) aims to recognize fine-grained object-specific human actions, which demands the capabilities of both visual perception and reasoning. In this paper, we explain the file structure of the source code and publish the details of our experiment settings. We also provide a program for component analysis to assist other researchers with experiments on alternative models that are not included in our experiments. Moreover, we provide a demo program to facilitate the use of our model.
{"title":"SA-NAS-BFNR: Spatiotemporal Attention Neural Architecture Search for Task-based Brain Functional Network Representation","authors":"Fenxia Duan, Chunhong Cao, Xieping Gao","doi":"10.1145/3512527.3531421","DOIUrl":"https://doi.org/10.1145/3512527.3531421","url":null,"abstract":"The spatiotemporal representation of task-based brain functional networks is a key topic in functional magnetic resonance image (fMRI) research. At present, deep learning has been more powerful and flexible in brain functional network research than traditional methods. However, the dominant deep learning models failed in capturing the long-distance dependency (LDD) in task-based fMRI images (tfMRI) due to the time correlation among different task stimuli, the nature between temporal and spatial dimensions, which resulting in inaccurate brain pattern extraction. To address this issue, this paper proposes a spatiotemporal attention neural architecture search (NAS) model for task-based brain functional networks representation (SA-NAS-BFNR), where attention mechanism and gate recurrent unit (GRU) are integrated into a novel framework and GRU structure is searched by the differentiable neural architecture search. This model can not only achieve meaningful brain functional networks (BFNs) by addressing the LDD, but also simplify the existing recurrent structure models in tfMRI. Experiments show that the proposed model is capable of improving the fitting ability between time series and task stimulus sequence, and extracting the BFNs effectively as well.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121537548","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Self-Lifting: A Novel Framework for Unsupervised Voice-Face Association Learning","authors":"Guang Chen, Deyuan Zhang, Tao Liu, Xiaoyong Du","doi":"10.1145/3512527.3531364","DOIUrl":"https://doi.org/10.1145/3512527.3531364","url":null,"abstract":"Voice-face association learning (VFAL) aims to tap into the potential connections between voices and faces. Most studies currently address this problem in a supervised manner, which cannot exploit the wealth of unlabeled video data. To solve this problem, we propose an unsupervised learning framework: Self-Lifting (SL), which can use unlabeled video data for learning. This framework includes two iterative steps of \"clustering\" and \"metric learning\". In the first step, unlabeled video data is mapped into the feature space by a coarse model. Then unsupervised clustering is leveraged to allocate pseudo-label to each video. In the second step, the pseudo-label is used as supervisory information to guide the metric learning process, which produces the refined model. These two steps are performed alternately to lift the model's performance. Experiments show that our framework can effectively use unlabeled video data for learning. On the VoxCeleb dataset, our approach achieves SOTA results among the unsupervised methods and has competitive performance compared with the supervised competitors. Our code is released on Github.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"55 5","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113969478","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FedNKD: A Dependable Federated Learning Using Fine-tuned Random Noise and Knowledge Distillation","authors":"Shaoxiong Zhu, Q. Qi, Zirui Zhuang, Jingyu Wang, Haifeng Sun, J. Liao","doi":"10.1145/3512527.3531372","DOIUrl":"https://doi.org/10.1145/3512527.3531372","url":null,"abstract":"Multimedia retrieval models need the ability to extract useful information from large-scale data for clients. As an important part of multimedia retrieval, image classification model directly affects the efficiency and effect of multimedia retrieval. We need a lot of data to train a image classification model applied to multimedia retrieval task. However, with the protection of data privacy, the data used to train the model often needs to be kept on the client side. Federated learning is proposed to use data from all clients to train one model while protecting privacy. When federated learning is applied, the distribution of data across different clients varies greatly. Disregarding this problem yields a final model with unstable performance. To enable federated learning to work dependably in the real world with complex data environments, we propose FedNKD, which utilizes knowledge distillation and random noise. The superior knowledge of each client is distilled into a central server to mitigate the instablity caused by Non-IID data. Importantly, a synthetic dataset is created by some random noise through back propagation of neural networks. The synthetic dataset will contain the abstract features of the real data. Then we will use this synthetic dataset to realize the knowledge distillation while protecting users' privacy. In our experimental scenarios, FedNKD outperforms existing representative algorithms by about 1.5% in accuracy.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128271311","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Review of Deep Learning Models for Spine Segmentation
Authors: Neng Zhou, Hairu Wen, Yi Wang, Yang Liu, Longfei Zhou
DOI: https://doi.org/10.1145/3512527.3531356
Abstract: Medical image segmentation has been a long-standing challenge due to the limited availability of labeled datasets and the presence of noise and artifacts. In recent years, deep learning has made steady progress in this field, with automatic segmentation gradually catching up with manual segmentation. In this paper, we select twelve state-of-the-art models and compare their performance on a spine MRI segmentation task. We divide them into two categories. The first is the U-Net family, including U-Net, Attention U-Net, ResUNet++, TransUNet, and MiniSeg; these models are built on the encoder-decoder structure, and their innovations generally lie in how low-level and high-level information is fused. Models in the second category, which we refer to as models using a backbone, typically use ResNet, Res2Net, or other models pre-trained on ImageNet as the backbone for feature extraction, and pay more attention to capturing multi-scale and rich contextual information. All models are trained and tested on an open-source spine MRI dataset with 20 labels, without pre-training. In our comparison, the models using a backbone outperform the U-Net family, and DeepLabv3+ works best. We therefore suggest that extracting multi-scale information is also necessary in multi-label medical segmentation tasks.
{"title":"Video2Subtitle: Matching Weakly-Synchronized Sequences via Dynamic Temporal Alignment","authors":"Ben Xue, Chenchen Liu, Yadong Mu","doi":"10.1145/3512527.3531371","DOIUrl":"https://doi.org/10.1145/3512527.3531371","url":null,"abstract":"This paper investigates a new research task in multimedia analysis, dubbed as Video2Subtitle. The goal of this task is to finding the most plausible subtitle from a large pool for a querying video clip. We assume that the temporal duration of each sentence in a subtitle is unknown. Compared with existing cross-modal matching tasks, the proposed Video2Subtitle confronts several new challenges. In particular, video frames / subtitle sentences are temporally ordered, respectively, yet no precise synchronization is available. This casts Video2Subtitle into a problem of matching weakly-synchronized sequences. In this work, our technical contributions are two-fold. First, we construct a large-scale benchmark for the Video2Subtitle task. It consists of about 100K video clip / subtitle pairs with a full duration of 759 hours. All data are automatically trimmed from conversational sub-parts of movies and youtube videos. Secondly, an ideal algorithm for tackling Video2Subtitle requires both temporal synchronization of the visual / textual sequences, but also strong semantic consistency between two modalities. To this end, we propose a novel algorithm with the key traits of heterogeneous multi-cue fusion and dynamic temporal alignment. The proposed method demonstrates excellent performances in comparison with several state-of-the-art cross-modal matching methods. Additionally, we also depict a few interesting applications of Video2Subtitle, such as re-generating subtitle for given videos.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134259419","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Multi-Modal Contrastive Pre-training for Recommendation
Authors: Zhuang Liu, Yunpu Ma, Matthias Schubert, Y. Ouyang, Zhang Xiong
DOI: https://doi.org/10.1145/3512527.3531378
Abstract: Personalized recommendation plays a central role in various online applications. To provide a quality recommendation service, it is crucial to consider the multi-modal information associated with users and items, e.g., review text, description text, and images. However, many existing approaches do not fully explore and fuse multiple modalities. To address this problem, we propose a multi-modal contrastive pre-training model for recommendation. We first construct a homogeneous item graph and a user graph based on co-interaction relationships. For users, we propose intra-modal aggregation and inter-modal aggregation to fuse review texts with the structural information of the user graph. For items, we consider three modalities: description text, images, and the item graph. The description text and image of the same item complement each other, so each can serve as promising supervision for the other. To capture this signal and better exploit the potential correlation between modalities, we propose a self-supervised contrastive inter-modal alignment task that makes the textual and visual modalities as similar as possible. We then apply inter-modal aggregation to obtain the multi-modal representation of items. Next, we employ a binary cross-entropy loss function to capture the potential correlation between users and items. Finally, we fine-tune the pre-trained multi-modal representations using an existing recommendation model. Extensive experiments on three real-world datasets verify the rationality and effectiveness of the proposed method.
{"title":"Weakly-supervised Cerebrovascular Segmentation Network with Shape Prior and Model Indicator","authors":"Qianrun Wu, Yufei Chen, Ning Huang, Xiaodong Yue","doi":"10.1145/3512527.3531377","DOIUrl":"https://doi.org/10.1145/3512527.3531377","url":null,"abstract":"Labeling cerebral vessels requires domain knowledge in neurology and could be extremely laborious, and there is a scarcity of public annotated cerebrovascular datasets. Traditional machine learning or statistical models could yield decent results on thick vessels with high contrast while having poor performance on those regions of low contrast. In our work, we employ a statistic model as noisy labels and propose a Transformer-based architecture which utilizes Hessian shape prior as soft supervision. It enhances the learning ability of the network to tubular structures, so that the model can make more accurate predictions on refined cerebrovascular segmentation. Furthermore, to combat the overfitting towards noisy labels as model training, we introduce an effective label extension strategy that only calls for a few manual strokes on one sample. These supplementary labels are not used for supervision but only as an indicator to tell where the model keeps the most generalization capability, so as to further guide the model selection in validation. Our experiments are carried out on a public TOF-MRA dataset from MIDAS data platform, and the results demonstrate that our method shows superior performance on cerebrovascular segmentation which achieves Dice of 0.831±0.040 in the dataset.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121043427","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}