Title: Accelerated Sign Hunter: A Sign-based Black-box Attack via Branch-Prune Strategy and Stabilized Hierarchical Search
Authors: S. Li, Guangji Huang, Xing Xu, Yang Yang, Fumin Shen
In: Proceedings of the 2022 International Conference on Multimedia Retrieval, 2022-06-27. DOI: 10.1145/3512527.3531399
Abstract: We propose the Accelerated Sign Hunter (ASH), a sign-based black-box attack under an l∞ constraint. The proposed method searches for an approximate sign of the gradient of the loss w.r.t. the input image with few queries to the target model and crafts the adversarial example by updating the input image in this direction. It applies a Branch-Prune Strategy that infers unknown sign bits from the ones already checked to avoid unnecessary queries, and it adopts a Stabilized Hierarchical Search to achieve better performance within a limited query budget. We provide a theoretical proof showing that the Accelerated Sign Hunter halves the queries without dropping the attack success rate (SR) compared with the state-of-the-art sign-based black-box attack. Extensive experiments also demonstrate the superiority of our ASH method over other black-box attacks. In particular, on Inception-v3 for ImageNet, our method achieves an SR of 0.989 with an average of 338.56 queries, which is 1/4 fewer than the state-of-the-art sign-based attack requires to achieve the same SR. Moreover, our ASH method works out of the box, since no hyperparameters need to be tuned.

Title: ICDAR'22: Intelligent Cross-Data Analysis and Retrieval
Authors: Minh-Son Dao, M. Riegler, Duc-Tien Dang-Nguyen, C. Gurrin, Yuta Nakashima, M. Dong
In: Proceedings of the 2022 International Conference on Multimedia Retrieval, 2022-06-27. DOI: 10.1145/3512527.3531441
Abstract: We have recently witnessed the rise of cross-data problems alongside multimodal data problems. Examples of this research direction include cross-modal retrieval systems that use a textual query to look for images, air quality indices predicted from lifelogging images, congestion predicted from weather and tweet data, and sleep quality predicted from daily exercises and meals. Although many investigations of multimodal data analytics have been carried out, little research has addressed cross-data settings (e.g., cross-modal, cross-domain, cross-platform). To promote research on intelligent cross-data analytics and retrieval and to help bring about a smart, sustainable society, we introduce the article collection on "Intelligent Cross-Data Analysis and Retrieval". This Research Topic welcomes contributors from diverse research domains and disciplines such as well-being, disaster prevention and mitigation, mobility, climate change, tourism, healthcare, and food computing.
{"title":"Dual-Channel Localization Networks for Moment Retrieval with Natural Language","authors":"Bolin Zhang, Bin Jiang, Chao Yang, Liang Pang","doi":"10.1145/3512527.3531394","DOIUrl":"https://doi.org/10.1145/3512527.3531394","url":null,"abstract":"According to the given natural language query, moment retrieval aims to localize the most relevant moment in an untrimmed video. The existing solutions for this problem can be roughly divided into two categories based on whether candidate moments are generated: i) Moment-based approach: It pre-cuts the video into a set of candidate moments, performs multimodal fusion, and evaluates matching scores with the query. ii) Clip-based approach: It directly aligns video clips and query with predicting matching scores without generating candidate moments. Both frameworks have respective shortcomings: the moment-based models suffer from heavy computations, while the performance of clip-based models is familiarly inferior to moment-based counterparts. To this end, we design an intuitive and efficient Dual-Channel Localization Network (DCLN) to balance computational cost and retrieval performance. For reducing computational cost, we capture the temporal relations of only a few video moments with the same start or end boundary in the proposed dual-channel structure. The start or end channel map index represents the corresponding video moment's start or end time boundary. For improving model performance, we apply the proposed dual-channel localization network to efficiently encode the temporal relations on the dual-channel map and learn discriminative features to distinguish the matching degree between natural language query and video moments. The extensive experiments on two standard benchmarks demonstrate the effectiveness of our proposed method.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"264 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122468651","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Title: Motor Learning based on Presentation of a Tentative Goal
Authors: S. Sun, Yongqing Sun, Mitsuhiro Goto, Shigekuni Kondo, Dan Mikami, Susumu Yamamoto
In: Proceedings of the 2022 International Conference on Multimedia Retrieval, 2022-06-27. DOI: 10.1145/3512527.3531413
Abstract: This paper presents a motor learning method based on the presentation of a personalized target motion, which we call a tentative goal. While many prior studies have focused on helping users correct their motor skill motions, most of them present the reference motion regardless of whether it is attainable. This makes it difficult for users to modify their motion appropriately when the difference between their motion and the reference motion is too large. This study aims to provide a tentative goal that maximizes performance within a certain amount of motion change. Achieving this requires predicting the performance of an arbitrary motion, yet estimating the performance of a tentative goal with a general model is challenging because of the large variety of human motion. We therefore built an individual model that predicts performance from a small training dataset, implemented with our proposed data augmentation method. Experiments with basketball free-throw data demonstrate the effectiveness of the proposed method.

Title: Automatic Visual Recognition of Unexploded Ordnances Using Supervised Deep Learning
Authors: Georgios Begkas, Panagiotis Giannakeris, K. Ioannidis, Georgios Kalpakis, T. Tsikrika, S. Vrochidis, Y. Kompatsiaris
In: Proceedings of the 2022 International Conference on Multimedia Retrieval, 2022-06-27. DOI: 10.1145/3512527.3531383
Abstract: Unexploded Ordnance (UXO) classification is a challenging task which is currently tackled using electromagnetic induction devices that are expensive and may require physical presence in potentially hazardous environments. The limited availability of open UXO data has, until now, impeded the progress of image-based UXO classification, which may offer a safe alternative at a reduced cost. In addition, the existing sporadic efforts focus mainly on small-scale experiments using only a subset of common UXO categories. Our work aims to stimulate research interest in image-based UXO classification, with the curation of a novel dataset that consists of over 10,000 annotated images from eight major UXO categories. Through extensive experimentation with supervised deep learning we uncover key insights into the challenging aspects of this task. Finally, we set the baseline on our novel benchmark by training state-of-the-art Convolutional Neural Networks and a Vision Transformer that are able to discriminate between highly overlapping UXO categories with 84.33% accuracy.
{"title":"MFGAN: A Lightweight Fast Multi-task Multi-scale Feature-fusion Model based on GAN","authors":"Lijia Deng, Yu-dong Zhang","doi":"10.1145/3512527.3531410","DOIUrl":"https://doi.org/10.1145/3512527.3531410","url":null,"abstract":"Cell segmentation and counting is a time-consuming task and an important experimental step in traditional biomedical research. Many current counting methods require exact cell locations. However, there are few such cell datasets with detailed object coordinates. Most existing cell datasets only have the total number of cells and a global segmentation labelling. To make more effective use of existing datasets, we divided the cell counting task into cell number prediction and cell segmentation respectively. This paper proposed a lightweight fast multi-task multi-scale feature fusion model based on generative adversarial networks (MFGAN). To coordinate the learning of these two tasks, we proposed a Combined Hybrid Loss function (CH Loss) and used conditional GAN to train our network. We proposed a Lightweight Fast Multitask Generator (LFMG) which reduced the number of parameters by 20% compared with U-Net but got better performance on cell segmentation. We used multi-scale feature fusion technology to improve the quality of reconstructed segmentation images. In addition, we also proposed a Structure Fusion Discrimination (SFD) to refine the accuracy of the details of the features. Our method achieved non-Point-based counting that no longer needs to annotate the exact position of each cell in the image during the training and successfully achieved excellent results on cell counting and cell segmentation.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"34 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114273828","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Weakly Supervised Fine-grained Recognition based on Combined Learning for Small Data and Coarse Label","authors":"Anqi Hu, Zhengxing Sun, Qian Li","doi":"10.1145/3512527.3531419","DOIUrl":"https://doi.org/10.1145/3512527.3531419","url":null,"abstract":"Learning with weak supervision already becomes one of the research trends in fine-grained image recognition. These methods aim to learn feature representation in the case of less manual cost or expert knowledge. Most existing weakly supervised methods are based on incomplete annotation or inexact annotation, which is difficult to perform well limited by supervision information. Therefore, using these two kind of annotations for training at the same time could mine more relevance while the annotating burden will not increase much. In this paper, we propose a combined learning framework by coarse-grained large data and fine-grained small data for weakly supervised fine-grained recognition. Combined learning contains two significant modules: 1) a discriminant module, which maintains the structure information consistent between coarse label and fine label by attention map and part sampling, 2) a cluster division strategy, which mines the detail differences between fine categories by feature subtraction. Experiment results show that our method outperforms weakly supervised methods and achieves the performance close to fully supervised methods in CUB-200-2011 and Stanford Cars datasets.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114557119","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Disentangled Representations and Hierarchical Refinement of Multi-Granularity Features for Text-to-Image Synthesis","authors":"Pei Dong, L. Wu, Lei Meng, Xiangxu Meng","doi":"10.1145/3512527.3531389","DOIUrl":"https://doi.org/10.1145/3512527.3531389","url":null,"abstract":"In this paper, we focus on generating photo-realistic images from given text descriptions. Current methods first generate an initial image and then progressively refine it to a high-resolution one. These methods typically indiscriminately refine all granularity features output from the previous stage. However, the ability to express different granularity features in each stage is not consistent, and it is difficult to express precise semantics by further refining the features with poor quality generated in the previous stage. Current methods cannot refine different granularity features independently, resulting in that it is challenging to clearly express all factors of semantics in generated image, and some features even become worse. To address this issue, we propose a Hierarchical Disentangled Representations Generative Adversarial Networks (HDR-GAN) to generate photo-realistic images by explicitly disentangling and individually modeling the factors of semantics in the image. HDR-GAN introduces a novel component called multi-granularity feature disentangled encoder to represent image information comprehensively through explicitly disentangling multi-granularity features including pose, shape and texture. Moreover, we develop a novel Multi-granularity Feature Refinement (MFR) containing a Coarse-grained Feature Refinement (CFR) model and a Fine-grained Feature Refinement (FFR) model. CFR utilizes coarse-grained disentangled representations (e.g., pose and shape) to clarify category information, while FFR employs fine-grained disentangled representations (e.g., texture) to reflect instance-level details. Extensive experiments on two well-studied and publicly available datasets (i.e., CUB-200 and CLEVR-SV) demonstrate the rationality and superiority of our method.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114681329","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Title: Improving Image Captioning via Enhancing Dual-Side Context Awareness
Authors: Yi-Meng Gao, Ning Wang, Wei Suo, Mengyang Sun, Peifeng Wang
In: Proceedings of the 2022 International Conference on Multimedia Retrieval, 2022-06-27. DOI: 10.1145/3512527.3531379
Abstract: Recent work on visual question answering demonstrates that grid features can work as well as region features for vision-language tasks. Meanwhile, transformer-based models and their variants have shown remarkable performance on image captioning. However, the object-contextual information missing on the encoder side, caused by the single-granularity nature of grid features, and the future contextual information missing on the decoder side, due to the left-to-right decoding paradigm of the transformer, remain unexplored. In this work, we tackle these two problems by enhancing contextual information on both sides: (i) on the encoder side, we propose a Context-Aware Self-Attention module in which the keys/values are expanded with adjacent rectangular regions, each containing two or more aggregated grid features; this yields grid features of varying granularity and stores adequate contextual information for objects of different scales. (ii) On the decoder side, we incorporate a dual-way decoding strategy in which left-to-right and right-to-left decoding are conducted simultaneously and interactively, so that both past and future contextual information is utilized when generating the current word. Combining these two modules with a vanilla transformer, our Context-Aware Transformer (CATNet) achieves a new state of the art on the MSCOCO benchmark.
{"title":"Phrase-level Prediction for Video Temporal Localization","authors":"Sizhe Li, C. Li, Minghang Zheng, Yang Liu","doi":"10.1145/3512527.3531382","DOIUrl":"https://doi.org/10.1145/3512527.3531382","url":null,"abstract":"Video temporal localization aims to locate a period that semantically matches a natural language query in a given untrimmed video. We empirically observe that although existing approaches gain steady progress on sentence localization, the performance of phrase localization is far from satisfactory. In principle, the phrase should be easier to localize as fewer combinations of visual concepts need to be considered; such incapability indicates that the existing models only capture the sentence annotation bias in the benchmark but lack sufficient understanding of the intrinsic relationship between simple visual and language concepts, thus the model generalization and interpretability is questioned. This paper proposes a unified framework that can deal with both sentence and phrase-level localization, namely Phrase Level Prediction Net (PLPNet). Specifically, based on the hypothesis that similar phrases tend to focus on similar video cues, while dissimilar ones should not, we build a contrastive mechanism to restrain phrase-level localization without fine-grained phrase boundary annotation required in training. Moreover, considering the sentence's flexibility and wide discrepancy among phrases, we propose a clustering-based batch sampler to ensure that contrastive learning can be conducted efficiently. Extensive experiments demonstrate that our method surpasses state-of-the-art methods of phrase-level temporal localization while maintaining high performance in sentence localization and boosting the model's interpretability and generalization capability. Our code is available at https://github.com/sizhelee/PLPNet.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114735390","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}