Title: Accelerated Sign Hunter: A Sign-based Black-box Attack via Branch-Prune Strategy and Stabilized Hierarchical Search
Authors: S. Li, Guangji Huang, Xing Xu, Yang Yang, Fumin Shen
In: Proceedings of the 2022 International Conference on Multimedia Retrieval, 2022-06-27. DOI: 10.1145/3512527.3531399
Abstract: We propose the Accelerated Sign Hunter (ASH), a sign-based black-box attack under an l∞ constraint. The proposed method searches for an approximate sign of the gradient of the loss w.r.t. the input image with few queries to the target model and crafts the adversarial example by updating the input image in this direction. It applies a Branch-Prune Strategy that infers unknown sign bits from the ones already checked to avoid unnecessary queries, and it adopts a Stabilized Hierarchical Search to achieve better performance within a limited query budget. We provide a theoretical proof showing that the Accelerated Sign Hunter halves the queries without dropping the attack success rate (SR) compared with the state-of-the-art sign-based black-box attack. Extensive experiments also demonstrate the superiority of our ASH method over other black-box attacks. In particular, on Inception-v3 for ImageNet, our method achieves an SR of 0.989 with an average of 338.56 queries, which is 1/4 fewer than the state-of-the-art sign-based attack requires to achieve the same SR. Moreover, our ASH method works out of the box, since no hyperparameters need to be tuned.

Title: ICDAR'22: Intelligent Cross-Data Analysis and Retrieval
Authors: Minh-Son Dao, M. Riegler, Duc-Tien Dang-Nguyen, C. Gurrin, Yuta Nakashima, M. Dong
In: Proceedings of the 2022 International Conference on Multimedia Retrieval, 2022-06-27. DOI: 10.1145/3512527.3531441
Abstract: We have recently witnessed the rise of cross-data problems alongside multimodal data problems. Examples of this research direction include cross-modal retrieval systems that use a textual query to look for images, air quality indices predicted from lifelogging images, congestion predicted from weather and tweet data, and sleep quality predicted from daily exercises and meals. Although many investigations of multimodal data analytics have been carried out, little research has addressed cross-data settings (e.g., cross-modal, cross-domain, cross-platform). To promote research on intelligent cross-data analytics and retrieval and to help bring about a smart, sustainable society, we introduce the article collection on "Intelligent Cross-Data Analysis and Retrieval". This Research Topic welcomes contributors from diverse research domains and disciplines such as well-being, disaster prevention and mitigation, mobility, climate change, tourism, healthcare, and food computing.
{"title":"Dual-Channel Localization Networks for Moment Retrieval with Natural Language","authors":"Bolin Zhang, Bin Jiang, Chao Yang, Liang Pang","doi":"10.1145/3512527.3531394","DOIUrl":"https://doi.org/10.1145/3512527.3531394","url":null,"abstract":"According to the given natural language query, moment retrieval aims to localize the most relevant moment in an untrimmed video. The existing solutions for this problem can be roughly divided into two categories based on whether candidate moments are generated: i) Moment-based approach: It pre-cuts the video into a set of candidate moments, performs multimodal fusion, and evaluates matching scores with the query. ii) Clip-based approach: It directly aligns video clips and query with predicting matching scores without generating candidate moments. Both frameworks have respective shortcomings: the moment-based models suffer from heavy computations, while the performance of clip-based models is familiarly inferior to moment-based counterparts. To this end, we design an intuitive and efficient Dual-Channel Localization Network (DCLN) to balance computational cost and retrieval performance. For reducing computational cost, we capture the temporal relations of only a few video moments with the same start or end boundary in the proposed dual-channel structure. The start or end channel map index represents the corresponding video moment's start or end time boundary. For improving model performance, we apply the proposed dual-channel localization network to efficiently encode the temporal relations on the dual-channel map and learn discriminative features to distinguish the matching degree between natural language query and video moments. The extensive experiments on two standard benchmarks demonstrate the effectiveness of our proposed method.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"264 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122468651","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Title: Motor Learning based on Presentation of a Tentative Goal
Authors: S. Sun, Yongqing Sun, Mitsuhiro Goto, Shigekuni Kondo, Dan Mikami, Susumu Yamamoto
In: Proceedings of the 2022 International Conference on Multimedia Retrieval, 2022-06-27. DOI: 10.1145/3512527.3531413
Abstract: This paper presents a motor learning method based on the presentation of a personalized target motion, which we call a tentative goal. While many prior studies have focused on helping users correct their motor skill motions, most of them present the reference motion regardless of whether it is attainable. This makes it difficult for users to modify their motion appropriately when the difference between their motion and the reference motion is too large. This study aims to provide a tentative goal that maximizes performance within a certain amount of motion change. Achieving this requires predicting the performance of an arbitrary motion, yet estimating the performance of a tentative goal with a general model is challenging because of the large variety of human motion. We therefore built an individual model that predicts performance from a small training dataset, implemented with our proposed data augmentation method. Experiments with basketball free-throw data demonstrate the effectiveness of the proposed method.

Title: Automatic Visual Recognition of Unexploded Ordnances Using Supervised Deep Learning
Authors: Georgios Begkas, Panagiotis Giannakeris, K. Ioannidis, Georgios Kalpakis, T. Tsikrika, S. Vrochidis, Y. Kompatsiaris
In: Proceedings of the 2022 International Conference on Multimedia Retrieval, 2022-06-27. DOI: 10.1145/3512527.3531383
Abstract: Unexploded Ordnance (UXO) classification is a challenging task which is currently tackled using electromagnetic induction devices that are expensive and may require physical presence in potentially hazardous environments. The limited availability of open UXO data has, until now, impeded the progress of image-based UXO classification, which may offer a safe alternative at a reduced cost. In addition, the existing sporadic efforts focus mainly on small-scale experiments using only a subset of common UXO categories. Our work aims to stimulate research interest in image-based UXO classification, with the curation of a novel dataset that consists of over 10,000 annotated images from eight major UXO categories. Through extensive experimentation with supervised deep learning we uncover key insights into the challenging aspects of this task. Finally, we set the baseline on our novel benchmark by training state-of-the-art Convolutional Neural Networks and a Vision Transformer that are able to discriminate between highly overlapping UXO categories with 84.33% accuracy.
{"title":"MFGAN: A Lightweight Fast Multi-task Multi-scale Feature-fusion Model based on GAN","authors":"Lijia Deng, Yu-dong Zhang","doi":"10.1145/3512527.3531410","DOIUrl":"https://doi.org/10.1145/3512527.3531410","url":null,"abstract":"Cell segmentation and counting is a time-consuming task and an important experimental step in traditional biomedical research. Many current counting methods require exact cell locations. However, there are few such cell datasets with detailed object coordinates. Most existing cell datasets only have the total number of cells and a global segmentation labelling. To make more effective use of existing datasets, we divided the cell counting task into cell number prediction and cell segmentation respectively. This paper proposed a lightweight fast multi-task multi-scale feature fusion model based on generative adversarial networks (MFGAN). To coordinate the learning of these two tasks, we proposed a Combined Hybrid Loss function (CH Loss) and used conditional GAN to train our network. We proposed a Lightweight Fast Multitask Generator (LFMG) which reduced the number of parameters by 20% compared with U-Net but got better performance on cell segmentation. We used multi-scale feature fusion technology to improve the quality of reconstructed segmentation images. In addition, we also proposed a Structure Fusion Discrimination (SFD) to refine the accuracy of the details of the features. Our method achieved non-Point-based counting that no longer needs to annotate the exact position of each cell in the image during the training and successfully achieved excellent results on cell counting and cell segmentation.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"34 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114273828","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Weakly Supervised Fine-grained Recognition based on Combined Learning for Small Data and Coarse Label","authors":"Anqi Hu, Zhengxing Sun, Qian Li","doi":"10.1145/3512527.3531419","DOIUrl":"https://doi.org/10.1145/3512527.3531419","url":null,"abstract":"Learning with weak supervision already becomes one of the research trends in fine-grained image recognition. These methods aim to learn feature representation in the case of less manual cost or expert knowledge. Most existing weakly supervised methods are based on incomplete annotation or inexact annotation, which is difficult to perform well limited by supervision information. Therefore, using these two kind of annotations for training at the same time could mine more relevance while the annotating burden will not increase much. In this paper, we propose a combined learning framework by coarse-grained large data and fine-grained small data for weakly supervised fine-grained recognition. Combined learning contains two significant modules: 1) a discriminant module, which maintains the structure information consistent between coarse label and fine label by attention map and part sampling, 2) a cluster division strategy, which mines the detail differences between fine categories by feature subtraction. Experiment results show that our method outperforms weakly supervised methods and achieves the performance close to fully supervised methods in CUB-200-2011 and Stanford Cars datasets.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114557119","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Disentangled Representations and Hierarchical Refinement of Multi-Granularity Features for Text-to-Image Synthesis","authors":"Pei Dong, L. Wu, Lei Meng, Xiangxu Meng","doi":"10.1145/3512527.3531389","DOIUrl":"https://doi.org/10.1145/3512527.3531389","url":null,"abstract":"In this paper, we focus on generating photo-realistic images from given text descriptions. Current methods first generate an initial image and then progressively refine it to a high-resolution one. These methods typically indiscriminately refine all granularity features output from the previous stage. However, the ability to express different granularity features in each stage is not consistent, and it is difficult to express precise semantics by further refining the features with poor quality generated in the previous stage. Current methods cannot refine different granularity features independently, resulting in that it is challenging to clearly express all factors of semantics in generated image, and some features even become worse. To address this issue, we propose a Hierarchical Disentangled Representations Generative Adversarial Networks (HDR-GAN) to generate photo-realistic images by explicitly disentangling and individually modeling the factors of semantics in the image. HDR-GAN introduces a novel component called multi-granularity feature disentangled encoder to represent image information comprehensively through explicitly disentangling multi-granularity features including pose, shape and texture. Moreover, we develop a novel Multi-granularity Feature Refinement (MFR) containing a Coarse-grained Feature Refinement (CFR) model and a Fine-grained Feature Refinement (FFR) model. CFR utilizes coarse-grained disentangled representations (e.g., pose and shape) to clarify category information, while FFR employs fine-grained disentangled representations (e.g., texture) to reflect instance-level details. Extensive experiments on two well-studied and publicly available datasets (i.e., CUB-200 and CLEVR-SV) demonstrate the rationality and superiority of our method.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114681329","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Title: Improving Image Captioning via Enhancing Dual-Side Context Awareness
Authors: Yi-Meng Gao, Ning Wang, Wei Suo, Mengyang Sun, Peifeng Wang
In: Proceedings of the 2022 International Conference on Multimedia Retrieval, 2022-06-27. DOI: 10.1145/3512527.3531379
Abstract: Recent work on visual question answering demonstrates that grid features can work as well as region features for vision-language tasks. Meanwhile, transformer-based models and their variants have shown remarkable performance on image captioning. However, the object-contextual information missing on the encoder side, caused by the single-granularity nature of grid features, and the future contextual information missing on the decoder side, due to the left-to-right decoding paradigm of the transformer, remain unexplored. In this work, we tackle these two problems by enhancing contextual information on both sides: (i) on the encoder side, we propose a Context-Aware Self-Attention module in which the keys/values are expanded with adjacent rectangular regions, each containing two or more aggregated grid features; this yields grid features of varying granularity and stores adequate contextual information for objects of different scales. (ii) On the decoder side, we incorporate a dual-way decoding strategy in which left-to-right and right-to-left decoding are conducted simultaneously and interactively, so that both past and future contextual information is utilized when generating the current word. Combining these two modules with a vanilla transformer, our Context-Aware Transformer (CATNet) achieves a new state of the art on the MSCOCO benchmark.
{"title":"Phrase-level Prediction for Video Temporal Localization","authors":"Sizhe Li, C. Li, Minghang Zheng, Yang Liu","doi":"10.1145/3512527.3531382","DOIUrl":"https://doi.org/10.1145/3512527.3531382","url":null,"abstract":"Video temporal localization aims to locate a period that semantically matches a natural language query in a given untrimmed video. We empirically observe that although existing approaches gain steady progress on sentence localization, the performance of phrase localization is far from satisfactory. In principle, the phrase should be easier to localize as fewer combinations of visual concepts need to be considered; such incapability indicates that the existing models only capture the sentence annotation bias in the benchmark but lack sufficient understanding of the intrinsic relationship between simple visual and language concepts, thus the model generalization and interpretability is questioned. This paper proposes a unified framework that can deal with both sentence and phrase-level localization, namely Phrase Level Prediction Net (PLPNet). Specifically, based on the hypothesis that similar phrases tend to focus on similar video cues, while dissimilar ones should not, we build a contrastive mechanism to restrain phrase-level localization without fine-grained phrase boundary annotation required in training. Moreover, considering the sentence's flexibility and wide discrepancy among phrases, we propose a clustering-based batch sampler to ensure that contrastive learning can be conducted efficiently. Extensive experiments demonstrate that our method surpasses state-of-the-art methods of phrase-level temporal localization while maintaining high performance in sentence localization and boosting the model's interpretability and generalization capability. Our code is available at https://github.com/sizhelee/PLPNet.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114735390","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}