Title: Prototype-based Selective Knowledge Distillation for Zero-Shot Sketch Based Image Retrieval
Authors: Kai Wang, Yifan Wang, Xing Xu, Xin Liu, Weihua Ou, Huimin Lu
DOI: https://doi.org/10.1145/3503161.3548382
Published: Proceedings of the 30th ACM International Conference on Multimedia, 10 October 2022
Abstract: Zero-Shot Sketch-Based Image Retrieval (ZS-SBIR) is an emerging research task that aims to retrieve data of new classes across sketches and images. It is challenging due to the heterogeneous distributions and inconsistent semantics of the cross-modal sketch and image data across seen and unseen classes. To realize knowledge transfer, the latest approaches introduce knowledge distillation, which optimizes the student network through the teacher signal distilled from a teacher network pre-trained on large-scale datasets. However, these methods often ignore mispredictions in the teacher signal, which may make the model vulnerable to the wrong outputs of the teacher network. To tackle these issues, we propose a novel method termed Prototype-based Selective Knowledge Distillation (PSKD) for ZS-SBIR. Our PSKD method first learns a set of prototypes to represent categories and then utilizes an instance-level adaptive learning strategy to strengthen semantic relations between categories. Afterwards, a correlation matrix targeted at the downstream task is established through the prototypes. With the learned correlation matrix, the teacher signal given by transformers pre-trained on ImageNet and fine-tuned on the downstream dataset can be reconstructed to weaken the impact of mispredictions and selectively distill knowledge into the student network. Extensive experiments conducted on three widely used datasets demonstrate that the proposed PSKD method establishes new state-of-the-art performance on all datasets for ZS-SBIR.
{"title":"Learning Modality-Specific and -Agnostic Representations for Asynchronous Multimodal Language Sequences","authors":"Dingkang Yang, Haopeng Kuang, Shuai Huang, Lihua Zhang","doi":"10.1145/3503161.3547755","DOIUrl":"https://doi.org/10.1145/3503161.3547755","url":null,"abstract":"Understanding human behaviors and intents from videos is a challenging task. Video flows usually involve time-series data from different modalities, such as natural language, facial gestures, and acoustic information. Due to the variable receiving frequency for sequences from each modality, the collected multimodal streams are usually unaligned. For multimodal fusion of asynchronous sequences, the existing methods focus on projecting multiple modalities into a common latent space and learning the hybrid representations, which neglects the diversity of each modality and the commonality across different modalities. Motivated by this observation, we propose a Multimodal Fusion approach for learning modality-Specific and modality-Agnostic representations (MFSA) to refine multimodal representations and leverage the complementarity across different modalities. Specifically, a predictive self-attention module is used to capture reliable contextual dependencies and enhance the unique features over the modality-specific spaces. Meanwhile, we propose a hierarchical cross-modal attention module to explore the correlations between cross-modal elements over the modality-agnostic space. In this case, a double-discriminator strategy is presented to ensure the production of distinct representations in an adversarial manner. Eventually, the modality-specific and -agnostic multimodal representations are used together for downstream tasks. Comprehensive experiments on three multimodal datasets clearly demonstrate the superiority of our approach.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130452545","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Complementarity-Enhanced and Redundancy-Minimized Collaboration Network for Multi-agent Perception","authors":"Guiyang Luo, Hui Zhang, Quan Yuan, Jinglin Li","doi":"10.1145/3503161.3548197","DOIUrl":"https://doi.org/10.1145/3503161.3548197","url":null,"abstract":"Multi-agent collaborative perception depends on sharing sensory information to improve perception accuracy and robustness, as well as to extend coverage. The cooperative shared information between agents should achieve an equilibrium between redundancy and complementarity, thus creating a concise and composite representation. To this end, this paper presents a complementarity-enhanced and redundancy-minimized collaboration network (CRCNet), for efficiently guiding and supervising the fusion among shared features. Our key novelties lie in two aspects. First, each fused feature is forced to bring about a marginal gain by exploiting a contrastive loss, which can supervise our model to select complementary features. Second, mutual information is applied to measure the dependence between fused feature pairs and the upper bound of mutual information is minimized to encourage independence, thus guiding our model to select irredundant features. Furthermore, the above modules are incorporated into a feature fusion network CRCNet. Our quantitative and qualitative experiments in collaborative object detection show that CRCNet performs better than the state-of-the-art methods.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129253347","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Title: Learned Internet Congestion Control for Short Video Uploading
Authors: Tianchi Huang, Chao Zhou, Lianchen Jia, Ruixiao Zhang, Lifeng Sun
DOI: https://doi.org/10.1145/3503161.3548436
Published: Proceedings of the 30th ACM International Conference on Multimedia, 10 October 2022
Abstract: Short video uploading services have become increasingly important, as at least 30 million videos are uploaded per day. However, we find that existing congestion control (CC) algorithms, whether heuristic or learning-based, are not suitable for video uploading: they lack the fundamental mechanisms required for this setting and fall short of leveraging network modeling. We present DuGu, a novel learning-based CC algorithm designed around the unique properties of short video uploading, handled in a probing phase, and of Internet networking, handled in a control phase. During the probing phase, DuGu leverages the transmission gaps between uploaded short videos to actively probe network metrics and better understand network dynamics. During the control phase, DuGu uses a neural network (NN) to avoid congestion. Instead of relying on handcrafted reward functions, the NN is trained by imitating the expert policy given by an optimal solver, improving both performance and learning efficiency. To build this system, we construct an omniscient-like network emulator, implement an optimal solver, and collect a large corpus of real-world network traces to learn expert strategies. Trace-driven and real-world A/B tests reveal that DuGu supports multiple objectives and rivals or outperforms existing CC algorithms across all considered scenarios.

Title: Disparity-based Stereo Image Compression with Aligned Cross-View Priors
Authors: Yongqi Zhai, Luyang Tang, Y. Ma, Rui Peng, Rong Wang
DOI: https://doi.org/10.1145/3503161.3548136
Published: Proceedings of the 30th ACM International Conference on Multimedia, 10 October 2022
Abstract: With the wide application of stereo images in various fields, research on stereo image compression (SIC) has attracted extensive attention from academia and industry. The core of SIC is to fully exploit the mutual information between the left and right images and reduce the redundancy between views as much as possible. In this paper, we propose DispSIC, an end-to-end trainable deep neural network in which a stereo matching model is jointly trained to assist the compression task. Based on the stereo matching result (i.e., the disparity), the right image can be warped to the left view, so that only the residuals between the left and warped views need to be encoded for the left image. A three-branch auto-encoder architecture is adopted in DispSIC, encoding the right image, the disparity map, and the residuals respectively. During training, the whole network learns how to adaptively allocate bitrates to these three parts, achieving better rate-distortion performance at the cost of only a small bitrate for the disparity map. Moreover, we propose a conditional entropy model with aligned cross-view priors for SIC, which takes the warped latents of the right image as priors to improve the accuracy of probability estimation for the left image. Experimental results demonstrate that our proposed method achieves superior performance compared to existing SIC methods on the KITTI and InStereo2K datasets, both quantitatively and qualitatively.
{"title":"Fine-tuning with Multi-modal Entity Prompts for News Image Captioning","authors":"Jingjing Zhang, Shancheng Fang, Zhendong Mao, Zhiwei Zhang, Yongdong Zhang","doi":"10.1145/3503161.3547883","DOIUrl":"https://doi.org/10.1145/3503161.3547883","url":null,"abstract":"News Image Captioning aims to generate descriptions for images embedded in news articles, including plentiful real-world concepts, especially about named entities. However, existing methods are limited in the entity-level template. Not only is it labor-intensive to craft the template, but it is error-prone due to local entity-aware, which solely constrains the prediction output at each language model decoding step with corrupted entity relationship. To overcome the problem, we investigate a concise and flexible paradigm to achieve global entity-aware by introducing a prompting mechanism with fine-tuning pre-trained models, named Fine-tuning with Multi-modal Entity Prompts for News Image Captioning (NewsMEP). Firstly, we incorporate two pre-trained models: (i) CLIP, translating the image with open-domain knowledge; (ii) BART, extended to encode article and image simultaneously. Moreover, leveraging the BART architecture, we can easily take the end-to-end fashion. Secondly, we prepend the target caption with two prompts to utilize entity-level lexical cohesion and inherent coherence in the pre-trained language model. Concretely, the visual prompts are obtained by mapping CLIP embeddings, and contextual vectors automatically construct the entity-oriented prompts. Thirdly, we provide an entity chain to control caption generation that focuses on entities of interest. Experiments results on two large-scale publicly available datasets, including detailed ablation studies, show that our NewsMEP not only outperforms state-of-the-art methods in general caption metrics but also achieves significant performance in precision and recall of various named entities.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126895227","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Title: Zero-shot Video Classification with Appropriate Web and Task Knowledge Transfer
Authors: Junbao Zhuo, Yan Zhu, Shuhao Cui, Shuhui Wang, A. BinM., Qingming Huang, Xiaoming Wei, Xiaolin Wei
DOI: https://doi.org/10.1145/3503161.3548008
Published: Proceedings of the 30th ACM International Conference on Multimedia, 10 October 2022
Abstract: Zero-shot video classification (ZSVC), which aims to recognize video classes never seen during model training, has become a thriving research direction. ZSVC is achieved by building mappings between visual and semantic embeddings. Recently, ZSVC has been pursued by automatically mining the underlying objects in videos as attributes and incorporating external commonsense knowledge. However, the objects mined from seen categories cannot generalize to unseen ones. Moreover, the category-object relationships are usually extracted from commonsense knowledge or word embeddings, which are not consistent with the video modality. To tackle these issues, we propose to mine associated objects and category-object relationships for each category from retrieved web images. The associated objects of all categories are employed as generic attributes, and the mined category-object relationships narrow the modality inconsistency for better knowledge transfer. Another issue of existing ZSVC methods is that a model sufficiently trained on labeled seen categories may not generalize well to distinct unseen categories. To encourage a more reliable transfer, we propose Task Similarity aware Representation Learning (TSRL), in which the similarity between the seen categories and the unseen ones is estimated and used to regularize the model in an appropriate way. We build a model for ZSVC based on the constructed attributes, the mined category-object relationships, and the proposed TSRL. Experimental results on four public datasets, i.e., FCVID, UCF101, HMDB51, and Olympic Sports, show that our model performs favorably against state-of-the-art methods. Our code is publicly available at https://github.com/junbaoZHUO/TSRL.
{"title":"QoE-aware Download Control and Bitrate Adaptation for Short Video Streaming","authors":"Ximing Wu, Lei Zhang, Laizhong Cui","doi":"10.1145/3503161.3551590","DOIUrl":"https://doi.org/10.1145/3503161.3551590","url":null,"abstract":"Nowadays, although the rapidly growing demand for short video sharing has brought enormous commercial value, considerable bandwidth usage becomes a problem for service providers. To save costs of service providers, the short video applications face a critical conflict between maximizing the user quality of experience (QoE) and minimizing the bandwidth usage. Most of existing bitrate adaptation methods are designed for the livecast and video-on-demand instead of short video applications. In this paper, we propose a QoE-aware adaptive download control mechanism to ensure the user QoE and save the bandwidth, which can download the appropriate video according to user retention probabilities and network conditions, and pause the download when the buffered data is enough. The extensive simulation results demonstrate the superiority of our proposed mechanism over the other baseline methods.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121392613","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Title: ARRA: Absolute-Relative Ranking Attack against Image Retrieval
Authors: S. Li, Xing Xu, Zailei Zhou, Yang Yang, Guoqing Wang, Heng Tao Shen
DOI: https://doi.org/10.1145/3503161.3548138
Published: Proceedings of the 30th ACM International Conference on Multimedia, 10 October 2022
Abstract: With the extensive application of deep learning, adversarial attacks, especially query-based attacks, receive more attention than ever before. However, the scenarios assumed by existing query-based attacks against image retrieval are usually too simple to satisfy practical attack demands. In this paper, we propose a novel method termed Absolute-Relative Ranking Attack (ARRA) that considers a more practical attack scenario. Specifically, we propose two compatible goals for the query-based attack, i.e., the absolute ranking attack and the relative ranking attack, which aim to assign specific ranks to chosen candidates and to change the relative order of chosen candidates in the retrieval list, respectively. We further devise an Absolute Ranking Loss (ARL) and a Relative Ranking Loss (RRL) for these goals, implement ARRA by minimizing their combination with black-box optimizers, and evaluate the attack performance by attack success rate and normalized ranking correlation. Extensive experiments conducted on the widely used SOP and CUB-200 datasets demonstrate the superiority of the proposed approach over the baselines. Moreover, the attack results on a real-world image retrieval system, i.e., Huawei Cloud Image Search, also prove the practicality of our ARRA approach.
{"title":"OISSR: Optical Image Stabilization Based Super Resolution on Smartphone Cameras","authors":"Hao Pan, Feitong Tan, Wenhao Li, Yi-Chao Chen, Guangtao Xue","doi":"10.1145/3503161.3547964","DOIUrl":"https://doi.org/10.1145/3503161.3547964","url":null,"abstract":"Multi-frame super-resolution methods can generate high resolution images by combining multiple captures of the same scene; however, the performance of merged results are susceptible to degradation due to a lack of precision in image registration. In this study, we sought to develop a robust multi-frame super resolution method (called OISSR) for use on smartphone cameras with a optical image stabilizer (OIS). Acoustic injection is used to alter the readings from the built-in MEMS gyroscope to control the lens motion in the OIS module (note that the image sensor is fixed). We employ a priori knowledge of the induced lens motion to facilitate optical flow estimation with sub-pixel accuracy, and the output high-precision pixel alignment vectors are utilized to merge the multiple frames to reconstruct the final super resolution image. Extensive experiments on a OISSR prototype implemented on a Xiaomi 10Ultra demonstrate the high performance and effectiveness of the proposed system in obtaining the quadruple enhanced resolution imaging.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121495476","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}