{"title":"Weight-based Regularization for Improving Robustness in Image Classification","authors":"Hao Yang, Min Wang, Zhengfei Yu, Yun Zhou","doi":"10.1109/ICME55011.2023.00305","DOIUrl":"https://doi.org/10.1109/ICME55011.2023.00305","url":null,"abstract":"Deep Neural Networks (DNNs) are known to be vulnerable to adversarial attacks. Recently, Stochastic Neural Networks (SNNs) have been proposed to enhance adversarial robustness by injecting uncertainty into the models. However, existing SNNs are often inspired by intuition and rely on adversarial training, which is computationally costly. To address this issue, we propose a novel SNN called the Weight-based Stochastic Neural Network (WB-SNN), which is based on optimizing an error upper bound of adversarial robustness from the perspective of weight distribution. To the best of our knowledge, we are the first to propose a theoretically guaranteed weight-based stochastic neural network without relying on adversarial training. In comparison to normal adversarial training, our method saves about three times the computation cost. Extensive experiments on various datasets, networks, and adversarial attacks have demonstrated the effectiveness of the proposed method.","PeriodicalId":321830,"journal":{"name":"2023 IEEE International Conference on Multimedia and Expo (ICME)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130429013","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Image Layer Modeling for Complex Document Layout Generation","authors":"Tianlong Ma, Xingjiao Wu, Xiangcheng Du, Yanlong Wang, Cheng Jin","doi":"10.1109/ICME55011.2023.00386","DOIUrl":"https://doi.org/10.1109/ICME55011.2023.00386","url":null,"abstract":"Document layout analysis (DLA) plays an essential role in information extraction and document understanding. At present, DLA has achieved milestone results; however, DLA of non-Manhattan layouts is still challenging because of annotation data limitations. In this paper, we propose an image layer modeling method to mitigate this issue. The image layer modeling method generates document images of non-Manhattan layouts by superimposing images under pre-defined aesthetic rules. Due to the lack of an evaluation benchmark for non-Manhattan layouts, we have constructed a manually-labeled non-Manhattan layout fine-grained segmentation dataset. To the best of our knowledge, this is the first manually-labeled non-Manhattan layout fine-grained segmentation dataset. Extensive experimental results verify that our proposed image layer modeling method can better deal with fine-grained segmentation of documents with non-Manhattan layouts.","PeriodicalId":321830,"journal":{"name":"2023 IEEE International Conference on Multimedia and Expo (ICME)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128843289","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MSG-CAM:Multi-scale inputs make a better visual interpretation of CNN networks","authors":"Xiaohong Xiang, Fuyuan Zhang, Xin Deng, Ke Hu","doi":"10.1109/ICME55011.2023.00061","DOIUrl":"https://doi.org/10.1109/ICME55011.2023.00061","url":null,"abstract":"The visualization of deep learning models has been widely studied as an effective means of exploring the decision-making processes within these models. However, current visualization methods suffer from several limitations, such as low resolution and poor visualization of multiple occurrences of the same class. In this paper, we propose a novel visualization technique called MSG-CAM, which is an improvement on the existing Group-CAM method. Our method uses the feature maps and gradients of the last layer of the convolutional neural network to create masks through multi-scale enlargement of the original input image and fusion of the resulting feature maps and gradients. Through both qualitative and quantitative analysis, we have demonstrated that the saliency maps generated by our method are more reasonable and accurately reflect the internal decision-making processes of the neural network.","PeriodicalId":321830,"journal":{"name":"2023 IEEE International Conference on Multimedia and Expo (ICME)","volume":"89 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125468642","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Audio-Visual Generalized Zero-Shot Learning Based on Variational Information Bottleneck","authors":"Yapeng Li, Yong Luo, Bo Du","doi":"10.1109/ICME55011.2023.00084","DOIUrl":"https://doi.org/10.1109/ICME55011.2023.00084","url":null,"abstract":"Audio-visual generalized zero-shot learning (GZSL) aims to train a model on seen classes for classifying data samples from both seen classes and unseen classes. Due to the absence of unseen training samples, the model tends to misclassify unseen class samples into seen classes. To mitigate this problem, in this paper, we propose a method based on variational information bottleneck for audio-visual GZSL. Specifically, we model the joint representations as a product-of-experts over marginal representations to integrate the information of audio and visual. Besides, we introduce variational information bottleneck to the learning of audio-visual joint representations and marginal representations of audio, visual, and text label modalities. This helps our model reduce the negative impact of information that cannot be generalized to unseen classes. Experimental results conducted on the UCF-GZSL, VGGSound-GZSL, and ActivityNet-GZSL benchmarks demonstrate the effectiveness and superiority of the proposed model for audio-visual GZSL.","PeriodicalId":321830,"journal":{"name":"2023 IEEE International Conference on Multimedia and Expo (ICME)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125502994","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Region-Aware Semantic Consistency for Unsupervised Domain-Adaptive Semantic Segmentation","authors":"Jun Xie, Yixuan Zhou, Xing Xu, Guoqing Wang, Fumin Shen, Yang Yang","doi":"10.1109/ICME55011.2023.00024","DOIUrl":"https://doi.org/10.1109/ICME55011.2023.00024","url":null,"abstract":"As acquiring pixel-wise labels for semantic segmentation is labor-intensive, unsupervised domain adaptation (UDA) techniques aim to transfer knowledge from synthetic data to real-scene data. To overcome the distribution misalignment between the source domain and the target domain, Teacher-Student (TS) methods are widely-used and promising. In TS methods, the student resorts to the one-hot pseudo labels generated by the teacher. However, the generated one-hot pseudo labels are dubious and ignore the semantic correlation among classes. Besides, in the same position of the same image, the output distributions between the student and the teacher should be consistent. Such prediction consistency is defined as Region-Aware Semantic Consistency (RASC). Correspondingly, we propose an RASC module to assimilate the output distributions of the teacher and the student. Our RASC module is flexible and easily plugged into TS state-of-the-arts (SOTAs) based on either CNNs or Transformers.","PeriodicalId":321830,"journal":{"name":"2023 IEEE International Conference on Multimedia and Expo (ICME)","volume":"165 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126853258","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-stream Adaptive Offloading of Joint Compressed Video Streams, Feature Streams, and Semantic Streams in Edge Computing Systems","authors":"Dieli Hu, Wen Ji, Zhi Wang","doi":"10.1109/ICME55011.2023.00175","DOIUrl":"https://doi.org/10.1109/ICME55011.2023.00175","url":null,"abstract":"Edge computing (EC) is a promising paradigm for serving latency-sensitive video applications. However, massive compressed video transmission and analysis require considerable bandwidth and computing resources, posing enormous challenges for current multimedia frameworks. Novel multi-stream frameworks that incorporate feature streams are more practical. The reason is that feature streams containing compact video frame feature data have a lower bitrate and better serve machine vision tasks. Nevertheless, feature extraction by devices increases the latency and energy consumption of local computing. Therefore, how to offload suitable streams according to video task requirements and system resources is a challenging issue. This paper studies EC-based multi-stream adaptive offloading. We model the multi-stream offloading and computation problem to maximize system utility by jointly optimizing offloading decisions, computation resource allocation, and video frame sampling rates. Frame sampling rates, processing latency, and energy consumption are considered in system utility modeling. The formulated optimization problem is a mixed-integer programming (MIP) problem. We propose an efficient algorithm to address this MIP problem. The proposed algorithm relies on the Hungarian algorithm and improved greedy Markov approximation. The simulation results validate our proposed algorithm’s superior performance.","PeriodicalId":321830,"journal":{"name":"2023 IEEE International Conference on Multimedia and Expo (ICME)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126899192","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Collaborative Spatial-Temporal Distillation for Efficient Video Deraining","authors":"Yuzhang Hu, Minghao Liu, Wenhan Yang, Jiaying Liu, Zongming Guo","doi":"10.1109/ICME55011.2023.00332","DOIUrl":"https://doi.org/10.1109/ICME55011.2023.00332","url":null,"abstract":"In this paper, we propose a novel knowledge distillation framework to improve the efficiency of deep networks for video deraining. The knowledge is transferred from a large-scale powerful teacher network to a compact efficient student network via the proposed collaborative spatial-temporal distillation framework. The framework is equipped with three collaboration schemes of different granularities that make use of spatial-temporal redundancy in a complementary way for better distillation performance. First, the spatial alignment module applies distillation constraints at different spatial scales to achieve better scale invariance in transferred knowledge. Second, the temporal alignment module traces both temporal status between teacher and student separately and collaboratively, to comprehensively utilize inter-frame information. Third, these two alignment modules interact through a spatial-temporal adaptor, where spatial-temporal knowledge is transferred in a unified framework. Extensive experiments demonstrate the superiority of our distillation framework as well as the effectiveness of each module. Our code is available at: https://github.com/HuYuzhang/Knowledge-Distillation.","PeriodicalId":321830,"journal":{"name":"2023 IEEE International Conference on Multimedia and Expo (ICME)","volume":"94 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126255030","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DFCP: Few-Shot DeepFake Detection via Contrastive Pretraining","authors":"Bojing Zou, Chao Yang, Jiazhi Guan, Chengbin Quan, Youjian Zhao","doi":"10.1109/ICME55011.2023.00393","DOIUrl":"https://doi.org/10.1109/ICME55011.2023.00393","url":null,"abstract":"Abuses of forgery techniques have created a considerable problem of misinformation on social media. Although scholars have devoted much effort to face forgery detection (a.k.a. DeepFake detection) and achieved some results, two issues still hinder practical application: 1) most detectors do not generalize well to unseen datasets; 2) most previous works are supervised and require a considerable amount of manually labeled data. To address these problems, we propose a simple contrastive pretraining framework for DeepFake detection (DFCP), which works in a finetuning-after-pretraining manner and requires only a few labels (5%). Specifically, we design a two-stream framework to simultaneously learn high-frequency texture features and high-level semantic information during pretraining. In addition, a video-based frame sampling strategy is proposed to mitigate potential noisy data in the instance-discriminative contrastive learning to achieve better performance. Experimental results on several downstream datasets show the state-of-the-art performance of the proposed DFCP, which works at frame-level (w/o temporal reasoning) with high efficiency but outperforms video-level methods.","PeriodicalId":321830,"journal":{"name":"2023 IEEE International Conference on Multimedia and Expo (ICME)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123016843","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"E2: Entropy Discrimination and Energy Optimization for Source-free Universal Domain Adaptation","authors":"Meng Shen, A. J. Ma, PongChi Yuen","doi":"10.1109/ICME55011.2023.00460","DOIUrl":"https://doi.org/10.1109/ICME55011.2023.00460","url":null,"abstract":"Universal domain adaptation (UniDA) transfers knowledge under both distribution and category shifts. Most UniDA methods require access to source-domain data during model adaptation, which may result in privacy policy violations and inefficient source-data transfer. To address this issue, we propose a novel source-free UniDA method coupling confidence-guided entropy discrimination and likelihood-induced energy optimization. The entropy-based separation of target-known and unknown classes is too conservative for known-class prediction. Thus, we derive the confidence-guided entropy by scaling the normalized prediction score with the known-class confidence, so that more known-class samples are correctly predicted. Because the marginal distribution is difficult to estimate without source-domain data, we constrain the target-domain marginal distribution by maximizing (minimizing) the known (unknown)-class likelihood, which equals free energy optimization. Theoretically, the overall optimization amounts to decreasing and increasing internal energy of known and unknown classes in physics, respectively. Extensive experiments demonstrate the superiority of the proposed method.","PeriodicalId":321830,"journal":{"name":"2023 IEEE International Conference on Multimedia and Expo (ICME)","volume":"10 2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120921294","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ABTD-Net: Autonomous Baggage Threat Detection Networks for X-ray Images","authors":"Wen Liu, Degang Sun, Yan Wang, Zhongyuan Chen, Xinbo Han, Haitian Yang","doi":"10.1109/ICME55011.2023.00214","DOIUrl":"https://doi.org/10.1109/ICME55011.2023.00214","url":null,"abstract":"Automated security screening has a significant role in protecting public spaces from security threats by employing X-ray images to detect prohibited items. However, there are challenges of noise production due to squeezing, occlusion, and penetration of luggage objects. Additionally, the hues of objects are monotonous and lack luster. To solve these problems, we propose an Autonomous Baggage Threat Detection Network (ABTD-Net) for accurate prohibited item detection. To tackle the difficulty of capturing distinctive visual features, we constructed a Feature Adjustment Head (FAH) to refine pyramid features. Specifically, we designed an Attention Module (AM) at several places after initially using a Dense Unidirectional Propagation (DUP) to filter noise. Furthermore, we created a Feature Fusion Head (FFH) that dynamically fuses hierarchical visual information under object occlusion, including early-fusion and late-fusion. Extensive experiments on security inspection X-ray datasets OPIXray and HiXray demonstrate the superiority of our proposed method.","PeriodicalId":321830,"journal":{"name":"2023 IEEE International Conference on Multimedia and Expo (ICME)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121120454","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}