A survey of multimodal federated learning: background, applications, and perspectives
Hao Pan, Xiaoli Zhao, Lipeng He, Yicong Shi, Xiaogang Lin
Multimedia Systems, 2024-07-29. DOI: https://doi.org/10.1007/s00530-024-01422-9

Abstract: Multimodal Federated Learning (MMFL) is a machine learning paradigm that extends traditional Federated Learning (FL) to support collaborative training of local models on data from multiple modalities. With vast amounts of multimodal data being generated and stored by the internet, sensors, and mobile devices, and with artificial intelligence models iterating rapidly, the demand for multimodal models is growing quickly. Although FL has been widely studied in recent years, most existing research has been based on unimodal settings. Hoping to inspire more applications and research within the MMFL paradigm, we conduct a comprehensive review of the progress and challenges in various aspects of state-of-the-art MMFL. Specifically, we analyze the research motivation for MMFL, propose a new classification of existing work, discuss the available datasets and application scenarios, and offer perspectives on the opportunities and challenges facing MMFL.
GAN-based image steganography by exploiting transform domain knowledge with deep networks
Xiao Li, Liquan Chen, Jianchang Lai, Zhangjie Fu, Suhui Liu
Multimedia Systems, 2024-07-29. DOI: https://doi.org/10.1007/s00530-024-01427-4

Abstract: Image steganography secures the transmission of secret information by hiding it within routine multimedia transmissions. When images are generated with a Generative Adversarial Network (GAN), the embedding and recovery of secret bits can rely entirely on deep networks, removing much manual design effort. However, existing GAN-based methods typically build their deep networks by adapting generic deep learning architectures to image steganography. These architectures lack feature extraction tailored to steganography, resulting in low imperceptibility. To address this problem, we propose GAN-based image steganography that exploits transform domain knowledge with deep networks, called EStegTGANs. Unlike existing GAN-based methods, we explicitly introduce transform domain knowledge via the Discrete Wavelet Transform (DWT) and its inverse (IDWT) into the deep networks, ensuring that each network operates on DWT features. Specifically, the encoder embeds secrets and generates stego images using explicit DWT and IDWT; the decoder recovers secrets, and the discriminator evaluates feature distributions, using the explicit DWT. Using traditional DWT and IDWT, we first propose EStegTGAN-coe, which embeds and recovers directly in the DWT coefficients of pixels. To create more feature redundancy for secrets, we instead extract DWT features from intermediate network features for embedding and recovery, yielding EStegTGAN-DWT with traditional DWT and IDWT. Finally, to rely entirely on deep networks without traditional filters, we design convolutional DWT and IDWT modules that perform the same feature transformation as the traditional approaches and substitute them into EStegTGAN-DWT. Comprehensive experimental results demonstrate that our proposals significantly improve imperceptibility, and that our convolutional DWT and IDWT modules distinguish the high-frequency characteristics of images more effectively for steganography than traditional DWT and IDWT.
{"title":"Coordinate-aligned multi-camera collaboration for active multi-object tracking","authors":"Zeyu Fang, Jian Zhao, Mingyu Yang, Zhenbo Lu, Wengang Zhou, Houqiang Li","doi":"10.1007/s00530-024-01420-x","DOIUrl":"https://doi.org/10.1007/s00530-024-01420-x","url":null,"abstract":"<p>Active Multi-Object Tracking (AMOT) is a task where cameras are controlled by a centralized system to adjust their poses automatically and collaboratively so as to maximize the coverage of targets in their shared visual field. In AMOT, each camera only receives partial information from its observation, which may mislead cameras to take locally optimal action. Besides, the global goal, i.e., maximum coverage of objects, is hard to be directly optimized. To address the above issues, we propose a coordinate-aligned multi-camera collaboration system for AMOT. In our approach, we regard each camera as an agent and address AMOT with a multi-agent reinforcement learning solution. To represent the observation of each agent, we first identify the targets in the camera view with an image detector and then align the coordinates of the targets via inverse projection transformation. We define the reward of each agent based on both global coverage as well as four individual reward terms. The action policy of the agents is derived from a value-based Q-network. To the best of our knowledge, we are the first to study the AMOT task. To train and evaluate the efficacy of our system, we build a virtual yet credible 3D environment, named “Soccer Court”, to mimic the real-world AMOT scenario. The experimental results show that our system outperforms the baseline and existing methods in various settings, including real-world datasets.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":null,"pages":null},"PeriodicalIF":3.9,"publicationDate":"2024-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141865992","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SAM-guided contrast based self-training for source-free cross-domain semantic segmentation","authors":"Qinghua Ren, Ke Hou, Yongzhao Zhan, Chen Wang","doi":"10.1007/s00530-024-01426-5","DOIUrl":"https://doi.org/10.1007/s00530-024-01426-5","url":null,"abstract":"<p>Traditional domain adaptive semantic segmentation methods typically assume access to source domain data during training, a paradigm known as source-access domain adaptation for semantic segmentation (SASS). To address data privacy concerns in real-world applications, source-free domain adaptation for semantic segmentation (SFSS) has recently been studied, eliminating the need for direct access to source data. Most SFSS methods primarily utilize pseudo-labels to regularize the model in either the label space or the feature space. Inspired by the segment anything model (SAM), we propose SAM-guided contrast based pseudo-label learning for SFSS in this work. Unlike previous methods that heavily rely on noisy pseudo-labels, we leverage the class-agnostic segmentation masks generated by SAM as prior knowledge to construct positive and negative sample pairs. This approach allows us to directly shape the feature space using contrastive learning. This design ensures the reliable construction of contrastive samples and exploits both intra-class and intra-instance diversity. Our framework is built upon a vanilla teacher–student network architecture for online pseudo-label learning. Consequently, the SFSS model can be jointly regularized in both the feature and label spaces in an end-to-end manner. Extensive experiments demonstrate that our method achieves competitive performance in two challenging SFSS tasks.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":null,"pages":null},"PeriodicalIF":3.9,"publicationDate":"2024-07-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141784698","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RA-RevGAN: region-aware reversible adversarial example generation network for privacy-preserving applications","authors":"Jiacheng Zhao, Xiuming Zhao, Zhihua Gan, Xiuli Chai, Tianfeng Ma, Zhen Chen","doi":"10.1007/s00530-024-01425-6","DOIUrl":"https://doi.org/10.1007/s00530-024-01425-6","url":null,"abstract":"<p>The rise of online sharing platforms has provided people with diverse and convenient ways to share images. However, a substantial amount of sensitive user information is contained within these images, which can be easily captured by malicious neural networks. To ensure the secure utilization of authorized protected data, reversible adversarial attack techniques have emerged. Existing algorithms for generating adversarial examples do not strike a good balance between visibility and attack capability. Additionally, the network oscillations generated during the training process affect the quality of the final examples. To address these shortcomings, we propose a novel reversible adversarial network based on generative adversarial networks (RA-RevGAN). In this paper, the generator is used for noise generation to map features into perturbations of the image, while the region selection module confines these perturbations to specific areas that significantly affect classification. Furthermore, a robust attack mechanism is integrated into the discriminator to stabilize the network’s training by optimizing convergence speed and minimizing time cost. Extensive experiments have demonstrated that the proposed method ensures a high image generation rate, excellent attack capability, and superior visual quality while maintaining high classification accuracy in image restoration.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":null,"pages":null},"PeriodicalIF":3.9,"publicationDate":"2024-07-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141784523","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Application of CLIP for efficient zero-shot learning
Hairui Yang, Ning Wang, Haojie Li, Lei Wang, Zhihui Wang
Multimedia Systems, 2024-07-26. DOI: https://doi.org/10.1007/s00530-024-01414-9

Abstract: Zero-shot learning (ZSL) addresses the challenging task of recognizing classes absent during training. Existing methodologies focus on transferring knowledge from known to unknown categories by formulating a correlation between the visual and semantic spaces, but they face constraints on the discriminability of visual features and the completeness of semantic representations. To alleviate these limitations, we propose a novel Collaborative learning Framework for Zero-Shot Learning (CFZSL), which integrates the CLIP architecture into a fundamental zero-shot learner. Specifically, the foundational zero-shot learning model extracts visual features through a set of CNNs and maps them to a domain-specific semantic space, while the CLIP image encoder extracts visual features carrying universal semantics. In this way, CFZSL obtains discriminative visual features for both domain-specific and domain-agnostic semantics. Additionally, a more comprehensive semantic space is explored by combining the latent feature space learned by CLIP with the domain-specific semantic space. Notably, we leverage only the pre-trained parameters of the CLIP model, mitigating the high training cost and potential overfitting associated with fine-tuning. Our framework has a simple structure and is trained exclusively with classification and triplet loss functions. Extensive experiments on three widely recognized benchmark datasets (AwA2, CUB, and SUN) conclusively affirm the effectiveness and superiority of our approach.
{"title":"CMLCNet: medical image segmentation network based on convolution capsule encoder and multi-scale local co-occurrence","authors":"Chendong Qin, Yongxiong Wang, Jiapeng Zhang","doi":"10.1007/s00530-024-01430-9","DOIUrl":"https://doi.org/10.1007/s00530-024-01430-9","url":null,"abstract":"<p>Medical images have low contrast and blurred boundaries between different tissues or between tissues and lesions. Because labeling medical images is laborious and requires expert knowledge, the labeled data are expensive or simply unavailable. UNet has achieved great success in the field of medical image segmentation. However, the pooling layer in downsampling tends to discard important information such as location information. It is difficult to learn global and long-range semantic interactive information well due to the locality of convolution operation. The usual solution is increasing the number of datasets or enhancing the training data though augmentation methods. However, to obtain a large number of medical datasets is tough, and the augmentation methods may increase the training burden. In this work, we propose a 2D medical image segmentation network with a convolutional capsule encoder and a multiscale local co-occurrence module. To extract more local detail and contextual information, the capsule encoder is introduced to learn the information about the target location and the relationship between the part and the whole. Multi-scale features can be fused by a new attention mechanism, which can then selectively emphasize salient features useful for a specific task by capturing global information and suppress background noise. The proposed attention mechanism is used to preserve the information that is discarded by pooling layers of the network. In addition, a multi-scale local co-occurrence algorithm is proposed, where the context and dependencies between different regions in an image can be better learned. Experimental results on the dataset of Liver, ISIC and BraTS2019 show that our network is superior to the UNet and other previous medical image segmentation networks under the same experimental conditions.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":null,"pages":null},"PeriodicalIF":3.9,"publicationDate":"2024-07-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141784670","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TrafficTrack: rethinking the motion and appearance cue for multi-vehicle tracking in traffic monitoring","authors":"Hui Cai, Haifeng Lin, Dapeng Liu","doi":"10.1007/s00530-024-01407-8","DOIUrl":"https://doi.org/10.1007/s00530-024-01407-8","url":null,"abstract":"<p>Analyzing traffic flow based on data from traffic monitoring is an essential component of intelligent transportation systems. In most traffic scenarios, vehicles are the primary targets, so multi-object tracking of vehicles in traffic monitoring is a critical subject. In view of the current difficulties, such as complex road conditions, numerous obstructions, and similar vehicle appearances, we propose a detection-based multi-object vehicle tracking algorithm that combines motion and appearance cues. Firstly, to improve the motion prediction accuracy, we propose a Kalman filter that adaptively updates the noise according to the motion matching cost and detection confidence score, combined with exponential transformation and residuals. Then, we propose a combined distance to utilize motion and appearance cues. Finally, we present a trajectory recovery strategy to handle unmatched trajectories and detections. Experimental results on the UA-DETRAC dataset demonstrate that this method achieves excellent tracking performance for vehicle tracking tasks in traffic monitoring perspectives, meeting the practical application demands of complex traffic scenarios.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":null,"pages":null},"PeriodicalIF":3.9,"publicationDate":"2024-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141784671","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fs-yolo: fire-smoke detection based on improved YOLOv7","authors":"Dongmei Wang, Ying Qian, Jingyi Lu, Peng Wang, Zhongrui Hu, Yongkang Chai","doi":"10.1007/s00530-024-01359-z","DOIUrl":"https://doi.org/10.1007/s00530-024-01359-z","url":null,"abstract":"<p>Fire has emerged as a major danger to the Earth’s ecological equilibrium and human well-being. Fire detection and alert systems are essential. There is a scarcity of public fire datasets with examples of fire and smoke in real-world situations. Moreover, techniques for recognizing items in fire smoke are imprecise and unreliable when it comes to identifying small objects. We developed a dual dataset to evaluate the model’s ability to handle these difficulties. Introducing FS-YOLO, a new fire detection model with improved accuracy. Training YOLOv7 may lead to overfitting because of the large number of parameters and the limited fire detection object categories. YOLOv7 struggles to recognize small dense objects during feature extraction, resulting in missed detections. The Swin Transformer module has been enhanced to decrease local feature interdependence, obtain a wider range of parameters, and handle features at several levels. The improvements strengthen the model’s robustness and the network’s ability to recognize dense tiny objects. The efficient channel attention was incorporated to reduce the occurrence of false fire detections. Localizing the region of interest and extracting meaningful information aids the model in identifying pertinent areas and minimizing false detections. The proposal also considers using fire-smoke and real-fire-smoke datasets. The latter dataset simulates real-world conditions with occlusions, lens blur, and motion blur. This dataset tests the model’s robustness and adaptability in complex situations. On both datasets, the mAP of FS-YOLO is improved by 6.4<span>(%)</span> and 5.4<span>(%)</span> compared to YOLOv7. In the robustness check experiments, the mAP of FS-YOLO is 4.1<span>(%)</span> and 3.1<span>(%)</span> higher than that of today’s SOTA models YOLOv8s, DINO.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":null,"pages":null},"PeriodicalIF":3.9,"publicationDate":"2024-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141784572","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Insulator defect detection based on BaS-YOLOv5
Yu Zhang, Yinke Dou, Kai Yang, Xiaoyang Song, Jin Wang, Liangliang Zhao
Multimedia Systems, 2024-07-23. DOI: https://doi.org/10.1007/s00530-024-01413-w

Abstract: Deep learning is increasingly used to detect transmission line insulator defects in images captured during unmanned aerial vehicle inspection, but current approaches suffer from both insufficient detection accuracy and insufficient speed. This study first introduces the bidirectional feature pyramid network (BiFPN) module into YOLOv5 to achieve high detection speed while combining image features at different scales, enhancing information representation and enabling accurate detection of insulator defects at different scales. The BiFPN module is then combined with the simple parameter-free attention module (SimAM) to improve feature representation and object detection accuracy; SimAM also enables fusion of features at multiple scales, further improving insulator defect detection performance. Finally, multiple controlled experiments verify the effectiveness and efficiency of the proposed model. Experimental results on self-made datasets show that the combined BiFPN and SimAM model (the improved BaS-YOLOv5) outperforms the original YOLOv5: precision, recall, average precision, and F1 score increase by 6.2%, 5%, 5.9%, and 6%, respectively. BaS-YOLOv5 therefore substantially improves detection accuracy while maintaining a high detection speed, meeting the requirements of real-time insulator defect detection.