{"title":"LiteMSNet: a lightweight semantic segmentation network with multi-scale feature extraction for urban streetscape scenes","authors":"Lirong Li, Jiang Ding, Hao Cui, Zhiqiang Chen, Guisheng Liao","doi":"10.1007/s00371-024-03569-y","DOIUrl":"https://doi.org/10.1007/s00371-024-03569-y","url":null,"abstract":"<p>Semantic segmentation plays a pivotal role in computer scene understanding, but it typically requires a large amount of computing to achieve high performance. To achieve a balance between accuracy and complexity, we propose a lightweight semantic segmentation model, termed LiteMSNet (a Lightweight Semantic Segmentation Network with Multi-Scale Feature Extraction for urban streetscape scenes). In this model, we propose a novel Improved Feature Pyramid Network, which embeds a shuffle attention mechanism followed by a stacked Depth-wise Asymmetric Gating Module. Furthermore, a Multi-scale Dilation Pyramid Module is developed to expand the receptive field and capture multi-scale feature information. Finally, the proposed lightweight model integrates two loss mechanisms, the Cross-Entropy and the Dice Loss functions, which effectively mitigate the issue of data imbalance and gradient saturation. Numerical experimental results on the CamVid dataset demonstrate a remarkable mIoU measurement of 70.85% with less than 5M parameters, accompanied by a real-time inference speed of 66.1 FPS, surpassing the existing methods documented in the literature. The code for this work will be made available at https://github.com/River-ding/LiteMSNet.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"39 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141742295","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dual adaptive local semantic alignment for few-shot fine-grained classification","authors":"Wei Song, Kaili Yang","doi":"10.1007/s00371-024-03576-z","DOIUrl":"https://doi.org/10.1007/s00371-024-03576-z","url":null,"abstract":"<p>Few-shot fine-grained classification (FS-FGC) aims to learn discriminative semantic details (e.g., beaks and wings) with few labeled samples to precisely recognize novel classes. However, existing feature alignment methods mainly use a support set to align the query sample, which may lead to incorrect alignment of local semantic due to interference from background and non-target objects. In addition, these methods do not take into account the discrepancy of semantic information among channels. To address the above issues, we propose an effective dual adaptive local semantic alignment approach, which is composed of the channel semantic alignment module (CSAM) and the spatial semantic alignment module (SSAM). Specifically, CSAM adaptively generates channel weights to highlight discriminative information based on two sub-modules, namely the class-aware attention module and the target-aware attention module. CAM emphasizes the discriminative semantic details of each category in the support set and TAM enhances the target object region of the query image. On the basis of this, SSAM promotes effective alignment of semantically relevant local regions through a spatial bidirectional alignment strategy. Combining two adaptive modules to better capture fine-grained semantic contextual information along two dimensions, channel and spatial improves the accuracy and robustness of FS-FGC. Experimental results on three widely used fine-grained classification datasets demonstrate excellent performance that has significant competitive advantages over current mainstream methods. Codes are available at: https://github.com/kellyagya/DALSA.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"24 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141742400","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"STVDNet: spatio-temporal interactive video de-raining network","authors":"Ze Ouyang, Huihuang Zhao, Yudong Zhang, Long Chen","doi":"10.1007/s00371-024-03565-2","DOIUrl":"https://doi.org/10.1007/s00371-024-03565-2","url":null,"abstract":"<p>Video de-raining is of significant importance problem in computer vision as rain streaks adversely affect the visual quality of images and hinder subsequent vision-related tasks. Existing video de-raining methods still face challenges such as black shadows and loss of details. In this paper, we introduced a novel de-raining framework called STVDNet, which effectively solves the issues of black shadows and detail loss after de-raining. STVDNet utilizes a Spatial Detail Feature Extraction Module based on an auto-encoder to capture the spatial characteristics of the video. Additionally, we introduced an innovative interaction between the extracted spatial features and Spatio-Temporal features using LSTM to generate initial de-raining results. Finally, we employed 3D convolution and 2D convolution for the detailed processing of the coarse videos. During the training process, we utilized three loss functions, among which the SSIM loss function was employed to process the generated videos, aiming to enhance their detail structure and color recovery. Through extensive experiments conducted on three public datasets, we demonstrated the superiority of our proposed method over state-of-the-art approaches. We also provide our code and pre-trained models at https://github.com/O-Y-ZONE/STVDNet.git.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"13 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141742401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Vman: visual-modified attention network for multimodal paradigms","authors":"Xiaoyu Song, Dezhi Han, Chongqing Chen, Xiang Shen, Huafeng Wu","doi":"10.1007/s00371-024-03563-4","DOIUrl":"https://doi.org/10.1007/s00371-024-03563-4","url":null,"abstract":"<p>Due to excellent dependency modeling and powerful parallel computing capabilities, Transformer has become the primary research method in vision-language tasks (VLT). However, for multimodal VLT like VQA and VG, which demand high-dependency modeling and heterogeneous modality comprehension, solving the issues of introducing noise, insufficient information interaction, and obtaining more refined visual features during the image self-interaction of conventional Transformers is challenging. Therefore, this paper proposes a universal visual-modified attention network (VMAN) to address these problems. Specifically, VMAN optimizes the attention mechanism in Transformer, introducing a visual-modified attention unit that establishes text-visual correspondence before the self-interaction of image information. Modulating image features with modified units to obtain more refined query features for subsequent interaction, filtering out noise information while enhancing dependency modeling and reasoning capabilities. Furthermore, two modified approaches have been designed: the weighted sum-based approach and the cross-attention-based approach. Finally, we conduct extensive experiments on VMAN across five benchmark datasets for two tasks (VQA, VG). The results indicate that VMAN achieves an accuracy of 70.99<span>(%)</span> on the VQA-v2 and makes a breakthrough of 74.41<span>(%)</span> on the RefCOCOg which involves more complex expressions. The results fully prove the rationality and effectiveness of VMAN. The code is available at https://github.com/79song/VMAN.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"63 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141742402","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep fake detection using an optimal deep learning model with multi head attention-based feature extraction scheme","authors":"R. Raja Sekar, T. Dhiliphan Rajkumar, Koteswara Rao Anne","doi":"10.1007/s00371-024-03567-0","DOIUrl":"https://doi.org/10.1007/s00371-024-03567-0","url":null,"abstract":"<p>Face forgery, or deep fake, is a frequently used method to produce fake face images, network pornography, blackmail, and other illegal activities. Researchers developed several detection approaches based on the changing traces presented by deep forgery to limit the damage caused by deep fake methods. They obtain limited performance when evaluating cross-datum scenarios. This paper proposes an optimal deep learning approach with an attention-based feature learning scheme to perform DFD more accurately. The proposed system mainly comprises ‘5’ phases: face detection, preprocessing, texture feature extraction, spatial feature extraction, and classification. The face regions are initially detected from the collected data using the Viola–Jones (VJ) algorithm. Then, preprocessing is carried out, which resizes and normalizes the detected face regions to improve their quality for detection purposes. Next, texture features are learned using the Butterfly Optimized Gabor Filter to get information about the local features of objects in an image. Then, the spatial features are extracted using Residual Network-50 with Multi Head Attention (RN50MHA) to represent the data globally. Finally, classification is done using the Optimal Long Short-Term Memory (OLSTM), which classifies the data as fake or real, in which optimization of network is done using Enhanced Archimedes Optimization Algorithm. The proposed system is evaluated on four benchmark datasets such as Face Forensics + + (FF + +), Deepfake Detection Challenge, Celebrity Deepfake (CDF), and Wild Deepfake. The experimental results show that DFD using OLSTM and RN50MHA achieves a higher inter and intra-dataset detection rate than existing state-of-the-art methods.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"76 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141721723","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning to sculpt neural cityscapes","authors":"Jialin Zhu, He Wang, David Hogg, Tom Kelly","doi":"10.1007/s00371-024-03528-7","DOIUrl":"https://doi.org/10.1007/s00371-024-03528-7","url":null,"abstract":"<p>We introduce a system that learns to sculpt 3D models of massive urban environments. The majority of humans live their lives in urban environments, using detailed virtual models for applications as diverse as virtual worlds, special effects, and urban planning. Generating such 3D models from exemplars manually is time-consuming, while 3D deep learning approaches have high memory costs. In this paper, we present a technique for training 2D neural networks to repeatedly sculpt a plane into a large-scale 3D urban environment. An initial coarse depth map is created by a GAN model, from which we refine 3D normal and depth using an image translation network regularized by a linear system. The networks are trained using real-world data to allow generative synthesis of meshes at scale. We exploit sculpting from multiple viewpoints to generate a highly detailed, concave, and water-tight 3D mesh. We show cityscapes at scales of <span>(100 times 1600)</span> meters with more than 2 million triangles, and demonstrate that our results are objectively and subjectively similar to our exemplars.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"9 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141610820","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ACL-SAR: model agnostic adversarial contrastive learning for robust skeleton-based action recognition","authors":"Jiaxuan Zhu, Ming Shao, Libo Sun, Siyu Xia","doi":"10.1007/s00371-024-03548-3","DOIUrl":"https://doi.org/10.1007/s00371-024-03548-3","url":null,"abstract":"<p>Human skeleton data have been widely explored in action recognition and the human–computer interface recently, thanks to off-the-shelf motion sensors and cameras. With the widespread usage of deep models on human skeleton data, their vulnerabilities under adversarial attacks have raised increasing security concerns. Although there are several works focusing on attack strategies, fewer efforts are put into defense against adversaries in skeleton-based action recognition, which is nontrivial. In addition, labels required in adversarial learning are another pain in adversarial training-based defense. This paper proposes a robust model agnostic adversarial contrastive learning framework for this task. First, we introduce an adversarial contrastive learning framework for skeleton-based action recognition (ACL-SAR). Second, the nature of cross-view skeleton data enables cross-view adversarial contrastive learning (CV-ACL-SAR) as a further improvement. Third, adversarial attack and defense strategies are investigated, including alternate instance-wise attacks and options in adversarial training. To validate the effectiveness of our method, we conducted extensive experiments on the NTU-RGB+D and HDM05 datasets. The results show that our defense strategies are not only robust to various adversarial attacks but can also maintain generalization.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"20 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141610897","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Autocleandeepfood: auto-cleaning and data balancing transfer learning for regional gastronomy food computing","authors":"Nauman Ullah Gilal, Marwa Qaraqe, Jens Schneider, Marco Agus","doi":"10.1007/s00371-024-03560-7","DOIUrl":"https://doi.org/10.1007/s00371-024-03560-7","url":null,"abstract":"<p>Food computing has emerged as a promising research field, employing artificial intelligence, deep learning, and data science methodologies to enhance various stages of food production pipelines. To this end, the food computing community has compiled a variety of data sets and developed various deep-learning architectures to perform automatic classification. However, automated food classification presents a significant challenge, particularly when it comes to local and regional cuisines, which are often underrepresented in available public-domain data sets. Nevertheless, obtaining high-quality, well-labeled, and well-balanced real-world labeled images is challenging since manual data curation requires significant human effort and is time-consuming. In contrast, the web has a potentially unlimited source of food data but tapping into this resource has a good chance of corrupted and wrongly labeled images. In addition, the uneven distribution among food categories may lead to data imbalance problems. All these issues make it challenging to create clean data sets for food from web data. To address this issue, we present <i>AutoCleanDeepFood</i>, a novel end-to-end food computing framework for regional gastronomy that contains the following components: (i) a fully automated pre-processing pipeline for custom data sets creation related to specific regional gastronomy, (ii) a transfer learning-based training paradigm to filter out noisy labels through loss ranking, incorporating a Russian Roulette probabilistic approach to mitigate data imbalance problems, and (iii) a method for deploying the resulting model on smartphones for real-time inferences. We assess the performance of our framework on a real-world noisy public domain data set, ETH Food-101, and two novel web-collected datasets, MENA-150 and Pizza-Styles. We demonstrate the filtering capabilities of our proposed method through embedding visualization of the feature space using the t-SNE dimension reduction scheme. Our filtering scheme is efficient and effectively improves accuracy in all cases, boosting performance by 0.96, 0.71, and 1.29% on MENA-150, ETH Food-101, and Pizza-Styles, respectively.\u0000</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"17 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141574976","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Robust consistency learning for facial expression recognition under label noise","authors":"Yumei Tan, Haiying Xia, Shuxiang Song","doi":"10.1007/s00371-024-03558-1","DOIUrl":"https://doi.org/10.1007/s00371-024-03558-1","url":null,"abstract":"<p>Label noise is inevitable in facial expression recognition (FER) datasets, especially for datasets that collected by web crawling, crowd sourcing in in-the-wild scenarios, which makes FER task more challenging. Recent advances tackle label noise by leveraging sample selection or constructing label distribution. However, they rely heavily on labels, which can result in confirmation bias issues. In this paper, we present RCL-Net, a simple yet effective robust consistency learning network, which combats label noise by learning robust representations and robust losses. RCL-Net can efficiently tackle facial samples with noisy labels commonly found in real-world datasets. Specifically, we first use a two-view-based backbone to embed facial images into high- and low-dimensional subspaces and then regularize the geometric structure of the high- and low-dimensional subspaces using an unsupervised dual-consistency learning strategy. Benefiting from the unsupervised dual-consistency learning strategy, we can obtain robust representations to combat label noise. Further, we impose a robust consistency regularization technique on the predictions of the classifiers to improve the whole network’s robustness. Comprehensive evaluations on three popular real-world FER datasets demonstrate that RCL-Net can effectively mitigate the impact of label noise, which significantly outperforms state-of-the-art noisy label FER methods. RCL-Net also shows better generalization capability to other tasks like CIFAR100 and Tiny-ImageNet. Our code and models will be available at this https https://github.com/myt889/RCL-Net.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"40 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141548417","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A deep dive into enhancing sharing of naturalistic driving data through face deidentification","authors":"Surendrabikram Thapa, Abhijit Sarkar","doi":"10.1007/s00371-024-03552-7","DOIUrl":"https://doi.org/10.1007/s00371-024-03552-7","url":null,"abstract":"<p>Human factors research in transportation relies on naturalistic driving studies (NDS) which collect real-world data from drivers on actual roads. NDS data offer valuable insights into driving behavior, styles, habits, and safety-critical events. However, these data often contain personally identifiable information (PII), such as driver face videos, which cannot be publicly shared due to privacy concerns. To address this, our paper introduces a comprehensive framework for deidentifying drivers’ face videos, that can facilitate the wide sharing of driver face videos while protecting PII. Leveraging recent advancements in generative adversarial networks (GANs), we explore the efficacy of different face swapping algorithms in preserving essential human factors attributes while anonymizing participants’ identities. Most face swapping algorithms are tested in restricted lighting conditions and indoor settings, there is no known study that tested them in adverse and natural situations. We conducted extensive experiments using large-scale outdoor NDS data, evaluating the quantification of errors associated with head, mouth, and eye movements, along with other attributes important for human factors research. Additionally, we performed qualitative assessments of these methods through human evaluators providing valuable insights into the quality and fidelity of the deidentified videos. We propose the utilization of synthetic faces as substitutes for real faces to enhance generalization. Additionally, we created practical guidelines for video deidentification, emphasizing error threshold creation, spot-checking for abrupt metric changes, and mitigation strategies for reidentification risks. Our findings underscore nuanced challenges in balancing data utility and privacy, offering valuable insights into enhancing face video deidentification techniques in NDS scenarios.\u0000</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"30 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141548421","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}