Title: Advanced deepfake detection with enhanced Resnet-18 and multilayer CNN max pooling
Authors: Muhammad Fahad, Tao Zhang, Yasir Iqbal, Azaz Ikram, Fazeela Siddiqui, Bin Younas Abdullah, Malik Muhammad Nauman, Xin Zhao, Yanzhang Geng
Journal: The Visual Computer | Published: 2024-09-18 | DOI: 10.1007/s00371-024-03613-x
Abstract: Artificial intelligence has transformed digital media, with generative adversarial networks (GANs) producing fake samples and deepfake videos. Because such content can spread panic, instability, and propaganda, a robust system for distinguishing authentic from counterfeit information is essential in the current social media era. This study offers an automated approach for classifying deepfake videos using advanced machine learning and deep learning techniques: processed videos are classified with an enhanced ResNet-18 combined with convolutional neural network (CNN) multilayer max pooling. The method is trained on GAN-based and AI-generated videos to separate genuine from fake footage. The sub-datasets of FaceForensics (FaceSwap, Face2Face, Deepfakes, NeuralTextures), CelebDF, DeeperForensics, DeepFake Detection, and a privately created dataset are fused into one combined dataset of 11,404 videos covering a diverse range of content, and the model is designed to flag videos with switched faces as fake and unaltered videos as real. The proposed model outperformed conventional methods in predicting video frames, with an accuracy of 99.99%, an F-score of 99.98%, recall of 100%, and precision of 99.99%, confirmed through a comparative analysis. The source code is publicly available at https://doi.org/10.5281/zenodo.12538330.

{"title":"Video-driven musical composition using large language model with memory-augmented state space","authors":"Wan-He Kai, Kai-Xin Xing","doi":"10.1007/s00371-024-03606-w","DOIUrl":"https://doi.org/10.1007/s00371-024-03606-w","url":null,"abstract":"<p>The current landscape of research leveraging large language models (LLMs) is experiencing a surge. Many works harness the powerful reasoning capabilities of these models to comprehend various modalities, such as text, speech, images, videos, etc. However, the research work on LLms for music inspiration is still in its infancy. To fill the gap in this field and break through the dilemma that LLMs can only understand short videos with limited frames, we propose a large language model with state space for long-term video-to-music generation. To capture long-range dependency and maintaining high performance, while further decrease the computing cost, our overall network includes the Enhanced Video Mamba, which incorporates continuous moving window partitioning and local feature augmentation, and a long-term memory bank that captures and aggregates historical video information to mitigate information loss in long sequences. This framework achieves both subquadratic-time computation and near-linear memory complexity, enabling effective long-term video-to-music generation. We conduct a thorough evaluation of our proposed framework. The experimental results demonstrate that our model achieves or surpasses the performance of the current state-of-the-art models. Our code released on https://github.com/kai211233/S2L2-V2M.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"26 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142267684","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: 3D human pose estimation using spatiotemporal hypergraphs and its public benchmark on opera videos
Authors: Xingquan Cai, Haoyu Zhang, LiZhe Chen, YiJie Wu, Haiyan Sun
Journal: The Visual Computer | Published: 2024-09-17 | DOI: 10.1007/s00371-024-03604-y
Abstract: Graph convolutional networks significantly improve 3D human pose estimation accuracy by representing the human skeleton as an undirected spatiotemporal graph. However, this representation fails to capture cross-connection interactions among multiple joints, and current 3D pose estimation methods incur larger errors on opera videos because clothing and movements cause occlusion. This paper proposes a 3D human pose estimation method based on spatiotemporal hypergraphs for opera videos. First, the 2D pose sequence of the performer is taken as input, and multiple spatiotemporal hypergraphs representing the spatial correlation and temporal continuity of the joints are generated from the interaction information between joints in the opera actions. Then, a hypergraph convolution network built on these joint spatiotemporal hypergraphs extracts spatiotemporal features from the 2D pose sequence. Finally, a multi-hypergraph cross-attention mechanism strengthens the correlation between spatiotemporal hypergraphs and predicts the 3D poses. Experiments show that the method achieves the best performance on the Human3.6M and MPI-INF-3DHP datasets compared with graph convolutional network and Transformer-based methods, and ablation studies show that the generated spatiotemporal hypergraphs improve accuracy over the undirected spatiotemporal graph. The method obtains accurate 3D human poses under clothing and limb occlusion in opera videos. Code will be available at https://github.com/zhanghaoyu0408/hyperAzzy.

Title: Lunet: an enhanced upsampling fusion network with efficient self-attention for semantic segmentation
Authors: Yan Zhou, Haibin Zhou, Yin Yang, Jianxun Li, Richard Irampaye, Dongli Wang, Zhengpeng Zhang
Journal: The Visual Computer | Published: 2024-09-16 | DOI: 10.1007/s00371-024-03590-1
Abstract: Semantic segmentation is essential to many computer vision tasks. Self-attention (SA)-based deep learning methods achieve impressive segmentation results by capturing long-range dependencies and contextual information, but the standard SA module has high computational complexity, which limits its use in resource-constrained scenarios. This paper proposes LUNet to improve segmentation performance while addressing the computational cost of SA. The lightweight self-attention plus (LSA++) module is introduced as a lightweight and efficient variant of the SA module; it uses a compact feature representation and local position embedding to significantly reduce computational complexity while surpassing the accuracy of standard SA. To address the loss of edge details during decoding, the enhanced upsampling fusion module (EUP-FM) combines an enhanced upsampling module with a semantic vector-guided fusion mechanism, effectively recovering edge information and improving the precision of the segmentation map. Comprehensive experiments on PASCAL VOC 2012, Cityscapes, COCO, and SegPC 2021 show that LUNet outperforms all compared methods, achieving superior runtime performance and accurate segmentation with strong generalization. The code is available at https://github.com/hbzhou530/LUNet.

{"title":"FDDCC-VSR: a lightweight video super-resolution network based on deformable 3D convolution and cheap convolution","authors":"Xiaohu Wang, Xin Yang, Hengrui Li, Tao Li","doi":"10.1007/s00371-024-03621-x","DOIUrl":"https://doi.org/10.1007/s00371-024-03621-x","url":null,"abstract":"<p>Currently, the mainstream deep video super-resolution (VSR) models typically employ deeper neural network layers or larger receptive fields. This approach increases computational requirements, making network training difficult and inefficient. Therefore, this paper proposes a VSR model called fusion of deformable 3D convolution and cheap convolution (FDDCC-VSR).In FDDCC-VSR, we first divide the detailed features of each frame in VSR into dynamic features of visual moving objects and details of static backgrounds. This division allows for the use of fewer specialized convolutions in feature extraction, resulting in a lightweight network that is easier to train. Furthermore, FDDCC-VSR incorporates multiple D-C CRBs (Convolutional Residual Blocks), which establish a lightweight spatial attention mechanism to aid deformable 3D convolution. This enables the model to focus on learning the corresponding feature details. Finally, we employ an improved bicubic interpolation combined with subpixel techniques to enhance the PSNR (Peak Signal-to-Noise Ratio) value of the original image. Detailed experiments demonstrate that FDDCC-VSR outperforms the most advanced algorithms in terms of both subjective visual effects and objective evaluation criteria. Additionally, our model exhibits a small parameter and calculation overhead.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"3 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142267686","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Topological structure extraction for computing surface–surface intersection curves","authors":"Pengbo Bo, Qingxiang Liu, Caiming Zhang","doi":"10.1007/s00371-024-03616-8","DOIUrl":"https://doi.org/10.1007/s00371-024-03616-8","url":null,"abstract":"<p>Surface–surface intersection curve computation is a fundamental problem in CAD and solid modeling. Extracting the structure of intersection curves accurately, especially when there are multiple overlapping curves, is a key challenge. Existing methods rely on densely sampled intersection points and proximity-based connections, which are time-consuming to obtain. In this paper, we propose a novel method based on Delaunay triangulation to accurately extract intersection curves, even with sparse intersection points. We also introduce an intersection curve optimization technique to enhance curve accuracy. Extensive experiments on various examples demonstrate the effectiveness of our method.\u0000</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"77 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142267440","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Optimizing underwater image enhancement: integrating semi-supervised learning and multi-scale aggregated attention
Authors: Sunhan Xu, Jinhua Wang, Ning He, Guangmei Xu, Geng Zhang
Journal: The Visual Computer | Published: 2024-09-16 | DOI: 10.1007/s00371-024-03611-z
Abstract: Underwater image enhancement is critical for marine science and underwater engineering, but traditional methods often struggle with color distortion, low contrast, and blurred details caused by the challenging underwater environment. We introduce Semi-UIE, a semi-supervised underwater image enhancement framework that leverages unlabeled data alongside limited labeled data to significantly improve generalization. The framework integrates a novel aggregated attention within a UNet architecture, using multi-scale convolutional kernels for efficient feature aggregation, which improves the sharpness and authenticity of underwater visuals while remaining computationally efficient. Semi-UIE captures both macro- and micro-level details, effectively addressing the common issues of over-correction and detail loss. Experiments on several public datasets, including UIEBD and EUVP, show marked improvements in image quality metrics over existing methods, and superior performance on unlabeled datasets confirms the model's robustness across diverse underwater environments. Code and pre-trained models are available at https://github.com/Sunhan-Ash/Semi-UIE.

{"title":"FFCANet: a frequency channel fusion coordinate attention mechanism network for lane detection","authors":"Shijie Li, Shanhua Yao, Zhonggen Wang, Juan Wu","doi":"10.1007/s00371-024-03626-6","DOIUrl":"https://doi.org/10.1007/s00371-024-03626-6","url":null,"abstract":"<p>Lane line detection becomes a challenging task in complex and dynamic driving scenarios. Addressing the limitations of existing lane line detection algorithms, which struggle to balance accuracy and efficiency in complex and changing traffic scenarios, a frequency channel fusion coordinate attention mechanism network (FFCANet) for lane detection is proposed. A residual neural network (ResNet) is used as a feature extraction backbone network. We propose a feature enhancement method with a frequency channel fusion coordinate attention mechanism (FFCA) that captures feature information from different spatial orientations and then uses multiple frequency components to extract detail and texture features of lane lines. A row-anchor-based prediction and classification method treats lane line detection as a problem of selecting lane marking anchors within row-oriented cells predefined by global features, which greatly improves the detection speed and can handle visionless driving scenarios. Additionally, an efficient channel attention (ECA) module is integrated into the auxiliary segmentation branch to capture dynamic dependencies between channels, further enhancing feature extraction capabilities. The performance of the model is evaluated on two publicly available datasets, TuSimple and CULane. Simulation results demonstrate that the average processing time per image frame is 5.0 ms, with an accuracy of 96.09% on the TuSimple dataset and an F1 score of 72.8% on the CULane dataset. The model exhibits excellent robustness in detecting complex scenes while effectively balancing detection accuracy and speed. The source code is available at https://github.com/lsj1012/FFCANet/tree/master</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142267685","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Text-guided floral image generation based on lightweight deep attention feature fusion GAN
Authors: Wenji Yang, Hang An, Wenchao Hu, Xinxin Ma, Liping Xie
Journal: The Visual Computer | Published: 2024-09-14 | DOI: 10.1007/s00371-024-03617-7
Abstract: Generating floral images conditioned on textual descriptions is highly challenging. Most existing text-to-floral image synthesis methods adopt a single-stage generation architecture, which often requires substantial hardware resources, such as large GPU clusters and many training images, and tends to lose detail when shallow image features are fused with deep ones. To address these challenges, this paper proposes a lightweight deep attention feature fusion generative adversarial network for text-to-floral image generation that performs well even with limited hardware resources. Specifically, a novel deep attention text–image fusion block leverages multi-scale channel attention to enhance detail rendering and visual consistency in text-generated floral images. In addition, a self-supervised target-aware discriminator learns a richer feature-mapping coverage from input images, which helps the generator produce higher-quality images and improves GAN training efficiency, further reducing resource consumption. Extensive experiments on datasets of three different sample sizes validate the effectiveness of the proposed model. Source code and pretrained models are available at https://github.com/BoomAnm/LDAF-GAN.

{"title":"Research on a small target object detection method for aerial photography based on improved YOLOv7","authors":"Jiajun Yang, Xuesong Zhang, Cunli Song","doi":"10.1007/s00371-024-03615-9","DOIUrl":"https://doi.org/10.1007/s00371-024-03615-9","url":null,"abstract":"<p>In aerial imagery analysis, detecting small targets is highly challenging due to their minimal pixel representation and complex backgrounds. To address this issue, this manuscript proposes a novel method for detecting small aerial targets. Firstly, the K-means + + algorithm is utilized to generate anchor boxes suitable for small targets. Secondly, the YOLOv7-BFAW model is proposed. This method incorporates a series of improvements to YOLOv7, including the integration of a BBF residual structure based on BiFormer and BottleNeck fusion into the backbone network, the design of an MPsim module based on simAM attention for the head network, and the development of a novel loss function, inner-WIoU v2, as the localization loss function, based on WIoU v2. Experiments demonstrate that YOLOv7-BFAW achieves a 4.2% mAP@.5 improvement on the DOTA v1.0 dataset and a 1.7% mAP@.5 improvement on the VisDrone2019 dataset, showcasing excellent generalization capabilities. Furthermore, it is shown that YOLOv7-BFAW exhibits superior detection performance compared to state-of-the-art algorithms.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"36 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142267442","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}