{"title":"Filter-deform attention GAN: constructing human motion videos from few images","authors":"Jianjun Zhu, Huihuang Zhao, Yudong Zhang","doi":"10.1007/s00371-024-03595-w","DOIUrl":"https://doi.org/10.1007/s00371-024-03595-w","url":null,"abstract":"<p>Human motion transfer is challenging due to the complexity and diversity of human motion and clothing textures. Existing methods use 2D pose estimation to obtain poses, which can easily lead to unsmooth motion and artifacts. Therefore, this paper proposes a highly robust motion transmission model based on image deformation, called the Filter-Deform Attention Generative Adversarial Network (FDA GAN). This method can transmit complex human motion videos using only few human images. First, we use a 3D pose shape estimator instead of traditional 2D pose estimation to address the problem of unsmooth motion. Then, to tackle the artifact problem, we design a new attention mechanism and integrate it with the GAN, proposing a new network capable of effectively extracting image features and generating human motion videos. Finally, to further transfer the style of the source human, we propose a two-stream style loss, which enhances the model’s learning ability. Experimental results demonstrate that the proposed method outperforms recent methods in overall performance and various evaluation metrics. Project page: https://github.com/mioyeah/FDA-GAN.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"44 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187204","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GLDC: combining global and local consistency of multibranch depth completion","authors":"Yaping Deng, Yingjiang Li, Zibo Wei, Keying Li","doi":"10.1007/s00371-024-03609-7","DOIUrl":"https://doi.org/10.1007/s00371-024-03609-7","url":null,"abstract":"<p>Depth completion aims to generate dense depth maps from sparse depth maps and corresponding RGB images. In this task, the locality based on the convolutional layer poses challenges for the network in obtaining global information. While the Transformer-based architecture performs well in capturing global information, it may lead to the loss of local detail features. Consequently, improving the simultaneous attention to global and local information is crucial for achieving effective depth completion. This paper proposes a novel and effective dual-encoder–three-decoder network, consisting of local and global branches. Specifically, the local branch uses a convolutional network, and the global branch utilizes a Transformer network to extract rich features. Meanwhile, the local branch is dominated by color image and the global branch is dominated by depth map to thoroughly integrate and utilize multimodal information. In addition, a gate fusion mechanism is used in the decoder stage to fuse local and global information, to achieving high-performance depth completion. This hybrid architecture is conducive to the effective fusion of local detail information and contextual information. Experimental results demonstrated the superiority of our method over other advanced methods on KITTI Depth Completion and NYU v2 datasets.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"16 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187202","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Orhlr-net: one-stage residual learning network for joint single-image specular highlight detection and removal","authors":"Wenzhe Shi, Ziqi Hu, Hao Chen, Hengjia Zhang, Jiale Yang, Li Li","doi":"10.1007/s00371-024-03607-9","DOIUrl":"https://doi.org/10.1007/s00371-024-03607-9","url":null,"abstract":"<p>Detecting and removing specular highlights is a complex task that can greatly enhance various visual tasks in real-world environments. Although previous works have made great progress, they often ignore specular highlight areas or produce unsatisfactory results with visual artifacts such as color distortion. In this paper, we present a framework that utilizes an encoder–decoder structure for the combined task of specular highlight detection and removal in single images, employing specular highlight mask guidance. The encoder uses EfficientNet as a feature extraction backbone network to convert the input RGB image into a series of feature maps. The decoder gradually restores these feature maps to their original size through up-sampling. In the specular highlight detection module, we enhance the network by utilizing residual modules to extract additional feature information, thereby improving detection accuracy. For the specular highlight removal module, we introduce the Convolutional Block Attention Module, which dynamically captures the importance of each channel and spatial location in the input feature map. This enables the model to effectively distinguish between foreground and background, resulting in enhanced adaptability and accuracy in complex scenes. We evaluate the proposed method on the publicly available SHIQ dataset, and its superiority is demonstrated through a comparative analysis of the experimental results. The source code will be available at https://github.com/hzq2333/ORHLR-Net.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"14 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187206","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Slot-VTON: subject-driven diffusion-based virtual try-on with slot attention","authors":"Jianglei Ye, Yigang Wang, Fengmao Xie, Qin Wang, Xiaoling Gu, Zizhao Wu","doi":"10.1007/s00371-024-03603-z","DOIUrl":"https://doi.org/10.1007/s00371-024-03603-z","url":null,"abstract":"<p>Virtual try-on aims to transfer clothes from one image to another while preserving intricate wearer and clothing details. Tremendous efforts have been made to facilitate the task based on deep generative models such as GAN and diffusion models; however, the current methods have not taken into account the influence of the natural environment (background and unrelated impurities) on clothing image, leading to issues such as loss of detail, intricate textures, shadows, and folds. In this paper, we introduce Slot-VTON, a slot attention-based inpainting approach for seamless image generation in a subject-driven way. Specifically, we adopt an attention mechanism, termed slot attention, that can unsupervisedly separate the various subjects within images. With slot attention, we distill the clothing image into a series of slot representations, where each slot represents a subject. Guided by the extracted clothing slot, our method is capable of eliminating the interference of other unnecessary factors, thereby better preserving the complex details of the clothing. To further enhance the seamless generation of the diffusion model, we design a fusion adapter that integrates multiple conditions, including the slot and other added clothing conditions. In addition, a non-garment inpainting module is used to further fix visible seams and preserve non-clothing area details (hands, neck, etc.). Multiple experiments on VITON-HD datasets validate the efficacy of our methods, showcasing state-of-the-art generation performances. Our implementation is available at: https://github.com/SilverLakee/Slot-VTON.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"26 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142224451","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"EGCT: enhanced graph convolutional transformer for 3D point cloud representation learning","authors":"Gang Chen, Wenju Wang, Haoran Zhou, Xiaolin Wang","doi":"10.1007/s00371-024-03600-2","DOIUrl":"https://doi.org/10.1007/s00371-024-03600-2","url":null,"abstract":"<p>It is an urgent problem of high-precision 3D environment perception to carry out representation learning on point cloud data, which complete the synchronous acquisition of local and global feature information. However, current representation learning methods either only focus on how to efficiently learn local features, or capture long-distance dependencies but lose the fine-grained features. Therefore, we explore transformer on topological structures of point cloud graphs, proposing an enhanced graph convolutional transformer (EGCT) method. EGCT construct graph topology for disordered and unstructured point cloud. Then it uses the enhanced point feature representation method to further aggregate the feature information of all neighborhood points, which can compactly represent the features of this local neighborhood graph. Subsequent process, the graph convolutional transformer simultaneously performs self-attention calculations and convolution operations on the point coordinates and features of the neighborhood graph. It efficiently utilizes the spatial geometric information of point cloud objects. Therefore, EGCT learns comprehensive geometric information of point cloud objects, which can help to improve segmentation and classification accuracy. On the ShapeNetPart and ModelNet40 datasets, our EGCT method achieves a mIoU of 86.8%, OA and AA of 93.5% and 91.2%, respectively. On the large-scale indoor scene point cloud dataset (S3DIS), the OA of EGCT method is 90.1%, and the mIoU is 67.8%. Experimental results demonstrate that our EGCT method can achieve comparable point cloud segmentation and classification performance to state-of-the-art methods while maintaining low model complexity. Our source code is available at https://github.com/shepherds001/EGCT.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"33 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187207","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mfpenet: multistage foreground-perception enhancement network for remote-sensing scene classification","authors":"Junding Sun, Chenxu Wang, Haifeng Sima, Xiaosheng Wu, Shuihua Wang, Yudong Zhang","doi":"10.1007/s00371-024-03587-w","DOIUrl":"https://doi.org/10.1007/s00371-024-03587-w","url":null,"abstract":"<p>Scene classification plays a vital role in the field of remote-sensing (RS). However, remote-sensing images have the essential properties of complex scene information and large-scale spatial changes, as well as the high similarity between various classes and the significant differences within the same class, which brings great challenges to scene classification. To address these issues, a multistage foreground-perception enhancement network (MFPENet) is proposed to enhance the ability to perceive foreground features, thereby improving classification accuracy. Firstly, to enrich the scene semantics of feature information, a multi-scale feature aggregation module is specifically designed using dilated convolution, which takes the features of different stages of the backbone network as input data to obtain enhanced multiscale features. Then, a novel foreground-perception enhancement module is designed to capture foreground information. Unlike the previous methods, we separate foreground features by designing feature masks and then innovatively explore the symbiotic relationship between foreground features and scene features to improve the recognition ability of foreground features further. Finally, a hierarchical attention module is designed to reduce the interference of redundant background details on classification. By embedding the dependence between adjacent level features into the attention mechanism, the model can pay more accurate attention to the key information. Redundancy is reduced, and the loss of useful information is minimized. Experiments on three public RS scene classification datasets [UC-Merced, Aerial Image Dataset, and NWPU-RESISC45] show that our method achieves highly competitive results. Future work will focus on utilizing the background features outside the effective foreground features in the image as a decision aid to improve the distinguishability between similar scenes. The source code of our proposed algorithm and the related datasets are available at https://github.com/Hpu-wcx/MFPENet.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"33 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187209","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Directional latent space representation for medical image segmentation","authors":"Xintao Liu, Yan Gao, Changqing Zhan, Qiao Wangr, Yu Zhang, Yi He, Hongyan Quan","doi":"10.1007/s00371-024-03589-8","DOIUrl":"https://doi.org/10.1007/s00371-024-03589-8","url":null,"abstract":"<p>Excellent medical image segmentation plays an important role in computer-aided diagnosis. Deep mining of pixel semantics is crucial for medical image segmentation. However, previous works on medical semantic segmentation usually overlook the importance of embedding subspace, and lacked the mining of latent space direction information. In this work, we construct global orthogonal bases and channel orthogonal bases in the latent space, which can significantly enhance the feature representation. We propose a novel distance-based segmentation method that decouples the embedding space into sub-embedding spaces of different classes, and then implements pixel level classification based on the distance between its embedding features and the origin of the subspace. Experiments on various public medical image segmentation benchmarks have shown that our model is superior compared to state-of-the-art methods. The code will be published at https://github.com/lxt0525/LSDENet.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"27 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142224453","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Toward robust visual tracking for UAV with adaptive spatial-temporal weighted regularization","authors":"Zhi Chen, Lijun Liu, Zhen Yu","doi":"10.1007/s00371-024-03290-w","DOIUrl":"https://doi.org/10.1007/s00371-024-03290-w","url":null,"abstract":"<p>The unmanned aerial vehicles (UAV) visual object tracking method based on the discriminative correlation filter (DCF) has gained extensive research and attention due to its superior computation and extraordinary progress, but is always suffers from unnecessary boundary effects. To solve the aforementioned problems, a spatial-temporal regularization correlation filter framework is proposed, which is achieved by introducing a constant regularization term to penalize the coefficients of the DCF filter. The tracker can substantially improve the tracking performance but increase computational complexity. However, these kinds of methods make the object fail to adapt to specific appearance variations, and we need to pay much effort in fine-tuning the spatial-temporal regularization weight coefficients. In this work, an adaptive spatial-temporal weighted regularization (ASTWR) model is proposed. An ASTWR module is introduced to obtain the weighted spatial-temporal regularization coefficients automatically. The proposed ASTWR model can deal effectively with complex situations and substantially improve the credibility of tracking results. In addition, an adaptive spatial-temporal constraint adjusting mechanism is proposed. By repressing the drastic appearance changes between adjacent frames, the tracker enables smooth filter learning in the detection phase. Substantial experiments show that the proposed tracker performs favorably against homogeneous UAV-based and DCF-based trackers. Moreover, the ASTWR tracker reaches over 35 FPS on a single CPU platform, and gains an AUC score of 57.9% and 49.7% on the UAV123 and VisDrone2020 datasets, respectively.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"56 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141942699","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data visualization in healthcare and medicine: a survey","authors":"Xunan Tan, Xiang Suo, Wenjun Li, Lei Bi, Fangshu Yao","doi":"10.1007/s00371-024-03586-x","DOIUrl":"https://doi.org/10.1007/s00371-024-03586-x","url":null,"abstract":"<p>Visualization analysis is crucial in healthcare as it provides insights into complex data and aids healthcare professionals in efficiency. Information visualization leverages algorithms to reduce the complexity of high-dimensional heterogeneous data, thereby enhancing healthcare professionals’ understanding of the hidden associations among data structures. In the field of healthcare visualization, efforts have been made to refine and enhance the utility of data through diverse algorithms and visualization techniques. This review aims to summarize the existing research in this domain and identify future research directions. We searched Web of Science, Google Scholar and IEEE Xplore databases, and ultimately, 76 articles were included in our analysis. We collected and synthesized the research findings from these articles, with a focus on visualization, artificial intelligence and supporting tasks in healthcare. Our study revealed that researchers from diverse fields have employed a wide range of visualization techniques to visualize various types of data. We summarized these visualization methods and proposed recommendations for future research. We anticipate that our findings will promote further development and application of visualization techniques in healthcare.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"61 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141942698","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient minor defects detection on steel surface via res-attention and position encoding","authors":"Chuang Wu, Tingqin He","doi":"10.1007/s00371-024-03583-0","DOIUrl":"https://doi.org/10.1007/s00371-024-03583-0","url":null,"abstract":"<p>Impurities and complex manufacturing processes result in many minor, dense steel defects. This situation requires precise defect detection models for effective protection. The single-stage model (based on YOLO) is a popular choice among current models, renowned for its computational efficiency and suitability for real-time online applications. However, existing YOLO-based models often fail to detect small features. To address this issue, we introduce an efficient steel surface defect detection model in YOLOv7, incorporating a feature preservation block (FPB) and location awareness feature pyramid network (LAFPN). The FPB uses shortcut connections that allow the upper layers to access detailed information directly, thus capturing minor defect features more effectively. Furthermore, LAFPN integrates coordinate data during the feature fusion phase, enhancing the detection of minor defects. We introduced a new loss function to identify and locate minor defects accurately. Extensive testing on two public datasets has demonstrated the superior performance of our model compared to five baseline models, achieving an impressive 80.8 mAP on the NEU-DET dataset and 72.6 mAP on the GC10-DET dataset.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"73 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141942704","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}