{"title":"Temporal superimposed crossover module for effective continuous sign language","authors":"Qidan Zhu, Jing Li, Fei Yuan, Quan Gan","doi":"10.1007/s00138-024-01595-3","DOIUrl":"https://doi.org/10.1007/s00138-024-01595-3","url":null,"abstract":"<p>The ultimate goal of continuous sign language recognition is to facilitate communication between special populations and normal people, which places high demands on the real-time and deployable nature of the model. However, researchers have paid little attention to these two properties in previous studies on CSLR. In this paper, we propose a novel CSLR model ResNetT based on temporal superposition crossover module and ResNet, which replaces the parameterized computation with shifts in the temporal dimension and efficiently extracts temporal features without increasing the number of parameters and computation. The ResNetT is able to improve the real-time performance and deployability of the model while ensuring its accuracy. The core is our proposed zero-parameter and zero-computation module TSCM, and we combine TSCM with 2D convolution to form \"TSCM+2D\" hybrid convolution, which provides powerful spatial-temporal modeling capability, zero-parameter increase, and lower deployment cost compared with other spatial-temporal convolutions. Further, we apply \"TSCM+2D\" to ResBlock to form the new ResBlockT, which is the basis of the novel CSLR model ResNetT. We introduce stochastic gradient stops and multilevel connected temporal classification (CTC) loss to train this model, which reduces training memory usage while decreasing the final recognized word error rate (WER) and extends the ResNet network from image classification tasks to video recognition tasks. In addition, this study is the first in the field of CSLR to use only 2D convolution to extract spatial-temporal features of sign language videos for end-to-end recognition learning. Experiments on two large-scale continuous sign language datasets demonstrate the efficiency of the method.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"9 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142189975","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dyna-MSDepth: multi-scale self-supervised monocular depth estimation network for visual SLAM in dynamic scenes","authors":"Jianjun Yao, Yingzhao Li, Jiajia Li","doi":"10.1007/s00138-024-01586-4","DOIUrl":"https://doi.org/10.1007/s00138-024-01586-4","url":null,"abstract":"<p>Monocular Simultaneous Localization And Mapping (SLAM) suffers from scale drift, leading to tracking failure due to scale ambiguity. Deep learning has significantly advanced self-supervised monocular depth estimation, enabling scale drift reduction. Nonetheless, current self-supervised learning approaches fail to provide scale-consistent depth maps, estimate depth in dynamic environments, or perceive multi-scale information. In response to these limitations, this paper proposes Dyna-MSDepth, a novel method for estimating multi-scale, stable, and reliable depth maps in dynamic environments. Dyna-MSDepth incorporates multi-scale high-order spatial semantic interaction into self-supervised training. This integration enhances the model’s capacity to discern intricate texture nuances and distant depth cues. Dyna-MSDepth is evaluated on challenging dynamic datasets, including KITTI, TUM, BONN, and DDAD, employing rigorous qualitative evaluations and quantitative experiments. Furthermore, the accuracy of the depth maps estimated by Dyna-MSDepth is assessed in monocular SLAM. Extensive experiments confirm the superior multi-scale depth estimation capabilities of Dyna-MSDepth, highlighting its significant value in dynamic environments. Code is available at https://github.com/Pepper-FlavoredChewingGum/Dyna-MSDepth.\u0000</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"42 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142189996","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cmf-transformer: cross-modal fusion transformer for human action recognition","authors":"Jun Wang, Limin Xia, Xin Wen","doi":"10.1007/s00138-024-01598-0","DOIUrl":"https://doi.org/10.1007/s00138-024-01598-0","url":null,"abstract":"<p>In human action recognition, both spatio-temporal videos and skeleton features alone can achieve good recognition performance, however, how to combine these two modalities to achieve better performance is still a worthy research direction. In order to better combine the two modalities, we propose a novel Cross-Modal Transformer for human action recognition—CMF-Transformer, which effectively fuses two different modalities. In spatio-temporal modality, video frames are used as inputs and directional attention is used in the transformer to obtain the order of recognition between different spatio-temporal blocks. In skeleton joint modality, skeleton joints are used as inputs to explore more complete correlations in different skeleton joints by spatio-temporal cross-attention in the transformer. Subsequently, a multimodal collaborative recognition strategy is used to identify the respective features and connectivity features of two modalities separately, and then weight the identification results separately to synergistically identify target action by fusing the features under the two modalities. A series of experiments on three benchmark datasets demonstrate that the performance of CMF-Transformer in this paper outperforms most current state-of-the-art methods.\u0000</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"1 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142189978","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An efficient driving behavior prediction approach using physiological auxiliary and adaptive LSTM","authors":"Jun Gao, Jiangang Yi, Yi Lu Murphey","doi":"10.1007/s00138-024-01600-9","DOIUrl":"https://doi.org/10.1007/s00138-024-01600-9","url":null,"abstract":"<p>Driving behavior prediction is crucial in designing a modern Advanced driver assistance system (ADAS). Such predictions can improve driving safety by alerting the driver to the danger of unsafe or risky traffic situations. In this research, an efficient approach, Driver behavior network (DBNet) is proposed for driving behavior prediction using multiple modality data, <i>i.e.</i> front view video frames and driver physiological signals. Firstly, a Relation-guided spatial attention (RGSA) module is adopted to generate driving scene-centric features by modeling both local and global information from video frames. Secondly, a new Global shrinkage (GS) block is designed to incorporate soft thresholding as nonlinear transformation layer to generate physiological features and eliminate noise-related information from physiological signals. Finally, a customized Adaptive focal loss based Long short term memory (AFL-LSTM) network is introduced to learn the multi-modal features and capture the dependencies within driving behaviors simultaneously. We applied our approach on real data collected during drives in both urban and freeway environment in an instrumented vehicle. The experimental findings demonstrate that the DBNet can predict the upcoming driving behavior efficiently and significantly outperform other state-of-the-art models.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"42 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142190000","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Robust visual-based method and new datasets for ego-lane index estimation in urban environment","authors":"Dianzheng Wang, Dongyi Liang, Shaomiao Li","doi":"10.1007/s00138-024-01590-8","DOIUrl":"https://doi.org/10.1007/s00138-024-01590-8","url":null,"abstract":"<p>Correct and robust ego-lane index estimation is crucial for autonomous driving in the absence of high-definition maps, especially in urban environments. Previous ego-lane index estimation approaches rely on feature extraction, which limits the robustness. To overcome these shortages, this study proposes a robust ego-lane index estimation framework upon only the original visual image. After optimization of the processing route, the raw image was randomly cropped in the height direction and then input into a double supervised LaneLoc network to obtain the index estimations and confidences. A post-process was also proposed to achieve the global ego-lane index from the estimated left and right indexes with the total lane number. To evaluate our proposed method, we manually annotated the ego-lane index of public datasets which can work as an ego-lane index estimation baseline for the first time. The proposed algorithm achieved 96.48/95.40% (precision/recall) on the CULane dataset and 99.45/99.49% (precision/recall) on the TuSimple dataset, demonstrating the effectiveness and efficiency of lane localization in diverse driving environments. The code and dataset annotation results will be exposed publicly on https://github.com/haomo-ai/LaneLoc.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"34 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142190051","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MFFAE-Net: semantic segmentation of point clouds using multi-scale feature fusion and attention enhancement networks","authors":"Wei Liu, Yisheng Lu, Tao Zhang","doi":"10.1007/s00138-024-01589-1","DOIUrl":"https://doi.org/10.1007/s00138-024-01589-1","url":null,"abstract":"<p>Point cloud data can reflect more information about the real 3D space, which has gained increasing attention in computer vision field. But the unstructured and unordered nature of point clouds poses many challenges in their study. How to learn the global features of the point cloud in the original point cloud is a problem that has been accompanied by the research. In the research based on the structure of the encoder and decoder, many researchers focus on designing the encoder to better extract features, and do not further explore more globally representative features according to the features of the encoder and decoder. To solve this problem, we propose the MFFAE-Net method, which aims to obtain more globally representative point cloud features by using the feature learning of encoder decoder stage.Our method first enhances the feature information of the input point cloud by merging the information of its neighboring points, which is helpful for the following point cloud feature extraction work. Secondly, the channel attention module is used to further process the extracted features, so as to highlight the role of important channels in the features. Finally, we fuse features of different scales from encoding features and decoding features as well as features of the same scale, so as to obtain more global point cloud features, which will help improve the segmentation results of point clouds. Experimental results show that the method performs well on some objects in S3DIS dataset and Toronto3d dataset.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"8 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142189999","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adversarial imitation learning-based network for category-level 6D object pose estimation","authors":"Shantong Sun, Xu Bao, Aryan Kaushik","doi":"10.1007/s00138-024-01592-6","DOIUrl":"https://doi.org/10.1007/s00138-024-01592-6","url":null,"abstract":"<p>Category-level 6D object pose estimation is a very fundamental and key research in computer vision. In order to get rid of the dependence on the object 3D models, analysis-by-synthesis object pose estimation methods have recently been widely studied. While these methods have certain improvements in generalization, the accuracy of category-level object pose estimation still needs to be improved. In this paper, we propose a category-level 6D object pose estimation network based on adversarial imitation learning, named AIL-Net. AIL-Net adopts the state-action distribution matching criterion and is able to perform expert actions that have not appeared in the dataset. This prevents the object pose estimation from falling into a bad state. We further design a framework for estimating object pose through generative adversarial imitation learning. This method is able to distinguish between expert policy and imitation policy in AIL-Net. Experimental results show that our approach achieves competitive category-level object pose estimation performance on REAL275 dataset and Cars dataset.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"9 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142189997","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Active perception based on deep reinforcement learning for autonomous robotic damage inspection","authors":"Wen Tang, Mohammad R. Jahanshahi","doi":"10.1007/s00138-024-01591-7","DOIUrl":"https://doi.org/10.1007/s00138-024-01591-7","url":null,"abstract":"<p>In this study, an artificial intelligence framework is developed to facilitate the use of robotics for autonomous damage inspection. While considerable progress has been achieved by utilizing state-of-the-art computer vision approaches for damage detection, these approaches are still far away from being used for autonomous robotic inspection systems due to the uncertainties in data collection and data interpretation. To address this gap, this study proposes a framework that will enable robots to select the best course of action for active damage perception and reduction of uncertainties. By doing so, the required information is collected efficiently for a better understanding of damage severity which leads to reliable decision-making. More specifically, the active damage perception task is formulated as a Partially Observable Markov Decision Process, and a deep reinforcement learning-based active perception agent is proposed to learn the near-optimal policy for this task. The proposed framework is evaluated for the autonomous assessment of cracks on metallic surfaces of an underwater nuclear reactor. Active perception exhibits a notable enhancement in the crack Intersection over Union (IoU) performance, yielding an increase of up to 69% when compared to its raster scanning counterpart given a similar inspection time. Additionally, the proposed method can perform a rapid inspection that reduces the overall inspection time by more than two times while achieving a 15% higher crack IoU than that of the dense raster scanning approach.\u0000</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"96 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142189998","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An efficient ground segmentation approach for LiDAR point cloud utilizing adjacent grids","authors":"Longyu Dong, Dejun Liu, Youqiang Dong, Bongrae Park, Zhibo Wan","doi":"10.1007/s00138-024-01593-5","DOIUrl":"https://doi.org/10.1007/s00138-024-01593-5","url":null,"abstract":"<p>Ground segmentation is crucial for guiding mobile robots and identifying nearby objects. However, it should be noted that the ground often presents complex topographical features, such as slopes and rugged terrains, which significantly increase the challenges associated with accurate ground segmentation tasks. To address this issue, we propose a novel approach to achieve rapid ground segmentation. The proposed method uses a multi-partition approach to extract ground points for each partition, followed by assessing the correction plane based on geometric characteristics of the ground surface and similarity among adjacent planes. An adaptive threshold is also introduced to enhance efficiency in extracting complex urban pavement. Our method was benchmarked against several contemporary techniques on the SemanticKITTI dataset. The precision was elevated by 1.72<span>(%)</span>, and the precision deviation was diminished by 1.02<span>(%)</span>, culminating in the most accurate and robust outcomes among the evaluated methods.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"142 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141929833","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Boundary enhancement and refinement network for camouflaged object detection","authors":"Chenxing Xia, Huizhen Cao, Xiuju Gao, Bin Ge, Kuan-Ching Li, Xianjin Fang, Yan Zhang, Xingzhu Liang","doi":"10.1007/s00138-024-01588-2","DOIUrl":"https://doi.org/10.1007/s00138-024-01588-2","url":null,"abstract":"<p>Camouflaged object detection aims to locate and segment objects accurately that conceal themselves well in the environment. Despite the advancements in deep learning methods, prevalent issues persist, including coarse boundary identification in complex scenes and the ineffective integration of multi-source features. To this end, we propose a novel boundary enhancement and refinement network named BERNet, which mainly consists of three modules for enhancing and refining boundary information: an asymmetric edge module (AEM) with multi-groups dilated convolution block (GDCB), a residual mixed pooling enhanced module (RPEM), and a multivariate information interaction refiner module (M2IRM). AEM with GDCB is designed to obtain rich boundary clues, where different dilation rates are used to expand the receptive field. RPEM is capable of enhancing boundary features under the guidance of boundary cues to improve the detection accuracy of small and multiple camouflaged objects. M2IRM is introduced to refine the side-out prediction maps progressively under the supervision of the ground truth by the fusion of multi-source information. Comprehensive experiments on three benchmark datasets demonstrate the effectiveness of our BERNet with competitive state-of-the-art methods under the most evaluation metrics.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"33 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-08-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141884135","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}