{"title":"Temporal superimposed crossover module for effective continuous sign language","authors":"Qidan Zhu, Jing Li, Fei Yuan, Quan Gan","doi":"10.1007/s00138-024-01595-3","DOIUrl":"https://doi.org/10.1007/s00138-024-01595-3","url":null,"abstract":"<p>The ultimate goal of continuous sign language recognition is to facilitate communication between special populations and normal people, which places high demands on the real-time and deployable nature of the model. However, researchers have paid little attention to these two properties in previous studies on CSLR. In this paper, we propose a novel CSLR model ResNetT based on temporal superposition crossover module and ResNet, which replaces the parameterized computation with shifts in the temporal dimension and efficiently extracts temporal features without increasing the number of parameters and computation. The ResNetT is able to improve the real-time performance and deployability of the model while ensuring its accuracy. The core is our proposed zero-parameter and zero-computation module TSCM, and we combine TSCM with 2D convolution to form \"TSCM+2D\" hybrid convolution, which provides powerful spatial-temporal modeling capability, zero-parameter increase, and lower deployment cost compared with other spatial-temporal convolutions. Further, we apply \"TSCM+2D\" to ResBlock to form the new ResBlockT, which is the basis of the novel CSLR model ResNetT. We introduce stochastic gradient stops and multilevel connected temporal classification (CTC) loss to train this model, which reduces training memory usage while decreasing the final recognized word error rate (WER) and extends the ResNet network from image classification tasks to video recognition tasks. In addition, this study is the first in the field of CSLR to use only 2D convolution to extract spatial-temporal features of sign language videos for end-to-end recognition learning. Experiments on two large-scale continuous sign language datasets demonstrate the efficiency of the method.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"9 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142189975","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dyna-MSDepth: multi-scale self-supervised monocular depth estimation network for visual SLAM in dynamic scenes","authors":"Jianjun Yao, Yingzhao Li, Jiajia Li","doi":"10.1007/s00138-024-01586-4","DOIUrl":"https://doi.org/10.1007/s00138-024-01586-4","url":null,"abstract":"<p>Monocular Simultaneous Localization And Mapping (SLAM) suffers from scale drift, leading to tracking failure due to scale ambiguity. Deep learning has significantly advanced self-supervised monocular depth estimation, enabling scale drift reduction. Nonetheless, current self-supervised learning approaches fail to provide scale-consistent depth maps, estimate depth in dynamic environments, or perceive multi-scale information. In response to these limitations, this paper proposes Dyna-MSDepth, a novel method for estimating multi-scale, stable, and reliable depth maps in dynamic environments. Dyna-MSDepth incorporates multi-scale high-order spatial semantic interaction into self-supervised training. This integration enhances the model’s capacity to discern intricate texture nuances and distant depth cues. Dyna-MSDepth is evaluated on challenging dynamic datasets, including KITTI, TUM, BONN, and DDAD, employing rigorous qualitative evaluations and quantitative experiments. Furthermore, the accuracy of the depth maps estimated by Dyna-MSDepth is assessed in monocular SLAM. Extensive experiments confirm the superior multi-scale depth estimation capabilities of Dyna-MSDepth, highlighting its significant value in dynamic environments. Code is available at https://github.com/Pepper-FlavoredChewingGum/Dyna-MSDepth.\u0000</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"42 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142189996","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cmf-transformer: cross-modal fusion transformer for human action recognition","authors":"Jun Wang, Limin Xia, Xin Wen","doi":"10.1007/s00138-024-01598-0","DOIUrl":"https://doi.org/10.1007/s00138-024-01598-0","url":null,"abstract":"<p>In human action recognition, both spatio-temporal videos and skeleton features alone can achieve good recognition performance, however, how to combine these two modalities to achieve better performance is still a worthy research direction. In order to better combine the two modalities, we propose a novel Cross-Modal Transformer for human action recognition—CMF-Transformer, which effectively fuses two different modalities. In spatio-temporal modality, video frames are used as inputs and directional attention is used in the transformer to obtain the order of recognition between different spatio-temporal blocks. In skeleton joint modality, skeleton joints are used as inputs to explore more complete correlations in different skeleton joints by spatio-temporal cross-attention in the transformer. Subsequently, a multimodal collaborative recognition strategy is used to identify the respective features and connectivity features of two modalities separately, and then weight the identification results separately to synergistically identify target action by fusing the features under the two modalities. A series of experiments on three benchmark datasets demonstrate that the performance of CMF-Transformer in this paper outperforms most current state-of-the-art methods.\u0000</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"1 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142189978","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An efficient driving behavior prediction approach using physiological auxiliary and adaptive LSTM","authors":"Jun Gao, Jiangang Yi, Yi Lu Murphey","doi":"10.1007/s00138-024-01600-9","DOIUrl":"https://doi.org/10.1007/s00138-024-01600-9","url":null,"abstract":"<p>Driving behavior prediction is crucial in designing a modern Advanced driver assistance system (ADAS). Such predictions can improve driving safety by alerting the driver to the danger of unsafe or risky traffic situations. In this research, an efficient approach, Driver behavior network (DBNet) is proposed for driving behavior prediction using multiple modality data, <i>i.e.</i> front view video frames and driver physiological signals. Firstly, a Relation-guided spatial attention (RGSA) module is adopted to generate driving scene-centric features by modeling both local and global information from video frames. Secondly, a new Global shrinkage (GS) block is designed to incorporate soft thresholding as nonlinear transformation layer to generate physiological features and eliminate noise-related information from physiological signals. Finally, a customized Adaptive focal loss based Long short term memory (AFL-LSTM) network is introduced to learn the multi-modal features and capture the dependencies within driving behaviors simultaneously. We applied our approach on real data collected during drives in both urban and freeway environment in an instrumented vehicle. The experimental findings demonstrate that the DBNet can predict the upcoming driving behavior efficiently and significantly outperform other state-of-the-art models.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"42 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142190000","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Robust visual-based method and new datasets for ego-lane index estimation in urban environment","authors":"Dianzheng Wang, Dongyi Liang, Shaomiao Li","doi":"10.1007/s00138-024-01590-8","DOIUrl":"https://doi.org/10.1007/s00138-024-01590-8","url":null,"abstract":"<p>Correct and robust ego-lane index estimation is crucial for autonomous driving in the absence of high-definition maps, especially in urban environments. Previous ego-lane index estimation approaches rely on feature extraction, which limits the robustness. To overcome these shortages, this study proposes a robust ego-lane index estimation framework upon only the original visual image. After optimization of the processing route, the raw image was randomly cropped in the height direction and then input into a double supervised LaneLoc network to obtain the index estimations and confidences. A post-process was also proposed to achieve the global ego-lane index from the estimated left and right indexes with the total lane number. To evaluate our proposed method, we manually annotated the ego-lane index of public datasets which can work as an ego-lane index estimation baseline for the first time. The proposed algorithm achieved 96.48/95.40% (precision/recall) on the CULane dataset and 99.45/99.49% (precision/recall) on the TuSimple dataset, demonstrating the effectiveness and efficiency of lane localization in diverse driving environments. The code and dataset annotation results will be exposed publicly on https://github.com/haomo-ai/LaneLoc.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"34 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142190051","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MFFAE-Net: semantic segmentation of point clouds using multi-scale feature fusion and attention enhancement networks","authors":"Wei Liu, Yisheng Lu, Tao Zhang","doi":"10.1007/s00138-024-01589-1","DOIUrl":"https://doi.org/10.1007/s00138-024-01589-1","url":null,"abstract":"<p>Point cloud data can reflect more information about the real 3D space, which has gained increasing attention in computer vision field. But the unstructured and unordered nature of point clouds poses many challenges in their study. How to learn the global features of the point cloud in the original point cloud is a problem that has been accompanied by the research. In the research based on the structure of the encoder and decoder, many researchers focus on designing the encoder to better extract features, and do not further explore more globally representative features according to the features of the encoder and decoder. To solve this problem, we propose the MFFAE-Net method, which aims to obtain more globally representative point cloud features by using the feature learning of encoder decoder stage.Our method first enhances the feature information of the input point cloud by merging the information of its neighboring points, which is helpful for the following point cloud feature extraction work. Secondly, the channel attention module is used to further process the extracted features, so as to highlight the role of important channels in the features. Finally, we fuse features of different scales from encoding features and decoding features as well as features of the same scale, so as to obtain more global point cloud features, which will help improve the segmentation results of point clouds. Experimental results show that the method performs well on some objects in S3DIS dataset and Toronto3d dataset.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"8 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142189999","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adversarial imitation learning-based network for category-level 6D object pose estimation","authors":"Shantong Sun, Xu Bao, Aryan Kaushik","doi":"10.1007/s00138-024-01592-6","DOIUrl":"https://doi.org/10.1007/s00138-024-01592-6","url":null,"abstract":"<p>Category-level 6D object pose estimation is a very fundamental and key research in computer vision. In order to get rid of the dependence on the object 3D models, analysis-by-synthesis object pose estimation methods have recently been widely studied. While these methods have certain improvements in generalization, the accuracy of category-level object pose estimation still needs to be improved. In this paper, we propose a category-level 6D object pose estimation network based on adversarial imitation learning, named AIL-Net. AIL-Net adopts the state-action distribution matching criterion and is able to perform expert actions that have not appeared in the dataset. This prevents the object pose estimation from falling into a bad state. We further design a framework for estimating object pose through generative adversarial imitation learning. This method is able to distinguish between expert policy and imitation policy in AIL-Net. Experimental results show that our approach achieves competitive category-level object pose estimation performance on REAL275 dataset and Cars dataset.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"9 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142189997","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Active perception based on deep reinforcement learning for autonomous robotic damage inspection","authors":"Wen Tang, Mohammad R. Jahanshahi","doi":"10.1007/s00138-024-01591-7","DOIUrl":"https://doi.org/10.1007/s00138-024-01591-7","url":null,"abstract":"<p>In this study, an artificial intelligence framework is developed to facilitate the use of robotics for autonomous damage inspection. While considerable progress has been achieved by utilizing state-of-the-art computer vision approaches for damage detection, these approaches are still far away from being used for autonomous robotic inspection systems due to the uncertainties in data collection and data interpretation. To address this gap, this study proposes a framework that will enable robots to select the best course of action for active damage perception and reduction of uncertainties. By doing so, the required information is collected efficiently for a better understanding of damage severity which leads to reliable decision-making. More specifically, the active damage perception task is formulated as a Partially Observable Markov Decision Process, and a deep reinforcement learning-based active perception agent is proposed to learn the near-optimal policy for this task. The proposed framework is evaluated for the autonomous assessment of cracks on metallic surfaces of an underwater nuclear reactor. Active perception exhibits a notable enhancement in the crack Intersection over Union (IoU) performance, yielding an increase of up to 69% when compared to its raster scanning counterpart given a similar inspection time. Additionally, the proposed method can perform a rapid inspection that reduces the overall inspection time by more than two times while achieving a 15% higher crack IoU than that of the dense raster scanning approach.\u0000</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"96 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142189998","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An efficient ground segmentation approach for LiDAR point cloud utilizing adjacent grids","authors":"Longyu Dong, Dejun Liu, Youqiang Dong, Bongrae Park, Zhibo Wan","doi":"10.1007/s00138-024-01593-5","DOIUrl":"https://doi.org/10.1007/s00138-024-01593-5","url":null,"abstract":"<p>Ground segmentation is crucial for guiding mobile robots and identifying nearby objects. However, it should be noted that the ground often presents complex topographical features, such as slopes and rugged terrains, which significantly increase the challenges associated with accurate ground segmentation tasks. To address this issue, we propose a novel approach to achieve rapid ground segmentation. The proposed method uses a multi-partition approach to extract ground points for each partition, followed by assessing the correction plane based on geometric characteristics of the ground surface and similarity among adjacent planes. An adaptive threshold is also introduced to enhance efficiency in extracting complex urban pavement. Our method was benchmarked against several contemporary techniques on the SemanticKITTI dataset. The precision was elevated by 1.72<span>(%)</span>, and the precision deviation was diminished by 1.02<span>(%)</span>, culminating in the most accurate and robust outcomes among the evaluated methods.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"142 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141929833","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Boundary enhancement and refinement network for camouflaged object detection","authors":"Chenxing Xia, Huizhen Cao, Xiuju Gao, Bin Ge, Kuan-Ching Li, Xianjin Fang, Yan Zhang, Xingzhu Liang","doi":"10.1007/s00138-024-01588-2","DOIUrl":"https://doi.org/10.1007/s00138-024-01588-2","url":null,"abstract":"<p>Camouflaged object detection aims to locate and segment objects accurately that conceal themselves well in the environment. Despite the advancements in deep learning methods, prevalent issues persist, including coarse boundary identification in complex scenes and the ineffective integration of multi-source features. To this end, we propose a novel boundary enhancement and refinement network named BERNet, which mainly consists of three modules for enhancing and refining boundary information: an asymmetric edge module (AEM) with multi-groups dilated convolution block (GDCB), a residual mixed pooling enhanced module (RPEM), and a multivariate information interaction refiner module (M2IRM). AEM with GDCB is designed to obtain rich boundary clues, where different dilation rates are used to expand the receptive field. RPEM is capable of enhancing boundary features under the guidance of boundary cues to improve the detection accuracy of small and multiple camouflaged objects. M2IRM is introduced to refine the side-out prediction maps progressively under the supervision of the ground truth by the fusion of multi-source information. Comprehensive experiments on three benchmark datasets demonstrate the effectiveness of our BERNet with competitive state-of-the-art methods under the most evaluation metrics.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"33 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-08-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141884135","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}