Title: Multi-person 3D pose estimation from unlabelled data
Authors: Daniel Rodriguez-Criado, Pilar Bachiller-Burgos, George Vogiatzis, Luis J. Manso
Machine Vision and Applications, 2024-04-06. DOI: 10.1007/s00138-024-01530-6

Abstract: Its numerous applications make multi-human 3D pose estimation a remarkably impactful area of research. Nevertheless, it presents several challenges, especially when approached using multiple views and regular RGB cameras as the only input. First, each person must be uniquely identified in the different views. Secondly, the method must be robust to noise, partial occlusions, and views where a person may not be detected. Thirdly, many pose estimation approaches rely on environment-specific annotated datasets that are frequently prohibitively expensive and/or require specialised hardware. In this work, we address these three challenges with the help of self-supervised learning; specifically, this is the first multi-camera, multi-person data-driven approach that does not require an annotated dataset. We present a three-stage pipeline and a rigorous evaluation providing evidence that our approach runs faster than other state-of-the-art algorithms, with comparable accuracy, and, most importantly, does not require annotated datasets. The pipeline is composed of a 2D skeleton detection step, followed by a graph neural network that estimates cross-view correspondences of the people in the scene, and a multi-layer perceptron that transforms the 2D information into 3D pose estimations. Our proposal comprises the last two steps and is compatible with any 2D skeleton detector as input. These two models are trained in a self-supervised manner, thus avoiding the need for datasets annotated with 3D ground-truth poses.
{"title":"USIR-Net: sand-dust image restoration based on unsupervised learning","authors":"Yuan Ding, Kaijun Wu","doi":"10.1007/s00138-024-01528-0","DOIUrl":"https://doi.org/10.1007/s00138-024-01528-0","url":null,"abstract":"<p>In sand-dust weather, the influence of sand-dust particles on imaging equipment often results in images with color deviation, blurring, and low contrast, among other issues. These problems making many traditional image restoration methods unable to accurately estimate the semantic information of the images and consequently resulting in poor restoration of clear images. Most current image restoration methods in the field of deep learning are based on supervised learning, which requires pairing and labeling a large amount of data, and the possibility of manual annotation errors. In light of this, we propose an unsupervised sand-dust image restoration network. The overall model adopts an improved CycleGAN to fit unpaired sand-dust images. Firstly, multiscale skip connections in the multiscale cascaded attention module are used to enhance the feature fusion effect after downsampling. Secondly, multi-head convolutional attention with multiple input concatenations is employed, with each head using different kernel sizes to improve the ability to restore detail information. Finally, the adaptive decoder-encoder module is used to achieve adaptive fitting of the model and output the restored image. According to the experiments conducted on the dataset, the qualitative and quantitative indicators of USIR-Net are superior to the selected comparison algorithms, furthermore, in additional experiments conducted on haze removal and underwater image enhancement, we have demonstrated the wide applicability of our model.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"94 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140587365","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Ssman: self-supervised masked adaptive network for 3D human pose estimation","authors":"","doi":"10.1007/s00138-024-01514-6","DOIUrl":"https://doi.org/10.1007/s00138-024-01514-6","url":null,"abstract":"<h3>Abstract</h3> <p>The modern deep learning-based models for 3D human pose estimation from monocular images always lack the adaption ability between occlusion and non-occlusion scenarios, which might restrict the performance of current methods when faced with various scales of occluded conditions. In an attempt to tackle this problem, we propose a novel network called self-supervised masked adaptive network (SSMAN). Firstly, we leverage different levels of masks to cover the richness of occlusion in fully in-the-wild environment. Then, we design a multi-line adaptive network, which could be trained with various scales of masked images in parallel. Based on this masked adaptive network, we train it with self-supervised learning to enforce the consistency across the outputs under different mask ratios. Furthermore, a global refinement module is proposed to leverage global features of the human body to refine the human pose estimated solely by local features. We perform extensive experiments both on the occlusion datasets like 3DPW-OCC and OCHuman and general datasets such as Human3.6M and 3DPW. The results show that SSMAN achieves new state-of-the-art performance on both lightly and heavily occluded benchmarks and is highly competitive with significant improvement on standard benchmarks.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"6 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140316482","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Kernel based local matching network for video object segmentation
Authors: Guoqiang Wang, Lan Li, Min Zhu, Rui Zhao, Xiang Zhang
Machine Vision and Applications, 2024-03-25. DOI: 10.1007/s00138-024-01524-4

Abstract: Recently, methods based on space-time memory networks have achieved advanced performance in semi-supervised video object segmentation and attracted wide attention. However, these methods still have a critical limitation: the non-local matching they rely on is easily distracted by similar objects, which seriously limits segmentation performance. To solve this problem, we propose a Kernel-guided Attention Matching Network (KAMNet) that uses local matching instead of non-local matching. First, KAMNet uses a spatio-temporal attention mechanism to enhance the model's discrimination between foreground objects and background areas. Then KAMNet utilizes a Gaussian kernel to guide the matching between the current frame and the reference set. Because the Gaussian kernel decays away from its center, it limits the matching to the central region, thus achieving local matching. KAMNet achieves a good speed-accuracy trade-off on the benchmark datasets DAVIS 2016 (J&F of 87.6%) and DAVIS 2017 (J&F of 76.0%) at 0.12 seconds per frame.
Title: Addressing the generalization of 3D registration methods with a featureless baseline and an unbiased benchmark
Authors: David Bojanić, Kristijan Bartol, Josep Forest, Tomislav Petković, Tomislav Pribanić
Machine Vision and Applications, 2024-03-23. DOI: 10.1007/s00138-024-01510-w

Abstract: Recent 3D registration methods are mostly learning-based and either find correspondences in feature space and match them, or directly estimate the registration transformation from the given point cloud features. These feature-based methods therefore have difficulties generalizing to point clouds that differ substantially from their training data. The issue is not readily apparent because current benchmark definitions cannot provide in-depth analysis and are biased toward similar data. We therefore propose a methodology to create a 3D registration benchmark from a given point cloud dataset that provides a more informative evaluation of a method than existing benchmarks. Using this methodology, we create a novel FAUST-partial (FP) benchmark, based on the FAUST dataset, with several difficulty levels. The FP benchmark addresses the limitations of current benchmarks, namely the lack of data and parameter-range variability, and allows the strengths and weaknesses of a 3D registration method to be evaluated with respect to a single registration parameter. Using the new FP benchmark, we provide a thorough analysis of the current state-of-the-art methods and observe that they still struggle to generalize to severely different out-of-sample data. We therefore propose a simple, featureless, traditional 3D registration baseline based on the weighted cross-correlation between two given point clouds. Our method achieves strong results on current benchmark datasets, outperforming most deep learning methods. Our source code is available at github.com/DavidBoja/exhaustive-grid-search.
Title: AFMCT: adaptive fusion module based on cross-modal transformer block for 3D object detection
Authors: Bingli Zhang, Yixin Wang, Chengbiao Zhang, Junzhao Jiang, Zehao Pan, Jin Cheng, Yangyang Zhang, Xinyu Wang, Chenglei Yang, Yanhui Wang
Machine Vision and Applications, 2024-03-23. DOI: 10.1007/s00138-024-01509-3

Abstract: Lidar and camera are essential sensors for environment perception in autonomous driving. However, fully fusing heterogeneous data from multiple sources remains a non-trivial challenge. As a result, 3D object detection based on multi-modal sensor fusion is often inferior to single-modal methods based only on Lidar, which indicates that multi-sensor machine vision still needs development. In this paper, we propose an adaptive fusion module based on a cross-modal transformer block (AFMCT) for 3D object detection, utilizing a bidirectional enhancing strategy. Specifically, we first enhance the image features by extracting attention-based point features with a cross-modal transformer block and linking them in a concatenation fashion; a second cross-modal transformer block then acts on the enhanced image features to strengthen the point features with image semantic information. Extensive experiments on the 3D detection benchmark of the KITTI dataset reveal that the proposed structure can significantly improve the detection accuracy of Lidar-only methods and outperforms existing advanced multi-sensor fusion modules by at least 0.45%, indicating that our method could be a feasible way to improve 3D object detection based on multi-sensor fusion.
Title: Hyperspectral image dynamic range reconstruction using deep neural network-based denoising methods
Authors: Loran Cheplanov, Shai Avidan, David J. Bonfil, Iftach Klapp
Machine Vision and Applications, 2024-03-22. DOI: 10.1007/s00138-024-01523-5

Abstract: Hyperspectral (HS) measurement is among the most useful tools in agriculture for early disease detection. However, the cost of HS cameras that can perform the desired detection tasks is prohibitive: typically fifty thousand to hundreds of thousands of dollars. In a previous study at the Agricultural Research Organization's Volcani Institute (Israel), a low-cost, high-performing HS system was developed that included a point spectrometer and optical components. Its main disadvantage was the long shooting time for each image, which strongly depends on the predetermined integration time of the point spectrometer. While shortening the integration time from a typical value of around 200 ms to the 10 ms range is essential for performing monitoring tasks in a reasonable time, it deteriorates the dynamic range of the captured scene. In this work, we suggest correcting this by learning the transformation from data measured with a short integration time to data measured with a long integration time. The reduced dynamic range and consequent low SNR were successfully overcome using three deep neural network models based on a denoising auto-encoder, DnCNN, and LambdaNetworks architectures as backbones. The best model was based on DnCNN, using a combined loss function of ℓ2 and Kullback-Leibler divergence on images with 20 consecutive channels. Over the full spectrum, the model achieved a mean PSNR of 30.61 and a mean SSIM of 0.9, improving on the 10 ms measurements' mean PSNR and mean SSIM by 60.43% and 94.51%, respectively.
Title: Point cloud registration with quantile assignment
Authors: Ecenur Oğuz, Yalım Doğan, Uğur Güdükbay, Oya Karaşan, Mustafa Pınar
Machine Vision and Applications, 2024-03-19. DOI: 10.1007/s00138-024-01517-3

Abstract: Point cloud registration is a fundamental problem in computer vision. The problem encompasses critical tasks such as feature estimation, correspondence matching, and transformation estimation. The point cloud registration problem can be cast as a quantile matching problem. We refined the quantile assignment algorithm by integrating prevalent feature descriptors and transformation estimation methods to enhance the correspondence between the source and target point clouds. We evaluated the performances of these descriptors and methods with our approach through controlled experiments on a dataset we constructed using well-known 3D models. This systematic investigation led us to identify the most suitable methods for complementing our approach. Subsequently, we devised a new end-to-end, coarse-to-fine pairwise point cloud registration framework. Finally, we tested our framework on indoor and outdoor benchmark datasets and compared our results with state-of-the-art point cloud registration methods.
Title: An image quality assessment method based on edge extraction and singular value for blurriness
Authors: Lei Zhou, Chuanlin Liu, Amit Yadav, Sami Azam, Asif Karim
Machine Vision and Applications, 2024-03-19. DOI: 10.1007/s00138-024-01522-6

Abstract: The automatic assessment of perceived image quality is crucial in the field of image processing. To this end, we propose an image quality assessment (IQA) method for blurriness. Instead of the single feature used in traditional IQA algorithms, the method extracts both gradient and singular-value features. Because existing public image quality assessment datasets are too small to support deep learning, machine learning was introduced to fuse features from multiple domains, and a new no-reference (NR) IQA method for blurriness, denoted feature-fusion IQA (Ffu-IQA), was proposed. Ffu-IQA uses a probabilistic model to estimate the probability of blur at each detected edge in the image and then uses machine learning to aggregate the probability information into an edge quality score. It then computes a singular-value score from the singular values obtained by singular value decomposition of the image matrix. Finally, machine learning pooling is used to obtain the overall quality score. Ffu-IQA achieves PLCC scores of 0.9570 and 0.9616 on CSIQ and TID2013, respectively, and SROCC scores of 0.9380 and 0.9531, which are better than most traditional IQA methods for blurriness.
{"title":"Temporal teacher with masked transformers for semi-supervised action proposal generation","authors":"Selen Pehlivan, Jorma Laaksonen","doi":"10.1007/s00138-024-01521-7","DOIUrl":"https://doi.org/10.1007/s00138-024-01521-7","url":null,"abstract":"<p>By conditioning on unit-level predictions, anchor-free models for action proposal generation have displayed impressive capabilities, such as having a lightweight architecture. However, task performance depends significantly on the quality of data used in training, and most effective models have relied on human-annotated data. Semi-supervised learning, i.e., jointly training deep neural networks with a labeled dataset as well as an unlabeled dataset, has made significant progress recently. Existing works have either primarily focused on classification tasks, which may require less annotation effort, or considered anchor-based detection models. Inspired by recent advances in semi-supervised methods on anchor-free object detectors, we propose a teacher-student framework for a two-stage action detection pipeline, named Temporal Teacher with Masked Transformers (TTMT), to generate high-quality action proposals based on an anchor-free transformer model. Leveraging consistency learning as one self-training technique, the model jointly trains an anchor-free student model and a gradually progressing teacher counterpart in a mutually beneficial manner. As the core model, we design a Transformer-based anchor-free model to improve effectiveness for temporal evaluation. We integrate bi-directional masks and devise encoder-only Masked Transformers for sequences. Jointly training on boundary locations and various local snippet-based features, our model predicts via the proposed scoring function for generating proposal candidates. Experiments on the THUMOS14 and ActivityNet-1.3 benchmarks demonstrate the effectiveness of our model for temporal proposal generation task.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"67 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2024-03-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140150877","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}