{"title":"RBS-YOLO: A Lightweight YOLOv5-Based Surface Defect Detection Model for Castings","authors":"KeZhu Wu, ShaoMing Sun, YiNing Sun, CunYi Wang, YiFan Wei","doi":"10.1049/ipr2.70018","DOIUrl":"https://doi.org/10.1049/ipr2.70018","url":null,"abstract":"<p>To ensure precise and rapid identification of casting surface defects and to support the subsequent realisation of high-precision grinding, this study introduces a method for detecting casting surface defects using a lightweight YOLOv5 framework. The enhanced model integrates the ShuffleNetV2 high-efficiency CNN architecture into the YOLOv5 foundation, substantially reducing network parameters to achieve a lightweight model. Additionally, the Convolutional Block Attention Module (CBAM) attention mechanism is incorporated to enhance the model's capability to detect defects. The ReLU activation function replaces the SiLU function in the convolutional layer, decreasing the computational load and boosting efficiency. Subsequently, the optimised model is quantised and implemented on the RV1126 embedded development board, successfully performing image inference. To validate the effectiveness of the proposed method, a dataset of casting surface defects was designed and constructed. The optimised model has a file size of 7.6 MB, representing 55.4% of the original model, with about 50.6% of the original model's parameters. The onboard inference speed of the improved model is 50 ms per image, which is 9.1% faster than the traditional YOLOv5 model. These results offer valuable insights for future casting surface defect detection technologies.</p>","PeriodicalId":56303,"journal":{"name":"IET Image Processing","volume":"19 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2025-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/ipr2.70018","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143475454","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DRSE-YOLO: Efficient and Lightweight Architecture for Accurate Waste Detection","authors":"Guangling Sun, Fenqi Zhang","doi":"10.1049/ipr2.70022","DOIUrl":"https://doi.org/10.1049/ipr2.70022","url":null,"abstract":"<p>This paper introduces DRSE-YOLO, an efficient waste detection model designed to address detection accuracy and lightweight design challenges. The RCCA module in the model's neck enhances multi-scale feature representation, thereby improving detection performance. The DySample module optimizes upsampling through adaptive point-sampling, reducing computational demands and improving resource efficiency. The Slim-Neck module is applied to select convolutional layers and C2f modules to streamline the model and enhance computational efficiency. The ECC-Head integrates asymmetric depth convolution, point convolution, and an attention mechanism, balancing accuracy with reduced parameters and computational load. Evaluated on a custom dataset comprising 46 waste classes and approximately 25,000 images, DRSE-YOLO achieves significant improvements over YOLOv8n, including a higher [email protected] (+1.59%) and [email protected]:95 (+2.08%), alongside a reduced parameter count (2.43 M vs. 3.2 M) and GFLOPs (5.8 vs. 8.2, a 24.4% reduction). These results underscore DRSE-YOLO's efficiency and accuracy.</p>","PeriodicalId":56303,"journal":{"name":"IET Image Processing","volume":"19 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2025-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/ipr2.70022","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143456065","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-Task Collaboration for Cross-Modal Generation and Multi-Modal Ophthalmic Diseases Diagnosis","authors":"Yang Yu, Hongqing Zhu, Tianwei Qian, Tong Hou, Bingcang Huang","doi":"10.1049/ipr2.70016","DOIUrl":"https://doi.org/10.1049/ipr2.70016","url":null,"abstract":"<p>Multi-modal diagnosis of ophthalmic disease is becoming increasingly important because combining multi-modal data allows for more accurate diagnosis. Color fundus photograph (CFP) and optical coherence tomography (OCT) are commonly used as two non-invasive modalities for ophthalmic examination. However, the diagnosis of each modality is not entirely accurate. Compounding the challenge is the difficulty in acquiring multi-modal data, with existing datasets frequently lacking paired multi-modal data. To solve these problems, we propose multi-modal distribution fusion diagnostic algorithm and cross-modal generation algorithm. The multi-modal distribution fusion diagnostic algorithm first calculates the mean and variance separately for each modality, and then generates multi-modal diagnostic results in a distribution fusion manner. In order to generate the absent modality (mainly OCT data), three sub-networks are designed in the cross-modal generation algorithm: cross-modal alignment network, conditional deformable autoencoder and latent consistency diffusion model (LCDM). Finally, we propose multi-task collaboration strategy where diagnosis and generation tasks are mutually reinforcing to achieve optimal performance. Experimental results demonstrate that our proposed method yield superior results compared to state-of-the-arts.</p>","PeriodicalId":56303,"journal":{"name":"IET Image Processing","volume":"19 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2025-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/ipr2.70016","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143424046","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"OMA-SSR: Optical-guided multi-kernel attention based SAR image super-resolution reconstruction network","authors":"Yanshan Li, Fan Xu","doi":"10.1049/ipr2.70008","DOIUrl":"https://doi.org/10.1049/ipr2.70008","url":null,"abstract":"<p>Synthetic aperture radar (SAR) has been widely studied and applied in many fields. Although image super-resolution technology has been successfully applied to SAR imaging in recent years, there is less research on large-scale factor SAR image super-resolution methods. A more effective method is to obtain comprehensive information to guide the reconstruction of SAR images. In fact, the co-registered characteristics of high-resolution optical images have been successfully applied to improve the quality of SAR images. Inspired by this, an optical-guided multi-kernel attention based SAR image super-resolution reconstruction network (OMA-SSR) is proposed. The proposed multi-modal mutual attention (MMA) module in this network can effectively establish the dependency between SAR image features and optical image features. This network also designs a deep feature extraction module for SAR images, which includes a channel-splitted multi-kernel attention (CSMA) module and residual connections. CSMA module splits SAR image channels, extracts features in different ranges through multi-kernel convolution, and finally fuses the extracted features between different channels. Experimental results on the Sen1-2 and QXS datasets show that the proposed OMA-SSR performs well in evaluation indicators and visual effects of SAR image super-resolution reconstruction.</p>","PeriodicalId":56303,"journal":{"name":"IET Image Processing","volume":"19 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2025-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/ipr2.70008","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143404356","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TLBP: Tomography-Aided Local Binary Patterns With High Discrimination for Image Classification","authors":"Yichen Liu, Xin Zhang, Yanan Jiang, Chunlei Zhang, Hanlin Feng","doi":"10.1049/ipr2.70015","DOIUrl":"https://doi.org/10.1049/ipr2.70015","url":null,"abstract":"<p>Local binary patterns (LBP) play a vital role in image classification as a computationally efficient feature descriptor. A crucial reason for its limitation of discriminability is the lack of neighbourhood information description from a global perspective. Previous research has attempted to improve its performance by introducing global thresholds, but such threshold selection is not optimal. To address this issue, we propose a novel tomography-aided local binary patterns (TLBP), inspired by the tomographic process of sample separation. TLBP considers constructing visual feature representations under multi-level non-local information to compensate for the lack of LBP possessing only a single shallow feature. In addition to the basic LBP features from local visual context, TLBP captures refined neighbourhood greyscale information through multi-quantile thresholds from a global visual perspective, thereby greatly enhancing discriminability. Experimental results in texture classification, face recognition, and hyperspectral pixel-wise classification demonstrate that the proposed TLBP descriptor outperforms the competitors, achieving 94.39% (KTH-TIPS), 81.22% (KTH-TIPS-ROT), 93.81% (Indian Pines), 99.85% (Salinas), and 99.50% (ORL) accuracy. Furthermore, the performance of the T-variants that apply the tomographic idea to classic LBP descriptors improve significantly, especially for their rotation-invariant versions.</p>","PeriodicalId":56303,"journal":{"name":"IET Image Processing","volume":"19 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2025-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/ipr2.70015","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143424045","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modified You Only Look Once Network Model for Enhanced Traffic Scene Detection Performance for Small Targets","authors":"Lei Shi, Shuai Ren, Xing Fan, Ke Wang, Shan Lin, Zhanwen Liu","doi":"10.1049/ipr2.70014","DOIUrl":"https://doi.org/10.1049/ipr2.70014","url":null,"abstract":"<p>In order to address the challenge of small target recognition in traffic scenes, we propose a model based on you only look once version 8X (Yolov8X) network model, which has been combined with receptive fields block (RFB) and multidimensional collaborative attention (MCA). First, the model employs the RFB to extract reliable and distinctive features, thereby enhancing the precision of small target identification. Furthermore, the MCA structure is introduced to simulate multidimensional attention through three parallel branches, thereby enhancing the feature expression ability of the model. This fragment describes a compression transformation and an excitation transformation that captures the differentiated feature representation of the command. These transformations facilitate the network's ability to locate and predict the location of small objects more accurately. Utilizing these transformations enhances the expressiveness and diversity of features, thereby improving the detection performance of small objects. Furthermore, data augmentation and hyperparameter optimization techniques are employed to enhance the model's generalisability. The validation results on the Argoverse 1.1 autonomous driving dataset demonstrate that the enhanced network model outperforms the prevailing detectors, achieving an F1 score of 78.6, an average precision of 55.1, and an average recall of 72.4. The algorithm's excellent performance for small target detection was demonstrated through visual analysis, proving its high application value and potential for promotion in fields such as autonomous driving.</p>","PeriodicalId":56303,"journal":{"name":"IET Image Processing","volume":"19 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2025-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/ipr2.70014","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143424044","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-cropping contrastive learning and domain consistency for unsupervised image-to-image translation","authors":"Chen Zhao, Wei-Ling Cai, Zheng Yuan, Cheng-Wei Hu","doi":"10.1049/ipr2.70006","DOIUrl":"https://doi.org/10.1049/ipr2.70006","url":null,"abstract":"<p>Recently, unsupervised image-to-image (i2i) translation methods based on contrastive learning have achieved state-of-the-art results. However, in previous works, the negatives are sampled from the input image itself, which inspires us to design a data augmentation method to improve the quality of the selected negatives. Moreover, the previous methods only preserve the content consistency via patch-wise contrastive learning, which ignores the domain consistency between the generated images and the real images of the target domain. This paper proposes a novel unsupervised i2i translation framework based on multi-cropping contrastive learning and domain consistency, called MCDUT. Specifically, the multi-cropping views are obtained with the aim of further generating high-quality negative examples. To constrain the embeddings in the deep feature space, a new domain consistency loss is formulated, which encourages the generated images to be close to the real images. In many i2i translation tasks, this method achieves state-of-the-art results, and the advantages of this method have been proven through extensive comparison experiments and ablation research. The code of MCDUT is available at https://github.com/zhihefang/MCDUT.</p>","PeriodicalId":56303,"journal":{"name":"IET Image Processing","volume":"19 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2025-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/ipr2.70006","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143404452","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GSA-Net: Global Spatial Structure-Aware Attention Network for Liver Segmentation in MR Images With Respiratory Artifacts","authors":"Jiahuan Jiang, Dongsheng Zhou, Muzhen He, Xiaohan Yue, Shu Zhang","doi":"10.1049/ipr2.70010","DOIUrl":"https://doi.org/10.1049/ipr2.70010","url":null,"abstract":"<p>Automatic liver segmentation is of great significance for computer-aided treatment and surgery of liver diseases. However, respiratory motion often affects the liver, leading to image artifacts in liver magnetic resonance imaging (MRI) and increasing segmentation difficulty. To overcome this issue, we propose a global spatial structure-aware attention model (GSA-Net), a robust segmentation network developed to overcome the difficulties caused by respiratory motion. The GSA-Net is an encoder-decoder architecture, which extracts spatial structure information from images and identifies different objects using the minimum spanning tree algorithm. The network's encoder extracts multi-scale image features with the help of an effective and lightweight channel attention module. The decoder then transforms these features bottom-up using tree filter modules. Combined with the boundary detection module, the segmentation performance can be further improved. We evaluate the effectiveness of our method on two liver MRI benchmarks: one with respiratory artifacts and the other without. Numerical evaluations on different benchmarks demonstrate that GSA-Net consistently outperforms previous state-of-the-art models in terms of segmentation precision on our respiratory artifact dataset, and also achieves notable results on high-quality datasets.</p>","PeriodicalId":56303,"journal":{"name":"IET Image Processing","volume":"19 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2025-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/ipr2.70010","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143389101","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"NewTalker: Exploring frequency domain for speech-driven 3D facial animation with Mamba","authors":"Weiran Niu, Zan Wang, Yi Li, Tangtang Lou","doi":"10.1049/ipr2.70011","DOIUrl":"https://doi.org/10.1049/ipr2.70011","url":null,"abstract":"<p>In the current field of speech-driven 3D facial animation, transformer-based methods are limited in practical applications due to their high computational complexity. A new model—NewTalker—is proposed, which has core modules consisting of the residual bidirectional Mamba (RBM) and the time–frequency domain Kolmogorov–Arnold networks (TFK). The RBM module incorporates the philosophy of Mamba, enhancing the model's predictive ability for sequence data by utilizing both past and future contextual information, thereby reducing the computational complexity. The TFK module integrates the temporal and frequency domain information of audio data through Kolmogorov–Arnold networks, allowing the model to generate 3D facial animations smoothly while learning more detailed features. Extensive experiments and user studies have shown that the proposed NewTalker significantly surpasses current mainstream algorithms in terms of animation quality and inference speed, achieving the state-of-the-art level in this domain.</p>","PeriodicalId":56303,"journal":{"name":"IET Image Processing","volume":"19 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2025-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/ipr2.70011","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143362887","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Research on tea buds detection based on optimized YOLOv5s","authors":"Guanli Li, Jianqiang Lu, Dong Zhang, Zhongyi Guo","doi":"10.1049/ipr2.13319","DOIUrl":"https://doi.org/10.1049/ipr2.13319","url":null,"abstract":"<p>As one of the world's most popular beverages, tea plays a significant role in improving tea production efficiency and quality through the identification of tea shoots during the tea manufacturing process. However, due to the complex morphology, small size, and susceptibility to factors like lighting and obstruction, traditional identification methods suffer from low accuracy and efficiency. In this study, image enhancement techniques such as HSV transformation, horizontal flipping, and vertical flipping were applied to the training dataset to improve model robustness and enhance generalization across varying lighting and angles. To address these challenges in the context of tea buds detection, deep-learning-based object detection methods have emerged as promising solutions. Nevertheless, current object detection technologies still face limitations when detecting tea buds under these conditions. To enhance identification performance, this article proposed an improved YOLOv5s (You Only Look Once version 5 small model) algorithm. In the improved YOLOv5s algorithm, CBAM, SE, and CA attention mechanisms were incorporated into the backbone network to augment feature extraction, and a weighted Bidirectional Feature Pyramid Network (BiFPN) is employed in the neck network to boost performance, resulting in the YOLOv5s_teabuds model. Experimental results indicated that the improved model significantly outperformed the original in terms of precision, recall, mAP and F1-score, with the CA attention mechanism providing the most notable improvement—enhancing precision, recall, mAP and F1-score by 18.119%, 9.633%, 16.496% and 13.524%, respectively. After integrating BiFPN, the YOLOv5s_teabuds model further strengthened performance and robustness, with precision, recall, mAP and F1-score increased by 19.346%, 11.388%, 18.620%, and 15.059%, respectively. Experimental results prove that the optimized YOLOv5s model can provide a real-time, high-precision tea buds detection method for robotic harvesting.</p>","PeriodicalId":56303,"journal":{"name":"IET Image Processing","volume":"19 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2025-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/ipr2.13319","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143248414","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}