Zunair Safdar, Jinfang Sheng, Muhammad Usman Saeed, Muhammad Ramzan, A. Al-Zubaidi
Empowering cardiovascular diagnostics with SET-MobileNet: A lightweight and accurate deep learning based classification approach

Image and Vision Computing, Volume 162, Article 105684. DOI: 10.1016/j.imavis.2025.105684. Published 2025-07-31.
Citations: 0
Abstract
Cardiovascular diseases (CVDs) remain the leading cause of mortality worldwide, necessitating early detection and accurate diagnosis for improved patient outcomes. This study introduces SET-MobileNet, a lightweight deep learning model designed for automated heart sound classification, integrating transformers to capture long-range dependencies and squeeze-and-excitation (SE) blocks to emphasize relevant acoustic features while suppressing noise artifacts. Unlike traditional methods that rely on handcrafted features, SET-MobileNet employs a multimodal feature extraction approach, incorporating log-mel spectrograms, Mel-Frequency Cepstral Coefficients (MFCCs), chroma features, and zero-crossing rates to enhance classification robustness. The model is evaluated across multiple publicly available heart sound datasets, including CirCor, HSS, GitHub, and Heartbeat Sounds, achieving a state-of-the-art accuracy of 99.95% for 2.0-second heart sound segments in the CirCor dataset. Extensive experiments demonstrate that multimodal feature representations significantly improve classification performance by capturing both time-frequency and spectral characteristics of heart sounds. SET-MobileNet is computationally efficient, with a model size of 8.61 MB and single-sample inference times under 6.5 ms, making it suitable for real-time deployment on mobile and embedded devices. Ablation studies confirm the contributions of transformers and SE blocks, showing incremental improvements in accuracy and noise suppression.
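The squeeze-and-excitation (SE) recalibration the abstract credits with emphasizing relevant acoustic features can be sketched in plain NumPy. This is an illustrative simplification under stated assumptions, not the paper's implementation: the weight matrices `w1` and `w2` stand in for learned bottleneck parameters, and the feature map is treated as a 2-D (channels × time) array.

```python
import numpy as np

def squeeze_excite(feature_map, w1, w2):
    """Channel-wise squeeze-and-excitation recalibration.

    feature_map: (channels, time) array
    w1: (channels, reduced) bottleneck weights
    w2: (reduced, channels) expansion weights
    """
    # Squeeze: global average pooling over time -> one descriptor per channel
    z = feature_map.mean(axis=1)                # (channels,)
    # Excite: bottleneck MLP, ReLU then sigmoid, yields a gate in (0, 1)
    s = np.maximum(z @ w1, 0.0)                 # (reduced,)
    gate = 1.0 / (1.0 + np.exp(-(s @ w2)))      # (channels,)
    # Rescale: informative channels are amplified, noisy ones attenuated
    return feature_map * gate[:, None]

# Toy usage with random stand-in weights
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))          # 8 channels, 16 time frames
w1 = rng.normal(size=(8, 2)) * 0.1    # reduction ratio 4
w2 = rng.normal(size=(2, 8)) * 0.1
y = squeeze_excite(x, w1, w2)
```

Because the gate is a sigmoid output, every channel is scaled by a factor strictly between 0 and 1 here; in a trained network these gates learn to suppress noise-dominated channels, which is the role the abstract attributes to the SE blocks.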
Journal description:
The primary aim of Image and Vision Computing is to provide an effective medium of interchange for the results of high-quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real-world scenes. It seeks to deepen understanding in the discipline by encouraging quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, and image databases.