Thyroid cancer is one of the most common cancers in clinical practice, and accurate classification of thyroid nodule ultrasound images is crucial for computer-aided diagnosis. Models based on either a convolutional neural network (CNN) or a Transformer alone struggle to integrate local and global features, which limits recognition accuracy.
Our method is designed to simultaneously capture the key local fine-grained features and the global spatial features essential for thyroid nodule diagnosis. It adapts to the irregular morphology of thyroid nodules and dynamically focuses on their key pixel-level regions, thereby improving the model's recognition accuracy and generalization ability.
Inspired by the dual-path mechanism of human visual diagnosis, the proposed multi-scale fusion model, the local and global feature fusion network (LGF-Net), consists of two branches: a CNN branch and a Transformer branch. The CNN branch integrates the wavelet transform and deformable convolution module (WTDCM) to enhance the model's ability to capture discriminative local features and recognize fine-grained textures. The Transformer branch introduces the aggregated attention (AA) mechanism, which mimics biological vision, to capture spatial features effectively. The adaptive feature fusion module (FFM) then integrates the multi-scale features of thyroid nodules from both branches, further improving classification performance.
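As a rough illustration of the wavelet step that the WTDCM builds on, the following sketch computes a one-level 2D Haar decomposition of a small grayscale patch. It is not the authors' implementation: the abstract does not state which wavelet is used or how the sub-bands feed the deformable convolution, so the Haar basis, the function name `haar_dwt2`, and the sub-band names are assumptions made for illustration only.

```python
def haar_dwt2(image):
    """One-level 2D Haar transform of a grayscale image given as a list of rows.

    Assumption: Haar is used only because it is the simplest wavelet;
    the source does not name the wavelet inside WTDCM.
    Returns four half-resolution sub-bands: the low-frequency approximation
    and three detail bands (top-minus-bottom, left-minus-right, diagonal).
    """
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("image height and width must be even")

    approx, row_diff, col_diff, diag = [], [], [], []
    for r in range(0, rows, 2):
        a_row, r_row, c_row, d_row = [], [], [], []
        for c in range(0, cols, 2):
            # The 2x2 block: top-left, top-right, bottom-left, bottom-right.
            tl, tr = image[r][c], image[r][c + 1]
            bl, br = image[r + 1][c], image[r + 1][c + 1]
            # Orthonormal Haar: sums and differences scaled by 1/2.
            a_row.append((tl + tr + bl + br) / 2)  # smooth content
            r_row.append((tl + tr - bl - br) / 2)  # top minus bottom
            c_row.append((tl - tr + bl - br) / 2)  # left minus right
            d_row.append((tl - tr - bl + br) / 2)  # diagonal detail
        approx.append(a_row)
        row_diff.append(r_row)
        col_diff.append(c_row)
        diag.append(d_row)
    return approx, row_diff, col_diff, diag
```

For example, `haar_dwt2([[4, 2], [2, 0]])` returns `([[4.0]], [[2.0]], [[2.0]], [[0.0]])`. The detail bands isolate edge-like and texture-like variation at half resolution, which is the kind of fine-grained local information the CNN branch is described as targeting.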
We evaluated our model on the public thyroid nodule classification dataset (TNCD) and a private clinical dataset using four metrics: accuracy, recall, precision, and F1-score. On TNCD, the model achieved 81.50%, 79.51%, 79.92%, and 79.70% on these metrics, respectively. On the private dataset, it reached 91.24%, 88.90%, 90.73%, and 89.73%, respectively. On both datasets, the model outperformed state-of-the-art methods. We also conducted ablation studies and visualization analysis to validate the model's components and interpretability.
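The four reported metrics can be computed from predicted and true labels as in the sketch below. The abstract does not state how the per-class scores are averaged, so macro-averaging is an assumption here. It would be consistent with the reported F1-scores differing slightly from the harmonic mean of the reported precision and recall, but this is not confirmed by the source. The function name `macro_metrics` and the toy labels are hypothetical.

```python
def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1-score.

    Assumption: macro-averaging (unweighted mean over classes);
    the source does not specify the averaging scheme.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # Per-class F1 is the harmonic mean of that class's precision and recall.
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(1 for t, p in pairs if t == p) / len(pairs),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy example with hypothetical labels (0 = benign, 1 = malignant).
scores = macro_metrics([0, 0, 0, 1, 1, 1, 1, 1], [0, 0, 1, 1, 1, 1, 1, 0])
```

Because macro F1 averages the per-class F1 values, it need not equal the harmonic mean of the macro precision and macro recall, which is worth keeping in mind when comparing the reported figures.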
The experiments demonstrate that our method improves the accuracy of thyroid nodule recognition, shows strong generalization ability and potential for clinical application, and offers interpretability that can support clinicians' diagnoses.