Exploring synthetic datasets for computer-aided detection: a case study using phantom scan data for enhanced lung nodule false positive reduction.

IF 1.9 Q3 RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING

Journal of Medical Imaging Pub Date : 2024-07-01 Epub Date: 2024-08-07 DOI:10.1117/1.JMI.11.4.044507

Mohammad Mehdi Farhangi, Michael Maynord, Cornelia Fermüller, Yiannis Aloimonos, Berkman Sahiner, Nicholas Petrick

{"title":"Exploring synthetic datasets for computer-aided detection: a case study using phantom scan data for enhanced lung nodule false positive reduction.","authors":"Mohammad Mehdi Farhangi, Michael Maynord, Cornelia Fermüller, Yiannis Aloimonos, Berkman Sahiner, Nicholas Petrick","doi":"10.1117/1.JMI.11.4.044507","DOIUrl":null,"url":null,"abstract":"Purpose: Synthetic datasets hold the potential to offer cost-effective alternatives to clinical data, ensuring privacy protections and potentially addressing biases in clinical data. We present a method leveraging such datasets to train a machine learning algorithm applied as part of a computer-aided detection (CADe) system.Approach: Our proposed approach utilizes clinically acquired computed tomography (CT) scans of a physical anthropomorphic phantom into which manufactured lesions were inserted to train a machine learning algorithm. We treated the training database obtained from the anthropomorphic phantom as a simplified representation of clinical data and increased the variability in this dataset using a set of randomized and parameterized augmentations. Furthermore, to mitigate the inherent differences between phantom and clinical datasets, we investigated adding unlabeled clinical data into the training pipeline.Results: We apply our proposed method to the false positive reduction stage of a lung nodule CADe system in CT scans, in which regions of interest containing potential lesions are classified as nodule or non-nodule regions. Experimental results demonstrate the effectiveness of the proposed method; the system trained on labeled data from physical phantom scans and unlabeled clinical data achieves a sensitivity of 90% at eight false positives per scan. Furthermore, the experimental results demonstrate the benefit of the physical phantom in which the performance in terms of competitive performance metric increased by 6% when a training set consisting of 50 clinical CT scans was enlarged by the scans obtained from the physical phantom.Conclusions: The scalability of synthetic datasets can lead to improved CADe performance, particularly in scenarios in which the size of the labeled clinical data is limited or subject to inherent bias. Our proposed approach demonstrates an effective utilization of synthetic datasets for training machine learning algorithms.","PeriodicalId":47707,"journal":{"name":"Journal of Medical Imaging","volume":"11 4","pages":"044507"},"PeriodicalIF":1.9000,"publicationDate":"2024-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11304989/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Medical Imaging","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1117/1.JMI.11.4.044507","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/8/7 0:00:00","PubModel":"Epub","JCR":"Q3","JCRName":"RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING","Score":null,"Total":0}

引用次数: 0

Abstract

Purpose: Synthetic datasets hold the potential to offer cost-effective alternatives to clinical data, ensuring privacy protections and potentially addressing biases in clinical data. We present a method leveraging such datasets to train a machine learning algorithm applied as part of a computer-aided detection (CADe) system.

Approach: Our proposed approach utilizes clinically acquired computed tomography (CT) scans of a physical anthropomorphic phantom into which manufactured lesions were inserted to train a machine learning algorithm. We treated the training database obtained from the anthropomorphic phantom as a simplified representation of clinical data and increased the variability in this dataset using a set of randomized and parameterized augmentations. Furthermore, to mitigate the inherent differences between phantom and clinical datasets, we investigated adding unlabeled clinical data into the training pipeline.

Results: We apply our proposed method to the false positive reduction stage of a lung nodule CADe system in CT scans, in which regions of interest containing potential lesions are classified as nodule or non-nodule regions. Experimental results demonstrate the effectiveness of the proposed method; the system trained on labeled data from physical phantom scans and unlabeled clinical data achieves a sensitivity of 90% at eight false positives per scan. Furthermore, the experimental results demonstrate the benefit of the physical phantom in which the performance in terms of competitive performance metric increased by 6% when a training set consisting of 50 clinical CT scans was enlarged by the scans obtained from the physical phantom.

Conclusions: The scalability of synthetic datasets can lead to improved CADe performance, particularly in scenarios in which the size of the labeled clinical data is limited or subject to inherent bias. Our proposed approach demonstrates an effective utilization of synthetic datasets for training machine learning algorithms.

查看原文本刊更多论文

探索用于计算机辅助检测的合成数据集：使用幻影扫描数据增强肺结节假阳性降低的案例研究。

目的：合成数据集有可能为临床数据提供具有成本效益的替代品，确保隐私得到保护，并有可能解决临床数据的偏差问题。我们提出了一种利用此类数据集训练机器学习算法的方法，该算法作为计算机辅助检测（CADe）系统的一部分应用：我们提出的方法利用临床获得的计算机断层扫描（CT）扫描物理拟人模型，在模型中插入人造病灶来训练机器学习算法。我们将从拟人模型中获得的训练数据库视为临床数据的简化表示，并使用一组随机化和参数化的增强功能来增加该数据集的可变性。此外，为了减少模型数据集和临床数据集之间的固有差异，我们还研究了在训练管道中添加未标记的临床数据的方法：我们将所提出的方法应用于 CT 扫描中肺部结节 CADe 系统的减少假阳性阶段，其中包含潜在病变的感兴趣区被分类为结节或非结节区域。实验结果证明了所提方法的有效性；根据物理模型扫描的标记数据和未标记的临床数据训练的系统，在每次扫描出现 8 个假阳性的情况下，灵敏度达到了 90%。此外，实验结果还证明了物理模型的优势，当一个由 50 个临床 CT 扫描组成的训练集被从物理模型中获得的扫描数据扩大时，在性能指标方面的表现提高了 6%：合成数据集的可扩展性可以提高 CADe 的性能，尤其是在标注的临床数据规模有限或存在固有偏差的情况下。我们提出的方法展示了如何有效利用合成数据集来训练机器学习算法。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Journal of Medical Imaging RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING-

CiteScore

4.10

自引率

4.20%

发文量

期刊介绍： JMI covers fundamental and translational research, as well as applications, focused on medical imaging, which continue to yield physical and biomedical advancements in the early detection, diagnostics, and therapy of disease as well as in the understanding of normal. The scope of JMI includes: Imaging physics, Tomographic reconstruction algorithms (such as those in CT and MRI), Image processing and deep learning, Computer-aided diagnosis and quantitative image analysis, Visualization and modeling, Picture archiving and communications systems (PACS), Image perception and observer performance, Technology assessment, Ultrasonic imaging, Image-guided procedures, Digital pathology, Biomedical applications of biomedical imaging. JMI allows for the peer-reviewed communication and archiving of scientific developments, translational and clinical applications, reviews, and recommendations for the field.