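Development of a Machine Learning-Based Screening Method for Thyroid Nodules Classification by Solving the Imbalance Challenge in Thyroid Nodules Data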

Impact factor: 1.4; JCR quartile: Q3; category: Public, Environmental & Occupational Health
Sajad Khodabandelu, Naser Ghaemian, Soraya Khafri, Mehdi Ezoji, Sara Khaleghi
DOI: 10.34172/jrhs.2022.90 (https://doi.org/10.34172/jrhs.2022.90)
Journal: Journal of Research in Health Sciences
Publication date: 2022-10-19
Publication type: Journal Article
Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10422153/pdf/
Citations: 1

Development of a Machine Learning-Based Screening Method for Thyroid Nodules Classification by Solving the Imbalance Challenge in Thyroid Nodules Data.

Background: This study aims to show how imbalanced data and typical evaluation methods affect the development of machine learning-based models for preoperative thyroid nodule screening and can lead to misleading assessments of those models.

Study design: A retrospective study.

Methods: Ultrasonography features for 431 thyroid nodules were extracted from the medical records of 313 patients in Babol, Iran. Because thyroid nodules are commonly benign, such data are usually imbalanced across classes, which can bias learning models toward the majority class. To address this, a SMOTE-based hybrid resampling method was used to create balanced data. The support vector classification (SVC) algorithm was then trained in Python on the balanced and unbalanced datasets as Models 2 and 3, respectively. Their performance was compared with that of a conventionally fitted logistic regression model (Model 1).
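To make the resampling idea in the Methods concrete, here is a minimal standard-library sketch of SMOTE-style oversampling: each synthetic minority point is placed on the segment between a minority sample and one of its nearest minority neighbours. The function name `smote_oversample`, the toy two-feature points, and the parameters `k` and `seed` are all assumptions made for illustration. The abstract does not specify which hybrid variant the authors used, and hybrid methods usually add a cleaning or undersampling step that is not shown here.

```python
import random
from math import dist


def smote_oversample(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points interpolated between minority samples."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point (excluding itself).
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Random position along the segment from base to the chosen neighbour.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical toy data: two numeric features per nodule, not the study's data.
benign = [(0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.2, 0.4), (0.4, 0.2), (0.1, 0.5)]
malignant = [(0.8, 0.9), (0.9, 0.7), (0.7, 0.8)]

new_points = smote_oversample(malignant, n_new=len(benign) - len(malignant), k=2)
balanced_malignant = malignant + new_points
print(len(benign), len(balanced_malignant))
```

In practice one would more likely rely on an established implementation, for example the combined resamplers in the imbalanced-learn package together with scikit-learn's `SVC`, and apply resampling to the training split only so that evaluation data stay untouched.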

Results: The prevalence of malignant nodules was 14% (n = 61). In addition, 87% of the patients were women; however, the prevalence of malignancy did not differ by gender. The accuracy, area under the curve, and geometric mean were estimated at 92.1%, 93.2%, and 76.8% for Model 1; 91.3%, 93%, and 77.6% for Model 2; and 91%, 92.6%, and 84.2% for Model 3, respectively. The results also identified microcalcification, taller-than-wide shape, and lack of iso- and hyperechogenicity as the most influential variables for malignancy.
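The gap between accuracy and geometric mean in the Results can be illustrated with a short calculation. The sketch below assumes the usual binary definition of the geometric mean, the square root of sensitivity times specificity, which the abstract does not spell out. It uses the reported class counts (61 malignant out of 431 nodules) and a hypothetical classifier that labels every nodule benign; that classifier is an illustration, not one of the study's models.

```python
from math import sqrt


def accuracy(tp, fn, tn, fp):
    """Share of all cases that are classified correctly."""
    return (tp + tn) / (tp + fn + tn + fp)


def geometric_mean(tp, fn, tn, fp):
    """Square root of sensitivity times specificity (assumed definition)."""
    sensitivity = tp / (tp + fn)  # recall on the malignant class
    specificity = tn / (tn + fp)  # recall on the benign class
    return sqrt(sensitivity * specificity)


# Class counts from the abstract: 431 nodules, 61 malignant, so 370 benign.
# Hypothetical "always benign" classifier: it misses every malignant nodule.
tp, fn, tn, fp = 0, 61, 370, 0

always_benign_accuracy = accuracy(tp, fn, tn, fp)
always_benign_gmean = geometric_mean(tp, fn, tn, fp)
print(f"accuracy={always_benign_accuracy:.3f}, g-mean={always_benign_gmean:.3f}")
# prints: accuracy=0.858, g-mean=0.000
```

A model that never detects malignancy still scores about 86% accuracy on data with this class ratio, while its geometric mean is zero, which is why the geometric mean is the more informative measure when classes are imbalanced.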

Conclusion: Addressing data challenges such as class imbalance and using appropriate evaluation measures can improve the performance of machine learning models for preoperative thyroid nodule screening.

Source journal
Journal of Research in Health Sciences
Category: Public, Environmental & Occupational Health
CiteScore: 2.30
Self-citation rate: 13.30%
Articles published: 7
About the journal: The Journal of Research in Health Sciences (JRHS) is the official journal of the School of Public Health, Hamadan University of Medical Sciences. It is a peer-reviewed, multidisciplinary public health journal that is published quarterly and, since 2017, electronically. It publishes contributions from Epidemiology, Biostatistics, Public Health, Occupational Health, Environmental Health, Health Education, and Preventive and Social Medicine. It does not publish clinical trials, nursing studies, animal studies, qualitative studies, nutritional studies, or studies on health insurance and hospital management, nor the results of laboratory and chemical studies in the field of ergonomics, occupational health, and environmental health.