Using Machine Learning Algorithms in Determining the Stage of Breast Cancer from Pathology Reports

Frontiers in Health Informatics. Pub Date: 2024-02-01. DOI: 10.30699/fhi.v13i0.519

Shirin Samadzad-Qushchi, Parinaz Eskandarian, Z. Niazkhani, Ali Rashidi, H. Pirnejad

Abstract

Introduction: After a cancer diagnosis, the most important step is to determine the stage and grade of the cancer. Pathology reports are the main source for cancer staging, but they do not contain all the information needed for staging. However, the text of these reports is sometimes the only information available. We investigated whether text mining methods can predict staging from pathology reports alone.

Material and Methods: A total of 698 pathology reports of breast cancer cases, together with their TNM staging, were collected from multiple centers in West Azerbaijan Province, Iran. After the semi-structured reports were prepared, their texts were imported into a program written in Python 3. Three machine learning algorithms (Logistic Regression, SVM, and Naïve Bayes) were applied in a simple text mining pipeline. The performance of the algorithms was evaluated in terms of accuracy, precision, recall, and F1 score.

Results: The Naïve Bayes algorithm scored above 91% on all evaluation criteria (accuracy, precision, recall, and F1 score). It classified the reports with high efficiency, and its predictions were more often correct than those of the other two algorithms: Naïve Bayes outperformed SVM and Logistic Regression in accuracy, recall, and F1 score. In addition, Naïve Bayes showed faster inference, owing to its simplicity and its lower computational and training time.

Conclusion: We suggest using the design proposed in this study to predict breast cancer staging where staging is needed but pathology reports are the only information available. This method may not be useful for the clinical management of cancer patients, but it can be safely used for epidemiological estimations.
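The classification and evaluation steps named in the Material and Methods can be illustrated with a minimal sketch. The abstract does not describe the authors' preprocessing, feature extraction, class labels, or library, so everything below is an assumption made for illustration: a hand-written multinomial Naïve Bayes with Laplace smoothing, whitespace tokenization, macro-averaged metrics, and four invented report snippets with hypothetical "early" and "advanced" labels. None of it is data or code from the study.

```python
import math
from collections import Counter, defaultdict


def train_nb(texts, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model on whitespace-tokenized text."""
    word_counts = defaultdict(Counter)  # per-class token frequencies
    for text, label in zip(texts, labels):
        word_counts[label].update(text.lower().split())
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    return {
        "class_docs": Counter(labels),  # documents per class, for the priors
        "word_counts": dict(word_counts),
        "vocab": vocab,
        "alpha": alpha,  # Laplace smoothing strength
    }


def predict_nb(model, text):
    """Return the class with the highest log posterior for one text."""
    n_docs = sum(model["class_docs"].values())
    vocab_size = len(model["vocab"])
    alpha = model["alpha"]
    best_label, best_score = None, -math.inf
    for label, doc_count in model["class_docs"].items():
        counts = model["word_counts"][label]
        total = sum(counts.values())
        score = math.log(doc_count / n_docs)  # log prior
        for token in text.lower().split():
            if token not in model["vocab"]:
                continue  # ignore tokens never seen in training
            score += math.log((counts[token] + alpha) / (total + alpha * vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def macro_scores(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1."""
    classes = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(classes)
    return {
        "accuracy": sum(1 for t, p in pairs if t == p) / len(pairs),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Invented snippets and hypothetical labels, for illustration only.
TRAIN_TEXTS = [
    "tumor size 1.2 cm no lymph node involvement",
    "tumor size 1.5 cm nodes negative",
    "tumor size 4 cm three lymph nodes positive",
    "large tumor 5 cm multiple nodes positive",
]
TRAIN_LABELS = ["early", "early", "advanced", "advanced"]

if __name__ == "__main__":
    model = train_nb(TRAIN_TEXTS, TRAIN_LABELS)
    test_texts = ["small tumor nodes negative", "tumor 5 cm nodes positive"]
    test_labels = ["early", "advanced"]
    predictions = [predict_nb(model, t) for t in test_texts]
    print(predictions)
    print(macro_scores(test_labels, predictions))
```

Macro averaging is only one reasonable choice here; the abstract does not state how precision, recall, and F1 were averaged across stages, and a weighted average would give different numbers on imbalanced classes.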
