Biomedical text-based detection of colon, lung, and thyroid cancer: A deep learning approach with novel dataset

IF 3.7 · Zone 2 (Engineering & Technology) · Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

Displays · Pub Date: 2025-04-28 · DOI: 10.1016/j.displa.2025.103068

Kubilay Muhammed Sünnetci

Displays, vol. 89, Article 103068 · Journal Article · https://www.sciencedirect.com/science/article/pii/S0141938225001052

Citations: 0

Abstract

Pre-trained Language Models (PLMs) are widely used and increasingly popular. They can address Natural Language Processing (NLP) tasks, and models focused on a specific domain can answer questions in that domain more directly. Within this area, Biomedical Text Classification (BTC) is a fundamental task with many applications, including support for clinical decisions. This study therefore detects colon, lung, and thyroid cancer from biomedical texts. A dataset of 3070 biomedical texts was generated by artificial intelligence for the study: 1020 texts are labeled colon cancer, 1020 lung cancer, and 1030 thyroid cancer. Of the data, 70% is used for training, and the remainder is split between the validation and test sets. After preprocessing, word encoding is used to prepare the model inputs, converting each document into a sequence of numeric indices. Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), Bidirectional LSTM (BiLSTM), LSTM+LSTM, GRU+GRU, BiLSTM+BiLSTM, and LSTM+GRU+BiLSTM architectures are then trained with the training and validation sets and evaluated on the test set. Validation and test performance is reported for every model, and the most successful architecture is embedded in Graphical User Interface (GUI) software.

The results show that LSTM is the most successful model. On the validation set it achieves an accuracy of 91.32%, a sensitivity of 91.33%, a specificity of 95.67%, and an F1 score of 91.32%. On the test set it achieves an accuracy of 85.87%, a sensitivity of 85.88%, a specificity of 92.94%, and an F1 score of 85.90%. The developed models provide comparative results and perform well. Focusing such models on specific problems can yield more effective results for those problems, and the user-friendly GUI application developed in the study lets users apply the models effectively.
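The abstract reports accuracy, sensitivity, specificity, and F1 score for a three-class problem but does not say how the per-class values are combined. The sketch below assumes one-vs-rest counts per class followed by an unweighted (macro) average; that averaging choice, the function name `one_vs_rest_metrics`, and the confusion matrix are illustrative assumptions, not the paper's method or data.

```python
from typing import Dict, List


def one_vs_rest_metrics(cm: List[List[int]]) -> Dict[str, float]:
    """Macro-averaged metrics from a square confusion matrix.

    cm[i][j] is the number of samples whose true class is i and whose
    predicted class is j. Macro averaging is an assumption of this sketch.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(n))
    sens, spec, f1 = [], [], []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                        # true k, predicted other
        fp = sum(cm[i][k] for i in range(n)) - tp   # true other, predicted k
        tn = total - tp - fn - fp
        sens.append(tp / (tp + fn) if tp + fn else 0.0)
        spec.append(tn / (tn + fp) if tn + fp else 0.0)
        denom = 2 * tp + fp + fn
        f1.append(2 * tp / denom if denom else 0.0)
    return {
        "accuracy": correct / total,
        "sensitivity": sum(sens) / n,
        "specificity": sum(spec) / n,
        "f1": sum(f1) / n,
    }


# Made-up confusion matrix (rows: true colon, lung, thyroid); NOT the paper's data.
cm = [
    [90, 6, 4],
    [5, 88, 7],
    [3, 5, 92],
]
metrics = one_vs_rest_metrics(cm)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Under this assumption, with three equally sized classes every misclassification is a false positive for exactly one class, so macro specificity equals 1 − (1 − accuracy)/2. The reported pairs (91.32% and 95.67% on validation, 85.87% and 92.94% on test) are roughly consistent with that relationship, which is why specificity sits well above the other three metrics.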

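Source journal

Displays (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 4.60

Self-citation rate: 25.60%

Articles published: 138

Review time: 92 days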

About the journal: Displays is the international journal covering the research and development of display technology, its effective presentation and perception of information, and applications and systems including the display-human interface. Technical papers on practical developments in display technology provide an effective channel to promote greater understanding and cross-fertilization across the diverse disciplines of the displays community. Original research papers solving ergonomics issues at the display-human interface advance the effective presentation of information. Tutorial papers covering fundamentals, intended for display technologists and human factors engineers new to the field, will also occasionally be featured.