Biomedical text-based detection of colon, lung, and thyroid cancer: A deep learning approach with novel dataset
Author: Kubilay Muhammed Sünnetci
Journal: Displays, volume 89, Article 103068 (JCR Q1, Computer Science, Hardware & Architecture; impact factor 3.7)
Published: 2025-04-28
Publication type: Journal Article
DOI: 10.1016/j.displa.2025.103068
URL: https://www.sciencedirect.com/science/article/pii/S0141938225001052
Citations: 0
Abstract
Pre-trained Language Models (PLMs) are now widely used and increasingly popular. They can be applied to Natural Language Processing (NLP) problems, and focusing them on a specific topic lets them give answers that are directly relevant to that topic. Biomedical Text Classification (BTC), one such focused application, is a fundamental task that supports a range of applications, including clinical decision-making. This study therefore detects colon, lung, and thyroid cancer from biomedical texts. A dataset of 3070 biomedical texts is generated by artificial intelligence and used in the study: 1020 texts are labeled colon cancer, 1020 lung cancer, and 1030 thyroid cancer. Of the data, 70% is used for training, and the remainder is split between the validation and test sets. After all texts are preprocessed, word encoding is used to prepare the model inputs, and the documents are converted into sequences of numeric indices. Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), Bidirectional LSTM (BiLSTM), LSTM+LSTM, GRU+GRU, BiLSTM+BiLSTM, and LSTM+GRU+BiLSTM architectures are then trained with the training and validation sets and evaluated on the test set. Validation and test performance is reported for every model, and the most successful architecture is embedded in a Graphical User Interface (GUI) application. The results show that LSTM is the most successful model. On the validation set it achieves an accuracy of 91.32%, a specificity of 95.67%, a sensitivity of 91.33%, and an F1 score of 91.32%. On the test set it achieves an accuracy of 85.87%, a specificity of 92.94%, a sensitivity of 85.88%, and an F1 score of 85.90%. The developed models provide comparative results and perform well on this task. Focusing such models on specific problems can yield more effective results for those problems. The user-friendly GUI application developed in the study also lets users apply the models effectively.
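As an illustration of the word-encoding step mentioned in the abstract, the following is a minimal Python sketch of how documents might be turned into fixed-length sequences of numeric indices before being fed to a recurrent model. It is not the author's implementation: the tokenizer, the reserved padding and unknown-word indices, the left-padding choice, and the names `tokenize`, `build_vocab`, and `encode` are all assumptions made for this example, and the sample sentences are invented.

```python
from collections import Counter

# Reserved indices (an assumption of this sketch, not taken from the paper).
PAD, UNK = 0, 1


def tokenize(text):
    """Lowercase the text, replace punctuation with spaces, split on whitespace."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text.lower())
    return cleaned.split()


def build_vocab(texts, min_count=1):
    """Map each word seen in the training texts to an integer index >= 2."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    # Most frequent words first; ties broken alphabetically so the mapping is deterministic.
    words = sorted((w for w, c in counts.items() if c >= min_count),
                   key=lambda w: (-counts[w], w))
    return {word: index + 2 for index, word in enumerate(words)}


def encode(text, vocab, max_len):
    """Convert one document into exactly max_len indices (truncate, then left-pad)."""
    ids = [vocab.get(tok, UNK) for tok in tokenize(text)][:max_len]
    return [PAD] * (max_len - len(ids)) + ids


# Invented example sentences, for illustration only.
train_texts = ["Colon polyp detected.", "Lung nodule detected in the left lung."]
vocab = build_vocab(train_texts)

print(encode("Colon polyp detected.", vocab, 5))   # [0, 0, 4, 8, 2]
print(encode("thyroid nodule", vocab, 3))          # [0, 1, 7]  ("thyroid" is unseen -> UNK)
```

In this sketch the vocabulary is built from the training texts only, so words that first appear in validation or test documents map to the unknown index rather than leaking information into the encoding.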
About the journal:
Displays is the international journal covering the research and development of display technology, its effective presentation and perception of information, and applications and systems including display-human interface.
Technical papers on practical developments in Displays technology provide an effective channel to promote greater understanding and cross-fertilization across the diverse disciplines of the Displays community. Original research papers solving ergonomics issues at the display-human interface advance effective presentation of information. Tutorial papers covering fundamentals, intended for display technologists and human factors engineers new to the field, will also occasionally be featured.