Integrating NLP and LLMs to discover biomarkers and mechanisms in Alzheimer's disease

IF 3.7 | Zone 4 (Medicine) | JCR Q3, BIOCHEMICAL RESEARCH METHODS

SLAS Technology Pub Date : 2025-02-21 DOI:10.1016/j.slast.2025.100257

JinTao Song, JunJie Huang, RuiLi Liu

Volume 31, Article 100257. Full text: https://www.sciencedirect.com/science/article/pii/S2472630325000159

Citations: 0

Abstract

Alzheimer's disease (AD) is a progressive neurological condition characterized by cognitive decline, memory loss, and aberrant behaviour. It affects millions of people globally and is one of the main causes of dementia. Its mechanisms are intricate and multifaceted, which makes the disease difficult to understand and to identify early, and conventional diagnostic techniques frequently fail to detect it in its early stages. This research proposes an approach that combines Natural Language Processing (NLP) and Large Language Models (LLMs) to identify potential biomarkers and underlying mechanisms of AD. Clinical data, including genetic information, neuroimaging scans, and medical records, are gathered from publicly accessible databases and healthcare facilities. Unstructured clinical notes are pre-processed by tokenization, and genetic profiles and neuroimaging data are standardized with Z-score normalization for consistency. Multi-Input Convolutional Neural Networks (MI-CNN) fuse the diverse data sources so that they can be analyzed together. Key biomarkers linked to AD are identified and categorized using GenBERT, a Genetic Algorithm combined with Bidirectional Encoder Representations from Transformers (BERT). By tuning BERT's hyperparameters with genetic optimization, GenBERT enables the effective analysis of large medical datasets such as patient histories, genetic data, and clinical notes. The combined strategy improves feature selection and the model's capacity to identify subtle genomic and linguistic patterns suggestive of AD. The goal of this integrated strategy is to provide early diagnostic tools and new insights into the pathogenesis of the disease, which could transform methods for detecting and treating AD. For early AD prediction, the GenBERT model performs better than current techniques, obtaining the highest accuracy (98.30%) and F1-score (0.97), as well as greater precision (0.95) and recall (0.92). It also identifies both positive and negative AD cases reliably, with a sensitivity of 98.65% and a specificity of 99.73%. Overall, GenBERT offers a trustworthy and useful tool for early AD diagnosis.
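The abstract does not say how the genetic optimization of BERT's hyperparameters is set up. The following is only a minimal sketch of a genetic hyperparameter search in general, not the authors' implementation. The search space, the function names, and `toy_fitness` are all hypothetical; in a real run the fitness function would fine-tune BERT with the given configuration and return a validation score.

```python
import random

# Hypothetical search space; the source does not list GenBERT's actual hyperparameters.
SEARCH_SPACE = {
    "learning_rate": [1e-5, 2e-5, 3e-5, 5e-5],
    "batch_size": [8, 16, 32],
    "dropout": [0.1, 0.2, 0.3],
    "epochs": [2, 3, 4],
}


def toy_fitness(config):
    """Stand-in for a validation score (assumption); higher is better."""
    score = 1.0
    score -= abs(config["learning_rate"] - 3e-5) * 1e4
    score -= abs(config["batch_size"] - 16) / 100
    score -= abs(config["dropout"] - 0.1)
    score -= abs(config["epochs"] - 3) / 10
    return score


def random_config(rng):
    """Draw one value per hyperparameter uniformly from the search space."""
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}


def crossover(parent_a, parent_b, rng):
    """Uniform crossover: each gene is copied from one parent at random."""
    return {name: rng.choice([parent_a[name], parent_b[name]]) for name in SEARCH_SPACE}


def mutate(config, rng, rate=0.2):
    """Resample each gene from its allowed values with probability `rate`."""
    return {
        name: rng.choice(SEARCH_SPACE[name]) if rng.random() < rate else value
        for name, value in config.items()
    }


def genetic_search(fitness, generations=10, population_size=12, elite=2, seed=0):
    """Return the best configuration found and the best fitness per generation."""
    rng = random.Random(seed)
    population = [random_config(rng) for _ in range(population_size)]
    history = []
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        history.append(fitness(ranked[0]))
        parents = ranked[: population_size // 2]  # truncation selection
        children = [
            mutate(crossover(*rng.sample(parents, 2), rng), rng)
            for _ in range(population_size - elite)
        ]
        population = ranked[:elite] + children  # elitism keeps the best configs
    best = max(population, key=fitness)
    return best, history


if __name__ == "__main__":
    best_config, best_per_generation = genetic_search(toy_fitness)
    print(best_config)
    print(best_per_generation)
```

Because the top `elite` configurations are carried over unchanged, the best fitness never decreases from one generation to the next. With an expensive fitness function such as a full fine-tuning run, caching the score of each evaluated configuration would avoid repeated training.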


Source journal

SLAS Technology (Computer Science: Computer Science Applications)

CiteScore: 6.30

Self-citation rate: 7.40%

Review time: 106 days

Journal description: SLAS Technology emphasizes scientific and technical advances that enable and improve life sciences research and development; drug delivery; diagnostics; biomedical and molecular imaging; and personalized and precision medicine. This includes high-throughput and other laboratory automation technologies; micro/nanotechnologies; analytical, separation, and quantitative techniques; synthetic chemistry and biology; informatics (data analysis, statistics, bio-, genomic, and chemoinformatics); and more.