TaiChiNet: PCA-based Ying-Yang dilution of inter- and intra-BERT layers to represent anti-coronavirus peptides

IF 7.5 · CAS Tier 1, Computer Science · JCR Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Kewei Li, Shiying Ding, Zhe Guo, Yusi Fan, Hongmei Liu, Yannan Sun, Gongyou Zhang, Ruochi Zhang, Lan Huang, Fengfeng Zhou
DOI: 10.1016/j.eswa.2025.127786
Journal: Expert Systems with Applications, Volume 282, Article 127786
Published: 2025-04-19 (Journal Article)
Citations: 0

Abstract

Numerous studies have demonstrated that biological sequences, such as DNA, RNA, and peptides, can be considered the “language of life”. Utilizing pre-trained language models (LMs) such as ESM2, GPT, and BERT has yielded state-of-the-art (SOTA) results in many cases. However, the increasing size of datasets exponentially escalates the time and hardware resources required for fine-tuning a complete LM. This paper assumed that natural language shares linguistic logic with the “language of life”, such as peptides. We took the BERT model as an example in a novel Principal Component Analysis (PCA)-based Ying-Yang dilution network of the inter- and intra-BERT layers, termed TaiChiNet, for feature representation of peptide sequences. The Ying-Yang dilution architecture fuses the PCA transformation matrices trained on positive and negative samples, respectively. We transferred the TaiChiNet features into a subtractive layer feature space and observed that TaiChiNet merely rotated the original subtractive features by a certain angle and did not change the relative distances among the dimensions. TaiChiNet-engineered features were integrated with hand-crafted (HC) ones in the prediction model of anti-coronavirus peptides (TaiChiACVP). Experimental results demonstrated that the TaiChiACVP model achieved new SOTA performance and remarkably short training time on five imbalanced datasets established for the anti-coronavirus peptide (ACVP) prediction task. The decision paths of the random forest classifier illustrated that TaiChiNet features can complement HC features for better decisions. TaiChiNet also learned latent features significantly correlated with physicochemical properties, including molecular weight. This makes an explainable connection between the deep learning-represented features and the ACVP-associated physicochemical properties.
Additionally, we extended our work to other LMs, including ESM2 with 6 and 12 layers, ProGen2 small and base versions, ProtBERT, and ProtGPT2. Due to the limitations of these recent LMs, none of them outperformed TaiChiACVP. However, some limitations of TaiChiNet remain to be investigated in the future, including learnable rotation degrees, extended fusions of more layers, and an end-to-end training architecture. The source code is freely available at: http://www.healthinformaticslab.org/supp/resources.php.
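The Ying-Yang dilution described in the abstract fuses PCA transformation matrices fitted separately on positive and negative samples. The following is a minimal, hypothetical sketch of that idea only; the actual TaiChiNet fusion rule, layer selection, and all variable names below are assumptions, with a simple difference-based fusion used purely for illustration:

```python
# Illustrative sketch: fit PCA separately on positive and negative samples,
# fuse the two transformation matrices, and project features through the
# fused matrix. NOT the published TaiChiNet algorithm.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Toy stand-ins for BERT-layer embeddings of peptide sequences
X_pos = rng.normal(size=(100, 32))   # 100 positive samples, 32-dim features
X_neg = rng.normal(size=(120, 32))   # 120 negative samples

k = 8  # number of principal components to keep
pca_pos = PCA(n_components=k).fit(X_pos)
pca_neg = PCA(n_components=k).fit(X_neg)

# Fuse the two PCA transformation matrices (components_ has shape k x 32).
# A plain difference ("Ying minus Yang") is an illustrative assumption here.
W_fused = pca_pos.components_ - pca_neg.components_

def taichi_features(X):
    """Project mean-centered features through the fused matrix."""
    return (X - X.mean(axis=0)) @ W_fused.T

Z = taichi_features(np.vstack([X_pos, X_neg]))
print(Z.shape)  # (220, 8)
```

This mirrors the abstract's observation that the fused transform acts as a fixed linear map (a rotation of the subtractive feature space): no model weights are fine-tuned, which is consistent with the reported short training times.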
Source journal: Expert Systems with Applications (Category: Engineering Technology / Engineering: Electrical & Electronic)
CiteScore: 13.80
Self-citation rate: 10.60%
Articles per year: 2045
Review time: 8.7 months
Journal introduction: Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.