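LLM4THP: a computing tool to identify tumor homing peptides by molecular and sequence representation of large language model based on two-layer ensemble model strategy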

IF 2.4 · CAS Zone 3 (Biology) · JCR Q3 (Biochemistry & Molecular Biology)

Amino Acids, Volume 56, Issue 1 · Pub Date: 2024-10-15 · DOI: 10.1007/s00726-024-03422-5

Sen Yang, Piao Xu


Citations: 0

Abstract


LLM4THP: a computing tool to identify tumor homing peptides by molecular and sequence representation of large language model based on two-layer ensemble model strategy

Tumor homing peptides (THPs) have a distinctive capacity to specifically attach to tumor cells, providing a promising approach for targeted cancer treatment and detection. Although THPs have the potential for significant impact, their detection by conventional methods is both time-consuming and expensive. To tackle this issue, we provide LLM4THP, an innovative computational approach that utilizes large language models (LLMs) to quickly and effectively detect THPs. LLM4THP utilizes two protein LLMs, ESM2 and Prot_T5_XL_UniRef50, to encode peptide sequences. This allows for the capture of complex patterns and relationships within the peptide data. In addition, we utilize inherent sequence characteristics such as Amino Acid Composition (AAC), Pseudo Amino Acid Composition (PAAC), Amphiphilic Pseudo Amino Acid Composition (APAAC), and Composition, Transition, and Distribution (CTD) to improve the representation of peptides. The RDKitDescriptors feature representation approach transforms peptide sequences into molecular objects and computes chemical characteristics, resulting in enhanced THP identification. The LLM4THP ensemble strategy incorporates various features into a two-layer learning architecture. The first layer consists of LightGBM, XGBoost, Random Forest, and Extremely Randomized Trees, which generate a set of meta results. The second layer utilizes Logistic Regression to further refine the identification of sequences as either THP or non-THP. LLM4THP exhibits exceptional performance compared to the most advanced methods, showcasing enhancements in accuracy, Matthews correlation coefficient, F1 score, area under the curve, and average precision. The source code and dataset can be accessed at the following URL: https://github.com/abcair/LLM4THP.
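The two-layer architecture described in the abstract can be illustrated with a small sketch. This is not the authors' implementation (see the repository linked above for that); it is a minimal example under several assumptions: only the AAC feature is computed, scikit-learn's `GradientBoostingClassifier` stands in for LightGBM and XGBoost so that the example needs a single library, and the peptide strings and labels are invented toy data rather than real THPs. The names `aac`, `build_stack`, `POSITIVES`, and `NEGATIVES` are hypothetical.

```python
from __future__ import annotations

from collections import Counter

from sklearn.ensemble import (
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aac(sequence: str) -> list[float]:
    """Amino Acid Composition: fraction of each standard residue in the peptide."""
    counts = Counter(sequence.upper())
    total = sum(counts[a] for a in AMINO_ACIDS) or 1  # guard against empty input
    return [counts[a] / total for a in AMINO_ACIDS]


def build_stack(seed: int = 0) -> StackingClassifier:
    """First layer: tree ensembles. Second layer: logistic regression."""
    first_layer = [
        ("rf", RandomForestClassifier(n_estimators=50, random_state=seed)),
        ("et", ExtraTreesClassifier(n_estimators=50, random_state=seed)),
        # Assumption: stand-in for the LightGBM / XGBoost learners in the paper.
        ("gb", GradientBoostingClassifier(random_state=seed)),
    ]
    return StackingClassifier(
        estimators=first_layer,
        final_estimator=LogisticRegression(max_iter=1000),
        stack_method="predict_proba",  # meta features are class probabilities
        cv=3,  # out-of-fold predictions feed the second layer
    )


# Invented toy sequences, NOT real tumor homing peptides.
POSITIVES = ["KRKRKRKRKR", "RKKRRKKRRK", "KKKRRRKKKR",
             "RRKKRKRKKR", "KRRKKRKRRK", "RKRKKKRRRK"]
NEGATIVES = ["LAVLAVLAVL", "AVLLAVVLAA", "LLLAAAVVVL",
             "VALVALVALV", "ALVVLAALVV", "VVLLAAVLAL"]

X = [aac(s) for s in POSITIVES + NEGATIVES]
y = [1] * len(POSITIVES) + [0] * len(NEGATIVES)

model = build_stack().fit(X, y)
proba = model.predict_proba([aac("KRKKRRKRKK")])[0]
print("P(non-THP), P(THP) on toy data:", proba)
```

In a fuller pipeline, the AAC vector would presumably be concatenated with the other descriptors and the LLM embeddings before being passed to the first layer; how the paper combines them is not specified in the abstract.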


Source journal

Amino Acids (Biology: Biochemistry & Molecular Biology)

CiteScore: 6.40

Self-citation rate: 5.70%

Articles published: (not listed)

Review time: 2.2 months

About the journal: Amino Acids publishes contributions from all fields of amino acid and protein research: analysis, separation, synthesis, biosynthesis, cross-linking amino acids, racemization/enantiomers, modifications of amino acids such as phosphorylation, methylation, acetylation, glycosylation and nonenzymatic glycosylation, new roles for amino acids in physiology and pathophysiology, biology, amino acid analogues and derivatives, polyamines, radiated amino acids, peptides, stable isotopes and isotopes of amino acids. Applications in medicine, food chemistry, nutrition, gastroenterology, nephrology, neurochemistry, pharmacology, and excitatory amino acids are just some of the topics covered. Fields of interest include biochemistry, food chemistry, nutrition, neurology, psychiatry, pharmacology, nephrology, gastroenterology, and microbiology.