Learning protein language contrastive models with multi-knowledge representation

IF 6.2 | Zone 2, Computer Science | JCR Q1, COMPUTER SCIENCE, THEORY & METHODS

Future Generation Computer Systems-The International Journal of Escience | Pub Date: 2024-10-25 | DOI: 10.1016/j.future.2024.107580

Wenjun Xu, Yingchun Xia, Bifan Sun, Zihao Zhao, Lianggui Tang, Xiaobo Zhou, Qingyong Wang, Lichuan Gu

Volume 164, Article 107580 | https://www.sciencedirect.com/science/article/pii/S0167739X24005442

Citations: 0

Abstract


Learning protein language contrastive models with multi-knowledge representation

Protein representation learning plays a crucial role in understanding biological regulatory mechanisms and in developing proteins and drugs for therapeutic purposes. However, labeled protein data, such as sequenced and functionally annotated proteins, are incomplete and scarce. Contrastive learning has therefore emerged as the preferred technique for learning meaningful representations from unlabeled samples. In addition, natural proteins cannot at present be fully described by protein knowledge extracted from a single domain. This study therefore proposes Pro-CoRL, a protein contrastive model framework based on multi-knowledge representation learning. In particular, Pro-CoRL smooths the objective function using convex approximation, thereby improving training stability. Extensive experiments on predicting protein–protein interaction types and clustering protein families confirm the high accuracy and robustness of Pro-CoRL.
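The abstract does not state Pro-CoRL's objective or the form of its convex approximation, so the following is only a generic illustration of contrastive learning and not the authors' method. It sketches the widely used InfoNCE loss in plain Python, assuming each protein has two embedding "views" (for example, from two different knowledge sources). The function names, the temperature value, and the toy vectors are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch (generic; not Pro-CoRL's objective).

    anchors[i] and positives[i] are two views of the same protein;
    positives[j] with j != i act as in-batch negatives.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of picking the true partner.
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Toy 2-D embeddings (hypothetical data, for illustration only).
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b_aligned = [[0.9, 0.1], [0.1, 0.9]]   # partners point the same way
view_b_shuffled = [[0.1, 0.9], [0.9, 0.1]]  # partners are mismatched

loss_aligned = info_nce(view_a, view_b_aligned)
loss_shuffled = info_nce(view_a, view_b_shuffled)
```

With these toy inputs the loss is lower when the paired views agree than when they are mismatched, which is the behavior a contrastive objective rewards. The log-sum-exp term is itself a smooth, convex stand-in for a hard maximum over candidates; whether Pro-CoRL's smoothing resembles this is not something the abstract confirms.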


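Source journal

Future Generation Computer Systems-The International Journal of Escience (Engineering & Technology: Computer Science, Theory & Methods)

CiteScore: 19.90

Self-citation rate: 2.70%

Articles published: 376

Review time: 10.6 months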

About the journal: Computing infrastructures and systems are constantly evolving, resulting in increasingly complex and collaborative scientific applications. To cope with these advancements, there is a growing need for collaborative tools that can effectively map, control, and execute these applications. Furthermore, with the explosion of Big Data, there is a need for innovative methods and infrastructures to collect, analyze, and derive meaningful insights from the vast amount of data generated. This necessitates the integration of computational and storage capabilities, databases, sensors, and human collaboration. Future Generation Computer Systems aims to pioneer advancements in distributed systems, collaborative environments, high-performance computing, and Big Data analytics. It strives to stay at the forefront of developments in grids, clouds, and the Internet of Things (IoT) to effectively address the challenges posed by these wide-area, fully distributed sensing and computing systems.