Authors: Wenjun Xu, Yingchun Xia, Bifan Sun, Zihao Zhao, Lianggui Tang, Xiaobo Zhou, Qingyong Wang, Lichuan Gu
Journal: Future Generation Computer Systems-The International Journal of Escience, volume 164, Article 107580 (impact factor 6.2; JCR Q1, Computer Science, Theory & Methods)
Publication date: 2024-10-25
DOI: 10.1016/j.future.2024.107580
URL: https://www.sciencedirect.com/science/article/pii/S0167739X24005442
Learning protein language contrastive models with multi-knowledge representation
Protein representation learning plays a crucial role in obtaining a comprehensive understanding of biological regulatory mechanisms and in developing proteins and drugs for therapeutic purposes. However, labeled protein data, such as sequenced and functionally annotated entries, are scarce and incomplete. Thus, contrastive learning has emerged as the preferred technique for learning meaningful representations from unlabeled data samples. In addition, knowledge extracted from a single domain cannot, at present, fully describe natural proteins. Therefore, this study proposes Pro-CoRL, a protein contrastive model framework based on multi-knowledge representation learning. In particular, Pro-CoRL smooths the objective function using convex approximation, thereby improving the stability of training. Extensive experiments on predicting protein–protein interaction types and clustering protein families have confirmed the high accuracy and robustness of Pro-CoRL.
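To make the idea of contrastive learning on unlabeled samples concrete, the following is a minimal sketch of a generic InfoNCE-style contrastive loss. It is not the Pro-CoRL objective: the abstract does not specify that objective or its convex approximation, so none of it is reproduced here. The function names, the temperature value, and the random "embeddings" standing in for two knowledge sources are all assumptions made for illustration.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce_loss(view_a, view_b, temperature=0.1):
    """Generic InfoNCE loss (illustrative; not the paper's objective).

    Row i of view_a and row i of view_b are treated as a positive pair;
    every other row of view_b acts as a negative for row i.
    """
    a = [normalize(v) for v in view_a]
    b = [normalize(v) for v in view_b]
    total = 0.0
    for i, anchor in enumerate(a):
        # Temperature-scaled cosine similarity to every candidate.
        logits = [sum(x * y for x, y in zip(anchor, cand)) / temperature
                  for cand in b]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        # Negative log-probability assigned to the true partner.
        total += log_denominator - logits[i]
    return total / len(a)


# Hypothetical data: 8 proteins with 16-dimensional embeddings from one source,
# and a slightly perturbed copy standing in for a second knowledge source.
random.seed(0)
seq_emb = [[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]
aligned = [[x + 0.01 * random.gauss(0, 1) for x in row] for row in seq_emb]
mismatched = aligned[::-1]  # pairs deliberately broken by reversing the order

loss_aligned = info_nce_loss(seq_emb, aligned)
loss_mismatched = info_nce_loss(seq_emb, mismatched)
```

In this toy setup the loss is lower when the two views of each protein agree than when the pairing is broken, which is the signal a contrastive model uses in place of labels.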
About the journal:
Computing infrastructures and systems are constantly evolving, resulting in increasingly complex and collaborative scientific applications. To cope with these advancements, there is a growing need for collaborative tools that can effectively map, control, and execute these applications.
Furthermore, with the explosion of Big Data, there is a requirement for innovative methods and infrastructures to collect, analyze, and derive meaningful insights from the vast amount of data generated. This necessitates the integration of computational and storage capabilities, databases, sensors, and human collaboration.
Future Generation Computer Systems aims to pioneer advancements in distributed systems, collaborative environments, high-performance computing, and Big Data analytics. It strives to stay at the forefront of developments in grids, clouds, and the Internet of Things (IoT) to effectively address the challenges posed by these wide-area, fully distributed sensing and computing systems.