TourSynbio: A Multi-Modal Large Model and Agent Framework to Bridge Text and Protein Sequences for Protein Engineering

Yiqing Shen, Zan Chen, Michail Mamalakis, Yungeng Liu, Tianbin Li, Yanzhou Su, Junjun He, Pietro Liò, Yu Guang Wang
{"title":"TourSynbio: A Multi-Modal Large Model and Agent Framework to Bridge Text and Protein Sequences for Protein Engineering","authors":"Yiqing Shen, Zan Chen, Michail Mamalakis, Yungeng Liu, Tianbin Li, Yanzhou Su, Junjun He, Pietro Liò, Yu Guang Wang","doi":"arxiv-2408.15299","DOIUrl":null,"url":null,"abstract":"The structural similarities between protein sequences and natural languages\nhave led to parallel advancements in deep learning across both domains. While\nlarge language models (LLMs) have achieved much progress in the domain of\nnatural language processing, their potential in protein engineering remains\nlargely unexplored. Previous approaches have equipped LLMs with protein\nunderstanding capabilities by incorporating external protein encoders, but this\nfails to fully leverage the inherent similarities between protein sequences and\nnatural languages, resulting in sub-optimal performance and increased model\ncomplexity. To address this gap, we present TourSynbio-7B, the first\nmulti-modal large model specifically designed for protein engineering tasks\nwithout external protein encoders. TourSynbio-7B demonstrates that LLMs can\ninherently learn to understand proteins as language. The model is post-trained\nand instruction fine-tuned on InternLM2-7B using ProteinLMDataset, a dataset\ncomprising 17.46 billion tokens of text and protein sequence for\nself-supervised pretraining and 893K instructions for supervised fine-tuning.\nTourSynbio-7B outperforms GPT-4 on the ProteinLMBench, a benchmark of 944\nmanually verified multiple-choice questions, with 62.18% accuracy. Leveraging\nTourSynbio-7B's enhanced protein sequence understanding capability, we\nintroduce TourSynbio-Agent, an innovative framework capable of performing\nvarious protein engineering tasks, including mutation analysis, inverse\nfolding, protein folding, and visualization. TourSynbio-Agent integrates\npreviously disconnected deep learning models in the protein engineering domain,\noffering a unified conversational user interface for improved usability.\nFinally, we demonstrate the efficacy of TourSynbio-7B and TourSynbio-Agent\nthrough two wet lab case studies on vanilla key enzyme modification and steroid\ncompound catalysis.","PeriodicalId":501022,"journal":{"name":"arXiv - QuanBio - Biomolecules","volume":"15 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - QuanBio - Biomolecules","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2408.15299","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

The structural similarities between protein sequences and natural languages have led to parallel advancements in deep learning across both domains. While large language models (LLMs) have made substantial progress in natural language processing, their potential in protein engineering remains largely unexplored. Previous approaches have equipped LLMs with protein understanding capabilities by incorporating external protein encoders, but this fails to fully leverage the inherent similarities between protein sequences and natural languages, resulting in sub-optimal performance and increased model complexity. To address this gap, we present TourSynbio-7B, the first multi-modal large model specifically designed for protein engineering tasks without external protein encoders. TourSynbio-7B demonstrates that LLMs can inherently learn to understand proteins as language. The model is post-trained and instruction fine-tuned from InternLM2-7B using ProteinLMDataset, a dataset comprising 17.46 billion tokens of text and protein sequences for self-supervised pretraining and 893K instructions for supervised fine-tuning. TourSynbio-7B achieves 62.18% accuracy on ProteinLMBench, a benchmark of 944 manually verified multiple-choice questions, outperforming GPT-4. Leveraging TourSynbio-7B's enhanced protein sequence understanding, we introduce TourSynbio-Agent, an innovative framework capable of performing various protein engineering tasks, including mutation analysis, inverse folding, protein folding, and visualization. TourSynbio-Agent integrates previously disconnected deep learning models in the protein engineering domain, offering a unified conversational user interface for improved usability. Finally, we demonstrate the efficacy of TourSynbio-7B and TourSynbio-Agent through two wet-lab case studies on vanilla key enzyme modification and steroid compound catalysis.
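
Because TourSynbio-7B is post-trained from InternLM2-7B, a released checkpoint in Hugging Face format could be queried roughly as follows. This is a minimal sketch under stated assumptions: the repository ID `tour-synbio/TourSynbio-7B` is a placeholder rather than a confirmed release path, and the prompt is illustrative.

```python
# Minimal sketch of querying a TourSynbio-7B checkpoint with Hugging Face
# transformers. The repo ID below is a placeholder, not a confirmed release.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "tour-synbio/TourSynbio-7B"  # hypothetical repository path

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, trust_remote_code=True, device_map="auto"
)

# Protein sequences are passed inline as plain text, since the model reads
# them as language rather than through an external protein encoder.
prompt = (
    "Given the protein sequence MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ, "
    "which residue positions are most promising for stabilizing mutations?"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
))
```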
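On the evaluation side, 62.18% accuracy over ProteinLMBench's 944 questions corresponds to 587 correct answers. A scoring loop in that style might look like the sketch below; `ask_model`, `extract_choice`, and the JSON schema are assumptions, since the abstract does not specify the benchmark's format.

```python
# Sketch of multiple-choice accuracy scoring in the style of ProteinLMBench.
# The data schema and ask_model() are illustrative assumptions.
import json
import re

def ask_model(question: str, options: dict[str, str]) -> str:
    """Placeholder for a call to TourSynbio-7B; returns a raw text answer."""
    raise NotImplementedError

def extract_choice(answer: str) -> str | None:
    """Pull the first standalone option letter (A-D) out of a free-text reply."""
    match = re.search(r"\b([A-D])\b", answer)
    return match.group(1) if match else None

def score(path: str) -> float:
    with open(path) as f:
        questions = json.load(f)  # assumed: list of {question, options, answer}
    correct = sum(
        extract_choice(ask_model(q["question"], q["options"])) == q["answer"]
        for q in questions
    )
    return correct / len(questions)

# With 944 verified questions, 587 correct answers yields 62.18% accuracy.
```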
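TourSynbio-Agent's role of unifying previously disconnected models behind one conversational interface suggests a tool-routing pattern. The skeleton below is an assumption about that architecture, not the paper's implementation; the tool names simply mirror the tasks listed in the abstract, and a real agent would let the LLM select the tool and its arguments rather than matching keywords.

```python
# Illustrative tool-routing skeleton for an agent like TourSynbio-Agent.
# The registry and intent-detection logic are assumptions, not the paper's code.
from typing import Callable

def mutation_analysis(seq: str) -> str: ...
def inverse_folding(structure: str) -> str: ...
def protein_folding(seq: str) -> str: ...
def visualization(structure: str) -> str: ...

# More specific keywords come first so "inverse folding" is not
# shadowed by the shorter "fold" entry (dicts preserve insertion order).
TOOLS: dict[str, Callable[[str], str]] = {
    "mutation": mutation_analysis,
    "inverse folding": inverse_folding,
    "fold": protein_folding,
    "visualize": visualization,
}

def dispatch(user_message: str, payload: str) -> str:
    """Route a conversational request to the first matching tool.

    Keyword matching stands in for the LLM-driven tool selection that a
    production agent would perform.
    """
    lowered = user_message.lower()
    for keyword, tool in TOOLS.items():
        if keyword in lowered:
            return tool(payload)
    return "No matching protein engineering tool found."
```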