TourSynbio: A Multi-Modal Large Model and Agent Framework to Bridge Text and Protein Sequences for Protein Engineering

Yiqing Shen, Zan Chen, Michail Mamalakis, Yungeng Liu, Tianbin Li, Yanzhou Su, Junjun He, Pietro Liò, Yu Guang Wang

arXiv preprint arxiv-2408.15299 (arXiv - QuanBio - Biomolecules), published 2024-08-27
The structural similarities between protein sequences and natural languages have led to parallel advancements in deep learning across both domains. While large language models (LLMs) have made substantial progress in natural language processing, their potential in protein engineering remains largely unexplored. Previous approaches have equipped LLMs with protein understanding capabilities by incorporating external protein encoders, but this fails to fully leverage the inherent similarities between protein sequences and natural languages, resulting in suboptimal performance and increased model complexity. To address this gap, we present TourSynbio-7B, the first multi-modal large model specifically designed for protein engineering tasks without external protein encoders. TourSynbio-7B demonstrates that LLMs can inherently learn to understand proteins as language. The model is post-trained and instruction fine-tuned on InternLM2-7B using ProteinLMDataset, a dataset comprising 17.46 billion tokens of text and protein sequences for self-supervised pretraining and 893K instructions for supervised fine-tuning.
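The abstract does not include code; as a rough illustration of the two-stage recipe it describes (self-supervised pretraining on mixed text and protein tokens, then supervised instruction tuning where sequences are treated as plain text), the minimal sketch below shows what one instruction record and a single training step might look like. The record fields and prompt format are illustrative assumptions, not the authors' released pipeline; only the base checkpoint name comes from the abstract.

# Hedged sketch: a minimal instruction fine-tuning step in the spirit of the
# recipe described above. Record fields and prompt layout are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "internlm/internlm2-7b"  # base model named in the abstract

tokenizer = AutoTokenizer.from_pretrained(BASE, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    BASE, torch_dtype=torch.bfloat16, trust_remote_code=True
)

# One hypothetical instruction record: the protein sequence is ordinary text,
# with no external protein encoder involved.
record = {
    "instruction": "Predict the effect of the mutation A41V on this enzyme.",
    "input": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",  # toy sequence
    "output": "The A41V substitution is conservative and likely tolerated.",
}

def to_example(rec):
    """Concatenate instruction, sequence, and answer into one causal-LM example."""
    text = f"{rec['instruction']}\n{rec['input']}\n{rec['output']}"
    batch = tokenizer(text, return_tensors="pt")
    batch["labels"] = batch["input_ids"].clone()  # standard next-token loss
    return batch

batch = to_example(record)
loss = model(**batch).loss  # same objective over text and sequence tokens alike
loss.backward()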
TourSynbio-7B outperforms GPT-4 on ProteinLMBench, a benchmark of 944 manually verified multiple-choice questions, achieving 62.18% accuracy. Leveraging TourSynbio-7B's enhanced protein sequence understanding, we introduce TourSynbio-Agent, an innovative framework capable of performing various protein engineering tasks, including mutation analysis, inverse folding, protein folding, and visualization. TourSynbio-Agent integrates previously disconnected deep learning models in the protein engineering domain, offering a unified conversational user interface for improved usability.
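To make the agent design concrete, the sketch below shows one plausible shape for such a framework: a single conversational entry point that routes a request to otherwise disconnected protein tools. The keyword-based routing and stub tools are assumptions for illustration; the paper's actual dispatch mechanism may well use the LLM itself to select tools.

# Hedged sketch of a conversational dispatcher over the four task families the
# abstract names. Tool implementations and routing logic are illustrative stubs.
from typing import Callable, Dict

def mutation_analysis(query: str) -> str:
    return f"[mutation-analysis stub] {query}"

def inverse_folding(query: str) -> str:
    return f"[inverse-folding stub] {query}"

def protein_folding(query: str) -> str:
    return f"[folding stub] {query}"

def visualization(query: str) -> str:
    return f"[visualization stub] {query}"

# Insertion order matters: "inverse folding" must be checked before "fold".
TOOLS: Dict[str, Callable[[str], str]] = {
    "mutation": mutation_analysis,
    "inverse folding": inverse_folding,
    "fold": protein_folding,
    "visualize": visualization,
}

def dispatch(user_message: str) -> str:
    """Route a chat message to the first matching tool, else answer directly."""
    text = user_message.lower()
    for keyword, tool in TOOLS.items():
        if keyword in text:
            return tool(user_message)
    return "[LLM reply] " + user_message  # fall back to the language model

print(dispatch("Please fold this sequence: MKTAYIAKQRQISFVKSHFSRQ"))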
Finally, we demonstrate the efficacy of TourSynbio-7B and TourSynbio-Agent through two wet-lab case studies on vanilla key enzyme modification and steroid compound catalysis.