{"title":"由大型语言模型定义的折叠蛋白质序列空间结构。","authors":"A Zambon, R Zecchina, G Tiana","doi":"10.1088/1478-3975/ad205c","DOIUrl":null,"url":null,"abstract":"<p><p>Proteins populate a manifold in the high-dimensional sequence space whose geometrical structure guides their natural evolution. Leveraging recently-developed structure prediction tools based on transformer models, we first examine the protein sequence landscape as defined by an effective energy that is a proxy of sequence foldability. This landscape shares characteristics with optimization challenges encountered in machine learning and constraint satisfaction problems. Our analysis reveals that natural proteins predominantly reside in wide, flat minima within this energy landscape. To investigate further, we employ statistical mechanics algorithms specifically designed to explore regions with high local entropy in relatively flat landscapes. Our findings indicate that these specialized algorithms can identify valleys with higher entropy compared to those found using traditional methods such as Monte Carlo Markov Chains. In a proof-of-concept case, we find that these highly entropic minima exhibit significant similarities to natural sequences, especially in critical key sites and local entropy. Additionally, evaluations through Molecular Dynamics suggests that the stability of these sequences closely resembles that of natural proteins. Our tool combines advancements in machine learning and statistical physics, providing new insights into the exploration of sequence landscapes where wide, flat minima coexist alongside a majority of narrower minima.</p>","PeriodicalId":20207,"journal":{"name":"Physical biology","volume":null,"pages":null},"PeriodicalIF":2.0000,"publicationDate":"2024-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Structure of the space of folding protein sequences defined by large language models.\",\"authors\":\"A Zambon, R Zecchina, G Tiana\",\"doi\":\"10.1088/1478-3975/ad205c\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>Proteins populate a manifold in the high-dimensional sequence space whose geometrical structure guides their natural evolution. Leveraging recently-developed structure prediction tools based on transformer models, we first examine the protein sequence landscape as defined by an effective energy that is a proxy of sequence foldability. This landscape shares characteristics with optimization challenges encountered in machine learning and constraint satisfaction problems. Our analysis reveals that natural proteins predominantly reside in wide, flat minima within this energy landscape. To investigate further, we employ statistical mechanics algorithms specifically designed to explore regions with high local entropy in relatively flat landscapes. Our findings indicate that these specialized algorithms can identify valleys with higher entropy compared to those found using traditional methods such as Monte Carlo Markov Chains. In a proof-of-concept case, we find that these highly entropic minima exhibit significant similarities to natural sequences, especially in critical key sites and local entropy. Additionally, evaluations through Molecular Dynamics suggests that the stability of these sequences closely resembles that of natural proteins. Our tool combines advancements in machine learning and statistical physics, providing new insights into the exploration of sequence landscapes where wide, flat minima coexist alongside a majority of narrower minima.</p>\",\"PeriodicalId\":20207,\"journal\":{\"name\":\"Physical biology\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":2.0000,\"publicationDate\":\"2024-01-31\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Physical biology\",\"FirstCategoryId\":\"99\",\"ListUrlMain\":\"https://doi.org/10.1088/1478-3975/ad205c\",\"RegionNum\":4,\"RegionCategory\":\"生物学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q4\",\"JCRName\":\"BIOCHEMISTRY & MOLECULAR BIOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Physical biology","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1088/1478-3975/ad205c","RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"BIOCHEMISTRY & MOLECULAR BIOLOGY","Score":null,"Total":0}
Structure of the space of folding protein sequences defined by large language models.
Proteins populate a manifold in the high-dimensional sequence space whose geometrical structure guides their natural evolution. Leveraging recently-developed structure prediction tools based on transformer models, we first examine the protein sequence landscape as defined by an effective energy that is a proxy of sequence foldability. This landscape shares characteristics with optimization challenges encountered in machine learning and constraint satisfaction problems. Our analysis reveals that natural proteins predominantly reside in wide, flat minima within this energy landscape. To investigate further, we employ statistical mechanics algorithms specifically designed to explore regions with high local entropy in relatively flat landscapes. Our findings indicate that these specialized algorithms can identify valleys with higher entropy compared to those found using traditional methods such as Monte Carlo Markov Chains. In a proof-of-concept case, we find that these highly entropic minima exhibit significant similarities to natural sequences, especially in critical key sites and local entropy. Additionally, evaluations through Molecular Dynamics suggests that the stability of these sequences closely resembles that of natural proteins. Our tool combines advancements in machine learning and statistical physics, providing new insights into the exploration of sequence landscapes where wide, flat minima coexist alongside a majority of narrower minima.
期刊介绍:
Physical Biology publishes articles in the broad interdisciplinary field bridging biology with the physical sciences and engineering. This journal focuses on research in which quantitative approaches – experimental, theoretical and modeling – lead to new insights into biological systems at all scales of space and time, and all levels of organizational complexity.
Physical Biology accepts contributions from a wide range of biological sub-fields, including topics such as:
molecular biophysics, including single molecule studies, protein-protein and protein-DNA interactions
subcellular structures, organelle dynamics, membranes, protein assemblies, chromosome structure
intracellular processes, e.g. cytoskeleton dynamics, cellular transport, cell division
systems biology, e.g. signaling, gene regulation and metabolic networks
cells and their microenvironment, e.g. cell mechanics and motility, chemotaxis, extracellular matrix, biofilms
cell-material interactions, e.g. biointerfaces, electrical stimulation and sensing, endocytosis
cell-cell interactions, cell aggregates, organoids, tissues and organs
developmental dynamics, including pattern formation and morphogenesis
physical and evolutionary aspects of disease, e.g. cancer progression, amyloid formation
neuronal systems, including information processing by networks, memory and learning
population dynamics, ecology, and evolution
collective action and emergence of collective phenomena.