Tokenized and continuous embedding compressions of protein sequence and structure

Amy X Lu, Wilson Yan, Kevin K Yang, Vladimir Gligorijevic, Kyunghyun Cho, Pieter Abbeel, Richard Bonneau, Nathan C Frey

Patterns 6(6): 101289. Published June 13, 2025. DOI: 10.1016/j.patter.2025.101289. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12191763/pdf/
Existing machine learning representations of proteins typically model either the sequence or the structure distribution, leaving the other modality implicit. Here, we characterize an embedding of the joint distribution of protein sequence and structure by compressing the latent space of the protein folding model ESMFold. This provides mechanistic interpretability insights as well as a flexible compressed representation. We term these CHEAP (compressed hourglass embedding adaptations of proteins) embeddings. In continuous compression schemes, the ESMFold latent space can be reduced by factors of 128× along the channel dimension and 8× along the length while retaining structure information at <2 Å accuracy and performing competitively on protein function and localization benchmarks. In discrete compression schemes, we construct a tokenized all-atom structure vocabulary that retains high reconstruction accuracy, thus introducing a tokenized representation of all-atom structure that can be obtained from the sequence alone. CHEAP democratizes representations captured by large models and can enable flexible downstream applications such as generation, search, and prediction.
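The continuous scheme described above amounts to an hourglass-style autoencoder over the folding model's per-residue latents. The following is a minimal, hypothetical sketch, not the authors' implementation: it assumes a 1024-dimensional ESMFold latent and realizes the 128× channel reduction (1024 → 8) and 8× length reduction from the abstract as a single strided 1-D convolution with a transposed-convolution decoder. The `HourglassCompressor` name and all architectural details are illustrative assumptions.

```python
# Hypothetical sketch of hourglass-style latent compression (not the CHEAP code).
# Assumes a 1024-dim per-residue ESMFold latent; 128x channel and 8x length
# reductions are modeled with one strided Conv1d and its transpose.
import torch
import torch.nn as nn

class HourglassCompressor(nn.Module):
    def __init__(self, d_in=1024, channel_factor=128, length_factor=8):
        super().__init__()
        d_latent = d_in // channel_factor  # 1024 / 128 = 8 channels
        # Encoder: strided conv shortens the sequence 8x and narrows channels 128x.
        self.encoder = nn.Conv1d(d_in, d_latent,
                                 kernel_size=length_factor,
                                 stride=length_factor)
        # Decoder: transposed conv restores the original length and width.
        self.decoder = nn.ConvTranspose1d(d_latent, d_in,
                                          kernel_size=length_factor,
                                          stride=length_factor)

    def forward(self, x):  # x: (batch, seq_len, d_in)
        z = self.encoder(x.transpose(1, 2))       # (batch, 8, seq_len // 8)
        x_hat = self.decoder(z).transpose(1, 2)   # (batch, seq_len, d_in)
        return z, x_hat

model = HourglassCompressor()
latent = torch.randn(2, 256, 1024)  # toy stand-in for ESMFold latents
z, recon = model(latent)
print(z.shape, recon.shape)  # torch.Size([2, 8, 32]) torch.Size([2, 256, 1024])
```

Under these assumptions the compressed tensor is 1,024× smaller than the input (256 × 1024 values down to 8 × 32); a reconstruction loss between `recon` and the original latent would drive training. The paper's discrete scheme would additionally quantize `z` against a learned codebook to yield structure tokens.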