{"title":"Structure-informed protein language models are robust predictors for variant effects.","authors":"Yuanfei Sun, Yang Shen","doi":"10.1007/s00439-024-02695-w","DOIUrl":null,"url":null,"abstract":"<p><p>Emerging variant effect predictors, protein language models (pLMs) learn evolutionary distribution of functional sequences to capture fitness landscape. Considering that variant effects are manifested through biological contexts beyond sequence (such as structure), we first assess how much structure context is learned in sequence-only pLMs and affecting variant effect prediction. And we establish a need to inject into pLMs protein structural context purposely and controllably. We thus introduce a framework of structure-informed pLMs (SI-pLMs), by extending masked sequence denoising to cross-modality denoising for both sequence and structure. Numerical results over deep mutagenesis scanning benchmarks show that our SI-pLMs, even when using smaller models and less data, are robustly top performers against competing methods including other pLMs, which shows that introducing biological context can be more effective at capturing fitness landscape than simply using larger models or bigger data. Case studies reveal that, compared to sequence-only pLMs, SI-pLMs can be better at capturing fitness landscape because (a) learned embeddings of low/high-fitness sequences can be more separable and (b) learned amino-acid distributions of functionally and evolutionarily conserved residues can be of much lower entropy, thus much more conserved, than other residues. Our SI-pLMs are applicable to revising any sequence-only pLMs through model architecture and training objectives. They do not require structure data as model inputs for variant effect prediction and only use structures as context provider and model regularizer during training.</p>","PeriodicalId":13175,"journal":{"name":"Human Genetics","volume":null,"pages":null},"PeriodicalIF":3.8000,"publicationDate":"2024-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Human Genetics","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1007/s00439-024-02695-w","RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"GENETICS & HEREDITY","Score":null,"Total":0}
Citations: 0
Abstract
As emerging variant effect predictors, protein language models (pLMs) learn the evolutionary distribution of functional sequences to capture the fitness landscape. Considering that variant effects are manifested through biological contexts beyond sequence (such as structure), we first assess how much structural context is learned by sequence-only pLMs and how it affects variant effect prediction, and we establish the need to inject protein structural context into pLMs purposely and controllably. We thus introduce a framework of structure-informed pLMs (SI-pLMs) by extending masked sequence denoising to cross-modality denoising over both sequence and structure. Numerical results on deep mutational scanning benchmarks show that our SI-pLMs, even when using smaller models and less data, are robustly top performers against competing methods, including other pLMs, indicating that introducing biological context can be more effective at capturing the fitness landscape than simply using larger models or bigger data. Case studies reveal that, compared to sequence-only pLMs, SI-pLMs can be better at capturing the fitness landscape because (a) their learned embeddings of low- and high-fitness sequences can be more separable and (b) their learned amino-acid distributions at functionally and evolutionarily conserved residues can have much lower entropy, and thus be much more conserved, than those at other residues. Our SI-pLMs are applicable to revising any sequence-only pLM through model architecture and training objectives. They do not require structure data as model input for variant effect prediction and use structures only as a context provider and model regularizer during training.
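To make the training objective concrete, below is a minimal, hypothetical sketch (not the authors' implementation) of what cross-modality masked denoising could look like: a toy transformer encoder is trained to recover masked residues, with a projected per-residue structure embedding added as training-time context only, so that inference needs sequence alone. The module names, dimensions, the 16-dimensional structure featurization, and the log-odds variant scoring are all illustrative assumptions.

```python
# Minimal sketch of cross-modality masked denoising (illustrative only).
# Structure features serve as training-time context; variant scoring at
# inference uses sequence alone.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 21        # 20 amino acids + a mask token (assumed toy vocabulary)
MASK_ID = 20
D_MODEL = 128

class TinySIpLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB, D_MODEL)
        # Hypothetical per-residue structure descriptors (here 16-dim).
        self.struct_proj = nn.Linear(16, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, tokens, struct_feats=None):
        h = self.tok_emb(tokens)
        if struct_feats is not None:  # structure context: training time only
            h = h + self.struct_proj(struct_feats)
        return self.head(self.encoder(h))

def masked_denoising_loss(model, tokens, struct_feats, mask_frac=0.15):
    """Mask random residues; train to recover them given structure context."""
    mask = torch.rand(tokens.shape) < mask_frac
    corrupted = tokens.masked_fill(mask, MASK_ID)
    logits = model(corrupted, struct_feats)
    return F.cross_entropy(logits[mask], tokens[mask])

def variant_score(model, tokens, pos, wt_aa, mut_aa):
    """Sequence-only log-odds of mutant vs. wild type at `pos` (batch size 1)."""
    masked = tokens.clone()
    masked[0, pos] = MASK_ID
    model.eval()
    with torch.no_grad():
        logp = model(masked).log_softmax(-1)[0, pos]
    return (logp[mut_aa] - logp[wt_aa]).item()

# Toy usage: two random length-50 sequences with random structure features.
tokens = torch.randint(0, 20, (2, 50))
struct = torch.randn(2, 50, 16)
model = TinySIpLM()
loss = masked_denoising_loss(model, tokens, struct)
```

Because the structure features enter only through an additive training-time term, the same encoder weights score variants from sequence alone at inference, consistent with the abstract's claim that structures act purely as context provider and regularizer during training.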
Journal Introduction:
Human Genetics is a monthly journal publishing original and timely articles on all aspects of human genetics. The Journal particularly welcomes articles in the areas of behavioral genetics, bioinformatics, cancer genetics and genomics, cytogenetics, developmental genetics, disease association studies, dysmorphology, ELSI (ethical, legal and social issues), evolutionary genetics, gene expression, gene structure and organization, genetics of complex diseases and epistatic interactions, genetic epidemiology, genome biology, genome structure and organization, genotype-phenotype relationships, human genomics, immunogenetics and genomics, linkage analysis and genetic mapping, methods in statistical genetics, molecular diagnostics, mutation detection and analysis, neurogenetics, physical mapping, and population genetics. Articles reporting animal models relevant to human biology or disease are also welcome. Preference will be given to articles that address clinically relevant questions or provide new insights into human biology.
Unless reporting entirely novel and unusual aspects of a topic, clinical case reports, cytogenetic case reports, papers on descriptive population genetics, articles dealing with the frequency of polymorphisms or additional mutations within genes in which numerous lesions have already been described, and papers that report meta-analyses of previously published datasets will normally not be accepted.
The Journal typically will not consider for publication manuscripts that report merely the isolation, map position, structure, and tissue expression profile of a gene of unknown function unless the gene is of particular interest or is a candidate gene involved in a human trait or disorder.