MassiveFold Data for CASP16-CAPRI: A Systematic Massive Sampling Experiment.

IF 2.8 | Zone 4, Biology | Q2, BIOCHEMISTRY & MOLECULAR BIOLOGY

Proteins-Structure Function and Bioinformatics | Pub Date: 2025-08-28 | DOI: 10.1002/prot.70040

Nessim Raouraoua, Marc F Lensink, Guillaume Brysbaert

Citations: 0

Abstract


Massive sampling with AlphaFold2 has become a widely used approach in protein structure prediction. Here we present the MassiveFold CASP16-CAPRI dataset, a systematic, large-scale sampling of both monomeric and multimeric protein targets. By exploiting maximal parallelization, we produced up to 8040 models per target and shared them with the community for collaborative selection and scoring. This collective effort minimizes redundant computation and environmental impact, while granting resource-limited groups - especially those focused on scoring - access to high quality structures. In our analysis, we define an interface-difficulty classification based on DockQ metrics, showing that massive sampling yields the greatest gains on most of the challenging interfaces. Crucially, this classification can be predicted from the median ipTM scores of a routine AF2 run, enabling users to selectively deploy massive sampling only when it is most needed. Combined with a reduction of the massive sampling from 8040 to 2475 predictions, such targeted strategies dramatically cut computation time and resource use with minimal loss of accuracy. Finally, we underscore the persistent challenge of choosing optimal models from massive sampling datasets, emphasizing the need for more robust scoring methods. The MassiveFold datasets, together with AlphaFold ranking scores and CASP and CAPRI assessment metrics, are publicly available at https://github.com/GBLille/CASP16-CAPRI_MassiveFold_Data to accelerate further progress in protein structure prediction and assembly modeling.
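The targeted strategy described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The sampling sizes 8040 and 2475 come from the abstract, but the ipTM cutoff below is a hypothetical placeholder (the abstract gives no threshold), and the function names are invented for illustration.

```python
from statistics import median

# Sampling sizes quoted in the abstract.
FULL_SAMPLING = 8040
REDUCED_SAMPLING = 2475

# HYPOTHETICAL cutoff: the abstract does not state a threshold value.
HYPOTHETICAL_IPTM_CUTOFF = 0.6


def needs_massive_sampling(iptm_scores, cutoff=HYPOTHETICAL_IPTM_CUTOFF):
    """Return True when the median ipTM of a routine run is below the cutoff.

    A low median ipTM is used here as a stand-in for a "difficult" interface,
    which is where the abstract reports the greatest gains from massive sampling.
    """
    if not iptm_scores:
        raise ValueError("iptm_scores must not be empty")
    return median(iptm_scores) < cutoff


def fraction_of_predictions_saved(full=FULL_SAMPLING, reduced=REDUCED_SAMPLING):
    """Fraction of predictions avoided by running the reduced sampling."""
    return 1 - reduced / full


if __name__ == "__main__":
    routine_run = [0.21, 0.35, 0.30, 0.78, 0.25]  # made-up ipTM values
    print("massive sampling advised:", needs_massive_sampling(routine_run))
    print(f"predictions saved: {fraction_of_predictions_saved():.1%}")
```

Going from 8040 to 2475 predictions per target removes roughly 69% of the predictions. Any real cutoff should be taken from the paper or calibrated on one's own targets.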

Source journal

Proteins-Structure Function and Bioinformatics (Biology: Biochemistry and Molecular Biology)

CiteScore: 5.90
Self-citation rate: 3.40%
Articles published: 172
Review time: 3 months

About the journal: PROTEINS: Structure, Function, and Bioinformatics publishes original reports of significant experimental and analytic research in all areas of protein research: structure, function, computation, genetics, and design. The journal encourages reports that present new experimental or computational approaches for interpreting and understanding data from biophysical chemistry, structural studies of proteins and macromolecular assemblies, alterations of protein structure and function engineered through techniques of molecular biology and genetics, functional analyses under physiologic conditions, as well as the interactions of proteins with receptors, nucleic acids, or other specific ligands or substrates. Research in protein and peptide biochemistry directed toward synthesizing or characterizing molecules that simulate aspects of the activity of proteins, or that act as inhibitors of protein function, is also within the scope of PROTEINS. In addition to full-length reports, short communications (usually not more than 4 printed pages) and prediction reports are welcome. Reviews are typically by invitation; authors are encouraged to submit proposed topics for consideration.