PERC: a suite of software tools for the curation of cryoEM data with application to simulation, modeling and machine learning.

IF 1.1 · Zone 4, Biology · Q4, Biochemical Research Methods
Beatriz Costa-Gomes, Joel Greer, Nikolai Juraschko, James Parkhurst, Jola Mirecka, Marjan Famili, Camila Rangel-Smith, Oliver Strickson, Alan Lowe, Mark Basham, Tom Burnley
Acta crystallographica. Section F, Structural biology communications, pp. 441-450. Journal Article. Published 2025-10-01 (Epub 2025-09-09).
DOI: 10.1107/S2053230X25007575 (https://doi.org/10.1107/S2053230X25007575)
Full text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12485494/pdf/

Abstract

Ease of access to data, tools and models expedites scientific research. In structural biology there are now numerous open repositories of experimental and simulated data sets. Being able to easily access and utilize these is crucial to allow researchers to make optimal use of their research effort. The tools presented here are useful for collating existing public cryoEM data sets and/or creating new synthetic cryoEM data sets to aid the development of novel data processing and interpretation algorithms. In recent years, structural biology has seen the development of a multitude of machine-learning-based algorithms to aid numerous steps in the processing and reconstruction of experimental data sets and the use of these approaches has become widespread. Developing such techniques in structural biology requires access to large data sets, which can be cumbersome to curate and unwieldy to make use of. In this paper, we present a suite of Python software packages, which we collectively refer to as PERC (profet, EMPIARreader and CAKED). These are designed to reduce the burden which data curation places upon structural biology research. The protein structure fetcher (profet) package allows users to conveniently download and cleave sequences or structures from the Protein Data Bank or AlphaFold databases. EMPIARreader allows lazy loading of Electron Microscopy Public Image Archive data sets in a machine-learning-compatible structure. The Class Aggregator for Key Electron-microscopy Data (CAKED) package is designed to seamlessly facilitate the training of machine-learning models on electron microscopy data, including electron-cryo-microscopy-specific data augmentation and labeling. These packages may be utilized independently or as building blocks in workflows. All are available in open-source repositories and designed to be easily extensible to facilitate more advanced workflows if required.
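The fetch-and-cleave step attributed to profet above can be illustrated in plain Python. This is a minimal sketch under stated assumptions, not profet's actual API: the function names (`structure_url`, `fetch_structure`, `cleave`) are hypothetical, the URL templates are the public download patterns of RCSB PDB and the AlphaFold database as we understand them (the AlphaFold model version suffix in particular may change), and "cleaving" is taken here to mean keeping one chain and one residue range of a PDB-format file.

```python
from urllib.request import urlopen

# Assumed public download patterns; neither is taken from the paper.
RCSB_URL = "https://files.rcsb.org/download/{pdb_id}.pdb"
ALPHAFOLD_URL = "https://alphafold.ebi.ac.uk/files/AF-{uniprot_id}-F1-model_v4.pdb"


def structure_url(identifier: str, source: str) -> str:
    """Build a download URL for a PDB ID or a UniProt accession."""
    if source == "pdb":
        return RCSB_URL.format(pdb_id=identifier.upper())
    if source == "alphafold":
        return ALPHAFOLD_URL.format(uniprot_id=identifier)
    raise ValueError(f"unknown source: {source!r}")


def fetch_structure(identifier: str, source: str) -> str:
    """Download a structure as PDB-format text (requires network access)."""
    with urlopen(structure_url(identifier, source)) as response:
        return response.read().decode("utf-8")


def cleave(pdb_text: str, chain: str, start: int, end: int) -> str:
    """Keep ATOM/HETATM records of one chain with start <= residue <= end.

    Uses the fixed-column PDB layout: the chain ID is column 22 and the
    residue sequence number occupies columns 23-26 (1-based).
    """
    kept = []
    for line in pdb_text.splitlines():
        if not line.startswith(("ATOM", "HETATM")) or len(line) < 26:
            continue
        if line[21] != chain:
            continue
        try:
            residue_number = int(line[22:26])
        except ValueError:
            continue  # malformed record; skip rather than guess
        if start <= residue_number <= end:
            kept.append(line)
    return "\n".join(kept) + ("\n" if kept else "")
```

With network access, `cleave(fetch_structure("1tup", "pdb"), "A", 100, 150)` would return only the chain A records for residues 100 to 150. Keeping the download and the cleaving as separate functions means the cleaving logic can be checked offline on any PDB-format text.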

Source journal
Acta crystallographica. Section F, Structural biology communications
Categories: Biochemical Research Methods; Biochemistry & Molecular Biology
CiteScore: 1.90
Self-citation rate: 0.00%
Articles published: 95
Journal description: Acta Crystallographica Section F is a rapid structural biology communications journal. Articles on any aspect of structural biology, including structures determined using high-throughput methods or from iterative studies such as those used in the pharmaceutical industry, are welcomed by the journal. The journal offers the option of open access, and all communications benefit from unlimited free use of colour illustrations and no page charges. Authors are encouraged to submit multimedia content for publication with their articles. Acta Cryst. F has a dedicated online tool called publBio that is designed to make the preparation and submission of articles easier for authors.