Beatriz Costa-Gomes, Joel Greer, Nikolai Juraschko, James Parkhurst, Jola Mirecka, Marjan Famili, Camila Rangel-Smith, Oliver Strickson, Alan Lowe, Mark Basham, Tom Burnley
PERC: a suite of software tools for the curation of cryoEM data with application to simulation, modeling and machine learning
Acta Crystallographica Section F, Structural Biology Communications, pp. 441-450. Published 2025-10-01 (Epub 2025-09-09). DOI: 10.1107/S2053230X25007575 (https://doi.org/10.1107/S2053230X25007575)
Abstract
Ease of access to data, tools and models expedites scientific research. In structural biology there are now numerous open repositories of experimental and simulated data sets. Being able to easily access and utilize these is crucial to allow researchers to make optimal use of their research effort. The tools presented here are useful for collating existing public cryoEM data sets and/or creating new synthetic cryoEM data sets to aid the development of novel data processing and interpretation algorithms. In recent years, structural biology has seen the development of a multitude of machine-learning-based algorithms to aid numerous steps in the processing and reconstruction of experimental data sets and the use of these approaches has become widespread. Developing such techniques in structural biology requires access to large data sets, which can be cumbersome to curate and unwieldy to make use of. In this paper, we present a suite of Python software packages, which we collectively refer to as PERC (profet, EMPIARreader and CAKED). These are designed to reduce the burden which data curation places upon structural biology research. The protein structure fetcher (profet) package allows users to conveniently download and cleave sequences or structures from the Protein Data Bank or AlphaFold databases. EMPIARreader allows lazy loading of Electron Microscopy Public Image Archive data sets in a machine-learning-compatible structure. The Class Aggregator for Key Electron-microscopy Data (CAKED) package is designed to seamlessly facilitate the training of machine-learning models on electron microscopy data, including electron-cryo-microscopy-specific data augmentation and labeling. These packages may be utilized independently or as building blocks in workflows. All are available in open-source repositories and designed to be easily extensible to facilitate more advanced workflows if required.
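The lazy loading mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the EMPIARreader or CAKED API: the class name `LazyImageDataset`, the `loader` parameter, the `loads` counter and the convention that the class label is the parent directory name are all assumptions made for illustration. The idea shown is only that file paths are indexed up front, while file contents are read when an item is requested.

```python
from pathlib import Path
from typing import Callable, List, Sequence, Tuple


class LazyImageDataset:
    """Index files up front; read a file only when its item is requested.

    Hypothetical illustration, not the EMPIARreader or CAKED API.
    """

    def __init__(
        self,
        paths: Sequence[Path],
        loader: Callable[[Path], bytes] = Path.read_bytes,
    ) -> None:
        self.paths: List[Path] = list(paths)  # cheap: stores paths only
        self.loader = loader                  # called once per requested item
        self.loads = 0                        # number of reads performed so far

    def __len__(self) -> int:
        return len(self.paths)

    def __getitem__(self, index: int) -> Tuple[bytes, str]:
        path = self.paths[index]
        self.loads += 1
        # Assumption: the class label is the name of the parent directory.
        label = path.parent.name
        return self.loader(path), label
```

Because the object exposes `__len__` and `__getitem__`, it follows the map-style dataset convention used by common machine-learning frameworks, so a data loader could in principle iterate over it without the whole data set being held in memory.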
About the journal:
Acta Crystallographica Section F is a rapid structural biology communications journal.
Articles on any aspect of structural biology, including structures determined using high-throughput methods or from iterative studies such as those used in the pharmaceutical industry, are welcomed by the journal.
The journal offers the option of open access, and all communications benefit from unlimited free use of colour illustrations and no page charges. Authors are encouraged to submit multimedia content for publication with their articles.
Acta Cryst. F has a dedicated online tool called publBio that is designed to make the preparation and submission of articles easier for authors.