CoCoPyE: feature engineering for learning and prediction of genome quality indices.

IF 11.8 2区生物学 Q1 MULTIDISCIPLINARY SCIENCES

GigaScience Pub Date : 2024-01-02 DOI:10.1093/gigascience/giae079

Niklas Birth, Nicolina Leppich, Julia Schirmacher, Nina Andreae, Rasmus Steinkamp, Matthias Blanke, Peter Meinicke

{"title":"CoCoPyE: feature engineering for learning and prediction of genome quality indices.","authors":"Niklas Birth, Nicolina Leppich, Julia Schirmacher, Nina Andreae, Rasmus Steinkamp, Matthias Blanke, Peter Meinicke","doi":"10.1093/gigascience/giae079","DOIUrl":null,"url":null,"abstract":"Background: The exploration of the microbial world has been greatly advanced by the reconstruction of genomes from metagenomic sequence data. However, the rapidly increasing number of metagenome-assembled genomes has also resulted in a wide variation in data quality. It is therefore essential to quantify the achieved completeness and possible contamination of a reconstructed genome before it is used in subsequent analyses. The classical approach for the estimation of quality indices solely relies on a relatively small number of universal single-copy genes. Recent tools try to extend the genomic coverage of estimates for an increased accuracy.Results: We developed CoCoPyE, a fast tool based on a novel 2-stage feature extraction and transformation scheme. First, it identifies genomic markers and then refines the marker-based estimates with a machine learning approach. In our simulation studies, CoCoPyE showed a more accurate prediction of quality indices than the existing tools. While the CoCoPyE web server offers an easy way to try out the tool, the freely available Python implementation enables integration into existing genome reconstruction pipelines.Conclusions: CoCoPyE provides a new approach to assess the quality of genome data. It complements and improves existing tools and may help researchers to better distinguish between low-quality draft and high-quality genome assemblies in metagenome sequencing projects.","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":"13 ","pages":""},"PeriodicalIF":11.8000,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11503480/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"GigaScience","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1093/gigascience/giae079","RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MULTIDISCIPLINARY SCIENCES","Score":null,"Total":0}

引用次数: 0

Abstract

Background: The exploration of the microbial world has been greatly advanced by the reconstruction of genomes from metagenomic sequence data. However, the rapidly increasing number of metagenome-assembled genomes has also resulted in a wide variation in data quality. It is therefore essential to quantify the achieved completeness and possible contamination of a reconstructed genome before it is used in subsequent analyses. The classical approach for the estimation of quality indices solely relies on a relatively small number of universal single-copy genes. Recent tools try to extend the genomic coverage of estimates for an increased accuracy.

Results: We developed CoCoPyE, a fast tool based on a novel 2-stage feature extraction and transformation scheme. First, it identifies genomic markers and then refines the marker-based estimates with a machine learning approach. In our simulation studies, CoCoPyE showed a more accurate prediction of quality indices than the existing tools. While the CoCoPyE web server offers an easy way to try out the tool, the freely available Python implementation enables integration into existing genome reconstruction pipelines.

Conclusions: CoCoPyE provides a new approach to assess the quality of genome data. It complements and improves existing tools and may help researchers to better distinguish between low-quality draft and high-quality genome assemblies in metagenome sequencing projects.

查看原文本刊更多论文

CoCoPyE：用于学习和预测基因组质量指数的特征工程。

背景：通过元基因组序列数据重建基因组极大地推动了对微生物世界的探索。然而，元基因组组装基因组数量的迅速增加也导致了数据质量的巨大差异。因此，在将重建的基因组用于后续分析之前，必须对其达到的完整性和可能的污染进行量化。估算质量指数的经典方法仅依赖于相对较少的通用单拷贝基因。最近的工具试图扩大估算的基因组覆盖范围以提高准确性：我们开发了 CoCoPyE，这是一种基于新颖的两阶段特征提取和转换方案的快速工具。首先，它能识别基因组标记，然后通过机器学习方法完善基于标记的估计值。在我们的模拟研究中，CoCoPyE 对质量指标的预测比现有工具更准确。CoCoPyE 网络服务器提供了一种试用该工具的简便方法，而免费提供的 Python 实现则可将其集成到现有的基因组重建管道中：结论：CoCoPyE 提供了一种评估基因组数据质量的新方法。结论：CoCoPyE 提供了一种评估基因组数据质量的新方法，它是对现有工具的补充和改进，可帮助研究人员在元基因组测序项目中更好地区分低质量草案和高质量基因组组装。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

GigaScience MULTIDISCIPLINARY SCIENCES-

CiteScore

15.50

自引率

1.10%

发文量

119

审稿时长

1 weeks

期刊介绍： GigaScience seeks to transform data dissemination and utilization in the life and biomedical sciences. As an online open-access open-data journal, it specializes in publishing "big-data" studies encompassing various fields. Its scope includes not only "omic" type data and the fields of high-throughput biology currently serviced by large public repositories, but also the growing range of more difficult-to-access data, such as imaging, neuroscience, ecology, cohort data, systems biology and other new types of large-scale shareable data.