BUSCO: Assessing Genomic Data Quality and Beyond

Current Protocols. Pub Date: 2021-12-01. DOI: 10.1002/cpz1.323

Mosè Manni, Matthew R Berkeley, Mathieu Seppey, Evgeny M Zdobnov


Citations: 219

Abstract

Evaluation of the quality of genomic "data products" such as genome assemblies or gene sets is of critical importance in order to recognize possible issues and correct them during the generation of new data. It is equally essential to guide subsequent or comparative analyses with existing data, as the correct interpretation of the results necessarily requires knowledge about the quality level and reliability of the inputs. Using datasets of near-universal single-copy orthologs derived from OrthoDB, BUSCO can estimate the completeness and redundancy of genomic data by providing biologically meaningful metrics based on expected gene content. These can complement technical metrics such as contiguity measures (e.g., number of contigs/scaffolds, and N50 values).

Here, we describe the use of the BUSCO tool suite to assess different data types that can range from genome assemblies of single isolates and assembled transcriptomes and annotated gene sets to metagenome-assembled genomes where the taxonomic origin of the species is unknown. BUSCO is the only tool capable of assessing all these types of sequences from both eukaryotic and prokaryotic species. The protocols detail the various BUSCO running modes and the novel workflows introduced in versions 4 and 5, including the batch analysis on multiple inputs, the auto-lineage workflow to run assessments without specifying a dataset, and a workflow for the evaluation of (large) eukaryotic genomes. The protocols further cover the BUSCO setup, guidelines to interpret the results, and BUSCO "plugin" workflows for performing common operations in genomics using BUSCO results, such as building phylogenomic trees and visualizing syntenies. © 2021 The Authors. Current Protocols published by Wiley Periodicals LLC.

Basic Protocol 1: Assessing an input sequence with a BUSCO dataset specified manually
Basic Protocol 2: Assessing an input sequence with a dataset automatically selected by BUSCO
Basic Protocol 3: Assessing multiple inputs
Alternate Protocol: Decreasing analysis runtime when assessing a large number of small genomes with BUSCO auto-lineage workflow and Snakemake
Support Protocol 1: BUSCO setup
Support Protocol 2: Visualizing BUSCO results
Support Protocol 3: Building phylogenomic trees
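
The abstract names N50 as one of the technical contiguity metrics that BUSCO's gene-content metrics complement. As a rough illustration of what that metric measures, the sketch below computes N50 using the common definition: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. The function name `n50` and the example lengths are made up for illustration and are not part of BUSCO or the protocols.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths.

    Uses the common definition: sort lengths from longest to shortest and
    return the length at which the running total first reaches half of the
    summed assembly length.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths, not taken from any real assembly.
contig_lengths = [80, 70, 50, 40, 30, 20, 10]
print("contigs:", len(contig_lengths), "N50:", n50(contig_lengths))
```

A high N50 says only that the assembly is contiguous, not that the expected genes are present, which is why the abstract presents gene-content completeness as a complementary measure.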
