Navigating Eukaryotic Genome Annotation Pipelines: A Route Map to BRAKER, Galba, and TSEBRA

arXiv - QuanBio - Genomics | Pub Date: 2024-03-28 | DOI: arxiv-2403.19416

Tomáš Brůna, Lars Gabriel, Katharina J. Hoff



Abstract

Annotating the structure of protein-coding genes is a major challenge in the analysis of eukaryotic genomes. This task lays the groundwork for subsequent genomic studies aimed at understanding the functions of individual genes. BRAKER and Galba are two fully automated, containerized pipelines designed for accurate genome annotation. BRAKER integrates the GeneMark-ETP and AUGUSTUS gene finders and uses the TSEBRA combiner to attain high sensitivity and precision. BRAKER handles genomes of any size, provided that it has access to both transcript expression sequencing data and an extensive protein database from the target clade. BRAKER also achieves high accuracy with only one of these two extrinsic evidence sources, although accuracy then diminishes for larger genomes. In contrast, Galba takes a distinct approach: it uses direct protein-to-genome spliced alignments from miniprot to generate training genes and evidence for gene prediction with AUGUSTUS. Galba has superior accuracy in large genomes when protein sequences are the only source of evidence. This chapter provides practical guidelines for using both pipelines to annotate eukaryotic genomes, with a focus on insect genomes.


