Navigating Eukaryotic Genome Annotation Pipelines: A Route Map to BRAKER, Galba, and TSEBRA
Authors: Tomáš Brůna, Lars Gabriel, Katharina J. Hoff
arXiv: 2403.19416 (arXiv - QuanBio - Genomics)
Published: 2024-03-28
Abstract
Annotating the structure of protein-coding genes represents a major challenge
in the analysis of eukaryotic genomes. This task sets the groundwork for
subsequent genomic studies aimed at understanding the functions of individual
genes. BRAKER and Galba are two fully automated and containerized pipelines
designed to perform accurate genome annotation. BRAKER integrates the
GeneMark-ETP and AUGUSTUS gene finders, employing the TSEBRA combiner to attain
high sensitivity and precision. BRAKER is adept at handling genomes of any
size, provided that it has access to both RNA-Seq data
and an extensive protein database from the target clade. BRAKER also
achieves high accuracy with only one of these extrinsic evidence
sources, although accuracy diminishes for larger
genomes under such conditions. In contrast, Galba takes a different approach:
it uses direct protein-to-genome spliced alignments from
miniprot to generate training genes and evidence for gene prediction in
AUGUSTUS. Galba has superior accuracy in large genomes if protein sequences are
the only source of evidence. This chapter provides practical guidelines for
employing both pipelines in the annotation of eukaryotic genomes, with a focus
on insect genomes.
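
The pipeline guidance in the abstract can be read as a small decision rule. The sketch below is only an illustration of that reading, not code from BRAKER or Galba: the function name `recommend_pipeline`, the `Recommendation` type, and especially the `large_genome_bp` threshold are hypothetical, since the abstract does not say where a genome starts to count as "large".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recommendation:
    pipeline: str  # "BRAKER" or "Galba"
    caveat: str    # empty when the abstract states no caveat for this case


def recommend_pipeline(has_rnaseq: bool, has_proteins: bool,
                       genome_size_bp: int,
                       large_genome_bp: int = 1_000_000_000) -> Recommendation:
    """Map available evidence and genome size to a pipeline, per the abstract.

    large_genome_bp is an assumed cutoff chosen for illustration only.
    """
    if not (has_rnaseq or has_proteins):
        # The abstract only covers runs with extrinsic evidence.
        raise ValueError("no extrinsic evidence: case not covered by the abstract")

    is_large = genome_size_bp >= large_genome_bp

    # Both evidence types: BRAKER handles genomes of any size.
    if has_rnaseq and has_proteins:
        return Recommendation("BRAKER", "")

    # Proteins only, large genome: Galba is reported as more accurate.
    if has_proteins and is_large:
        return Recommendation("Galba", "")

    # One evidence type otherwise: BRAKER, with reduced accuracy if large.
    if is_large:
        return Recommendation(
            "BRAKER",
            "accuracy may be reduced for a large genome with one evidence type",
        )
    return Recommendation("BRAKER", "")
```

For example, `recommend_pipeline(False, True, 3_000_000_000)` selects Galba under the assumed 1 Gbp cutoff, while the same protein-only evidence on a 200 Mbp genome selects BRAKER.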