Computational graph pangenomics: a tutorial on data structures and their applications.

IF 1.6; Zone 4 (Computer Science); JCR Q3 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)

Natural Computing. Pub Date: 2022-03-01. Epub Date: 2022-03-04. DOI: 10.1007/s11047-022-09882-6

Jasmijn A Baaijens, Paola Bonizzoni, Christina Boucher, Gianluca Della Vedova, Yuri Pirola, Raffaella Rizzi, Jouni Sirén

Natural Computing, volume 21, issue 1, pages 81-108. Journal Article. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10038355/pdf/

Abstract

Computational pangenomics is an emerging research field that is changing the way computer scientists approach challenges in biological sequence analysis. In past decades, contributions from combinatorics, stringology, graph theory and data structures were essential in the development of a plethora of software tools for the analysis of the human genome. These tools allowed computational biologists to approach ambitious projects at population scale, such as the 1000 Genomes Project. A major contribution of the 1000 Genomes Project is the characterization of a broad spectrum of genetic variations in the human genome, including the discovery of novel variations in the South Asian, African and European populations, thus enhancing the catalogue of variability within the reference genome. Currently, the need to take into account both the high variability in population genomes and the specificity of an individual genome in a personalized approach to medicine is rapidly driving the abandonment of the traditional paradigm of using a single reference genome. A graph-based representation of multiple genomes, or a graph pangenome, is replacing the linear reference genome. This means completely rethinking well-established procedures to analyze, store, and access information from genome representations. Properly addressing these challenges is crucial to face the computational tasks of ambitious healthcare projects aiming to characterize human diversity by sequencing one million individuals (Stark et al. 2019). This tutorial aims to introduce readers to the most recent advances in the theory of data structures for the representation of graph pangenomes. We discuss efficient representations of haplotypes and the variability of genotypes in graph pangenomes, and highlight applications in solving computational problems in human and microbial (viral) pangenomes.
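
As a rough illustration of what "a graph-based representation of multiple genomes" can mean, the following minimal Python sketch stores sequence segments on nodes and represents each haplotype as a path through those nodes. It is not taken from the tutorial: the class name VariationGraph, its fields, the helper build_example, and the toy sequences are all hypothetical, and practical pangenome data structures are far more compact than plain dictionaries.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class VariationGraph:
    """Toy sequence graph (hypothetical): nodes carry DNA segments,
    and each haplotype is a path, i.e. an ordered list of node ids."""
    segments: Dict[int, str] = field(default_factory=dict)
    edges: Set[Tuple[int, int]] = field(default_factory=set)
    paths: Dict[str, List[int]] = field(default_factory=dict)

    def add_haplotype(self, name: str, node_ids: List[int]) -> None:
        # Every node on the path must already have a segment.
        for node_id in node_ids:
            if node_id not in self.segments:
                raise KeyError(f"unknown node {node_id}")
        # Record each consecutive pair of nodes as a directed edge.
        self.edges.update(zip(node_ids, node_ids[1:]))
        self.paths[name] = list(node_ids)

    def spell(self, name: str) -> str:
        # Reconstruct a haplotype sequence by concatenating its segments.
        return "".join(self.segments[n] for n in self.paths[name])


def build_example() -> VariationGraph:
    # Two haplotypes that differ by a single-base variant (T vs. C).
    graph = VariationGraph()
    graph.segments = {1: "ACG", 2: "T", 3: "C", 4: "GGA"}
    graph.add_haplotype("ref", [1, 2, 4])
    graph.add_haplotype("alt", [1, 3, 4])
    return graph


if __name__ == "__main__":
    g = build_example()
    print(g.spell("ref"))  # ACGTGGA
    print(g.spell("alt"))  # ACGCGGA
```

In this sketch the shared segments (nodes 1 and 4) are stored once, and the two haplotypes differ only in which of the alternative nodes (2 or 3) their path visits. A linear reference, by contrast, would keep only one of the two sequences.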

Source journal: Natural Computing (Computer Science: Computer Science Applications)

CiteScore: 4.40

Self-citation rate: 4.80%

Review time: 3 months

About the journal: The journal solicits papers on all aspects of natural computing. Because of the journal's interdisciplinary character, a special effort will be made to solicit survey, review, and tutorial papers that make research trends in a given subarea more accessible to the journal's broad audience.