{"title":"Global exact optimisations for chloroplast structural haplotype scaffolding.","authors":"Victor Epain, Rumen Andonov","doi":"10.1186/s13015-023-00243-1","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Scaffolding is an intermediate stage of fragment assembly. It consists in orienting and ordering the contigs obtained by the assembly of the sequencing reads. In the general case, the problem has been largely studied with the use of distances data between the contigs. Here we focus on a dedicated scaffolding for the chloroplast genomes. As these genomes are small, circular and with few specific repeats, numerous approaches have been proposed to assemble them. However, their specificities have not been sufficiently exploited.</p><p><strong>Results: </strong>We give a new formulation for the scaffolding in the case of chloroplast genomes as a discrete optimisation problem, that we prove the decision version to be [Formula: see text]-Complete. We take advantage of the knowledge of chloroplast genomes and succeed in expressing the relationships between a few specific genomic repeats in mathematical constraints. Our approach is independent of the distances and adopts a genomic regions view, with the priority on scaffolding the repeats first. In this way, we encode the structural haplotype issue in order to retrieve several genome forms that coexist in the same chloroplast cell. To solve exactly the optimisation problem, we develop an integer linear program that we implement in Python3 package khloraascaf. We test it on synthetic data to investigate its performance behaviour and its robustness against several chosen difficulties.</p><p><strong>Conclusions: </strong>We succeed to model biological knowledge on genomic structures to scaffold chloroplast genomes. Our results suggest that modelling genomic regions is sufficient for scaffolding repeats and is suitable for finding several solutions corresponding to several genome forms.</p>","PeriodicalId":50823,"journal":{"name":"Algorithms for Molecular Biology","volume":"19 1","pages":"5"},"PeriodicalIF":1.5000,"publicationDate":"2024-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11288059/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Algorithms for Molecular Biology","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1186/s13015-023-00243-1","RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"BIOCHEMICAL RESEARCH METHODS","Score":null,"Total":0}
引用次数: 0
Abstract
Background: Scaffolding is an intermediate stage of fragment assembly. It consists in orienting and ordering the contigs obtained by the assembly of the sequencing reads. In the general case, the problem has been largely studied with the use of distances data between the contigs. Here we focus on a dedicated scaffolding for the chloroplast genomes. As these genomes are small, circular and with few specific repeats, numerous approaches have been proposed to assemble them. However, their specificities have not been sufficiently exploited.
Results: We give a new formulation for the scaffolding in the case of chloroplast genomes as a discrete optimisation problem, that we prove the decision version to be [Formula: see text]-Complete. We take advantage of the knowledge of chloroplast genomes and succeed in expressing the relationships between a few specific genomic repeats in mathematical constraints. Our approach is independent of the distances and adopts a genomic regions view, with the priority on scaffolding the repeats first. In this way, we encode the structural haplotype issue in order to retrieve several genome forms that coexist in the same chloroplast cell. To solve exactly the optimisation problem, we develop an integer linear program that we implement in Python3 package khloraascaf. We test it on synthetic data to investigate its performance behaviour and its robustness against several chosen difficulties.
Conclusions: We succeed to model biological knowledge on genomic structures to scaffold chloroplast genomes. Our results suggest that modelling genomic regions is sufficient for scaffolding repeats and is suitable for finding several solutions corresponding to several genome forms.
期刊介绍:
Algorithms for Molecular Biology publishes articles on novel algorithms for biological sequence and structure analysis, phylogeny reconstruction, and combinatorial algorithms and machine learning.
Areas of interest include but are not limited to: algorithms for RNA and protein structure analysis, gene prediction and genome analysis, comparative sequence analysis and alignment, phylogeny, gene expression, machine learning, and combinatorial algorithms.
Where appropriate, manuscripts should describe applications to real-world data. However, pure algorithm papers are also welcome if future applications to biological data are to be expected, or if they address complexity or approximation issues of novel computational problems in molecular biology. Articles about novel software tools will be considered for publication if they contain some algorithmically interesting aspects.