Linear time minimum segmentation enables scalable founder reconstruction.

IF 1.7 · Zone 4 (Biology) · JCR Q4 (Biochemical Research Methods)

Algorithms for Molecular Biology · Pub Date: 2019-05-17 · eCollection Date: 2019-01-01 · DOI: 10.1186/s13015-019-0147-6

Tuukka Norri, Bastien Cazaux, Dmitry Kosolobov, Veli Mäkinen


Citations: 13

Abstract


Background: We study a preprocessing routine relevant in pan-genomic analyses: consider a set of aligned haplotype sequences of complete human chromosomes. Due to the enormous size of such data, one would like to represent this input set with a few founder sequences that retain as well as possible the contiguities of the original sequences. Such a smaller set gives a scalable way to exploit pan-genomic information in further analyses (e.g. read alignment and variant calling). Optimizing the founder set is an NP-hard problem, but there is a segmentation formulation that can be solved in polynomial time, defined as follows. Given a threshold $L$ and a set $R = \{R_{1}, \dots, R_{m}\}$ of $m$ strings (haplotype sequences), each having length $n$, the minimum segmentation problem for founder reconstruction is to partition $[1, n]$ into a set $P$ of disjoint segments such that each segment $[a, b] \in P$ has length at least $L$ and the maximum, over $[a, b] \in P$, of the number $d(a, b) = |\{R_{i}[a, b] : 1 \leq i \leq m\}|$ of distinct substrings at segment $[a, b]$ is minimized. The distinct substrings in the segments represent founder blocks that can be concatenated to form $\max\{d(a, b) : [a, b] \in P\}$ founder sequences representing the original $R$ such that crossovers happen only at segment boundaries.
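To make the definitions concrete, the following minimal Python sketch computes $d(a, b)$ for a given segmentation and concatenates the distinct blocks into founder sequences. The function names `distinct_count` and `build_founders`, the toy input, and the cyclic reuse of blocks in segments that have fewer than the maximum number of distinct substrings are all assumptions made for illustration; they are not taken from the authors' implementation.

```python
def distinct_count(R, a, b):
    """d(a, b): number of distinct substrings R_i[a..b] (1-based, inclusive)."""
    return len({r[a - 1:b] for r in R})


def build_founders(R, P):
    """Concatenate the distinct blocks of each segment into founder sequences.

    P is a list of 1-based inclusive segments (a, b) covering [1, n] in order.
    A segment with fewer distinct blocks than the maximum reuses its blocks
    cyclically; this pairing rule is an arbitrary choice for the illustration.
    """
    blocks = [sorted({r[a - 1:b] for r in R}) for a, b in P]
    k = max(len(bs) for bs in blocks)  # max{d(a, b) : [a, b] in P}
    return ["".join(bs[j % len(bs)] for bs in blocks) for j in range(k)]


# Hypothetical toy input: m = 4 aligned haplotypes of length n = 4.
R = ["AAGT", "AAGC", "TTGT", "TTGC"]
P = [(1, 2), (3, 4)]  # a segmentation valid for threshold L = 2

print([distinct_count(R, a, b) for a, b in P])  # [2, 2]
print(build_founders(R, P))                     # ['AAGC', 'TTGT']
```

Here the unsegmented input has four distinct rows, while the segmentation above needs only two founders: every input row can be spelled by following one founder in the first segment and possibly switching to the other at the segment boundary.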

Results: We give an $O(mn)$-time (i.e. linear time in the input size) algorithm to solve the minimum segmentation problem for founder reconstruction, improving on an earlier $O(mn^{2})$ solution.
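For orientation, the problem can be solved by a straightforward dynamic program over segment end positions. The sketch below is a naive baseline written under that assumption; it is not the $O(mn)$ algorithm of the paper, and it is also slower than the earlier $O(mn^{2})$ bound, because it recomputes the set of distinct substrings for each of the roughly $n^{2}$ candidate segments. The function name `min_segmentation` and the toy input are hypothetical.

```python
import math


def min_segmentation(R, L):
    """Minimize max d(a, b) over segmentations whose segments have length >= L.

    Returns (best_max, segments) with 1-based inclusive segments, or None if
    no valid segmentation exists (n < L). Naive baseline, not the O(mn) method.
    """
    n = len(R[0])
    # s[j]: optimal value for the prefix [1, j]; s[0] = 0 is the empty prefix.
    s = [math.inf] * (n + 1)
    prev = [-1] * (n + 1)
    s[0] = 0
    for j in range(L, n + 1):
        for i in range(0, j - L + 1):       # last segment is [i + 1, j]
            if s[i] == math.inf:
                continue                    # prefix [1, i] cannot be segmented
            d = len({r[i:j] for r in R})    # d(i + 1, j)
            cost = max(s[i], d)
            if cost < s[j]:
                s[j], prev[j] = cost, i
    if s[n] == math.inf:
        return None
    # Walk the back-pointers to recover the segments.
    segments, j = [], n
    while j > 0:
        segments.append((prev[j] + 1, j))
        j = prev[j]
    return s[n], segments[::-1]


R = ["AAGT", "AAGC", "TTGT", "TTGC"]
print(min_segmentation(R, 2))  # (2, [(1, 2), (3, 4)])
print(min_segmentation(R, 3))  # (4, [(1, 4)])
```

Raising the threshold from 2 to 3 forces a single segment on this input, so the number of founders grows from 2 to 4.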

Conclusions: Our improvement makes it possible to apply the formulation to an input of thousands of complete human chromosomes. We implemented the new algorithm and give experimental evidence of its practicality. The implementation is available at https://github.com/tsnorri/founder-sequences.


Source journal

Algorithms for Molecular Biology (Biology: Biochemical Research Methods)

CiteScore: 2.40

Self-citation rate: 10.00%

Articles published: (not listed)

Review time: >12 weeks

About the journal: Algorithms for Molecular Biology publishes articles on novel algorithms for biological sequence and structure analysis, phylogeny reconstruction, and combinatorial algorithms and machine learning. Areas of interest include but are not limited to: algorithms for RNA and protein structure analysis, gene prediction and genome analysis, comparative sequence analysis and alignment, phylogeny, gene expression, machine learning, and combinatorial algorithms. Where appropriate, manuscripts should describe applications to real-world data. However, pure algorithm papers are also welcome if future applications to biological data are to be expected, or if they address complexity or approximation issues of novel computational problems in molecular biology. Articles about novel software tools will be considered for publication if they contain some algorithmically interesting aspects.