Cross-correlation based detection of contigs overlaps

Robin Jugas, Martin Vítek, K. Sedlář, Helena Skutková
2018 41st International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO)
Published: 2018-05-21
DOI: 10.23919/MIPRO.2018.8400030 (https://doi.org/10.23919/MIPRO.2018.8400030)
Citations: 2

Abstract

Increasing demand for genomic data drives the development of new sequencing techniques and assembly methods. While sequencing techniques are the biologist's domain, genome assembly is a bioinformatics task, and new assembly algorithms are developed in response to new sequencing methods. The final part of the assembly process is merging the contigs and finding their positions in the genome. Contigs are almost the final product, but they can contain errors and features introduced by the preceding assembly process. Current methods use string algorithms based on dynamic programming that operate on characters (A, C, G, T) representing nucleotides; when applied to long sequences such as contigs, they tend to be time-consuming. We apply a different approach, based on genomic signal processing, to evaluate the overlaps between contigs and their potential for further merging. Representing a DNA sequence as a genomic signal can reveal hidden features of the sequence and allows digital signal processing methods to be applied. In addition, the computational complexity of the task can be reduced by massive downsampling. We use our own implementation of cross-correlation, based on the Pearson correlation coefficient, to detect possible overlaps between contigs: a high positive correlation indicates a possible shared region of the contigs and also gives the position of that region, without alignment.
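
The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not say which numeric representation of DNA is used, so the cumulated-phase mapping in `PHASE`, the plain decimation used for downsampling, the helper names (`to_signal`, `pearson`, `cross_correlate`), and the values of `step` and `min_overlap` are all assumptions made for illustration. The sketch slides one downsampled signal along the other and computes the Pearson coefficient of the overlapping samples at each lag; the lag with the highest positive correlation is taken as the candidate overlap position.

```python
import numpy as np

# Assumed nucleotide-to-phase mapping; the abstract does not name one.
PHASE = {"A": np.pi / 4, "C": -np.pi / 4, "G": 3 * np.pi / 4, "T": -3 * np.pi / 4}


def to_signal(seq, step=1):
    """Cumulated-phase signal of a DNA string, keeping every `step`-th sample."""
    phases = np.array([PHASE[base] for base in seq.upper()])
    return np.cumsum(phases)[::step]


def pearson(x, y):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    xc, yc = x - x.mean(), y - y.mean()
    denom = np.sqrt((xc ** 2).sum() * (yc ** 2).sum())
    return float((xc * yc).sum() / denom) if denom > 0 else 0.0


def cross_correlate(x, y, min_overlap):
    """Return (lags, r), where y[0] is aligned with x[lag] for each lag."""
    lags = np.arange(min_overlap - len(y), len(x) - min_overlap + 1)
    r = np.empty(len(lags))
    for i, lag in enumerate(lags):
        # Indices of x covered by y when y is shifted by `lag` samples.
        start, stop = max(lag, 0), min(lag + len(y), len(x))
        r[i] = pearson(x[start:stop], y[start - lag:stop - lag])
    return lags, r


# Synthetic example: two contigs cut from one random sequence,
# sharing 500 bases; contig_b starts at base 1500 of contig_a.
rng = np.random.default_rng(0)
genome = "".join(rng.choice(list("ACGT"), size=3000))
contig_a, contig_b = genome[:2000], genome[1500:]

step = 10  # downsampling factor (hypothetical value)
lags, r = cross_correlate(
    to_signal(contig_a, step), to_signal(contig_b, step), min_overlap=30
)
best = int(lags[np.argmax(r)])
print("estimated start of contig_b in contig_a:", best * step)
print("peak correlation:", float(r.max()))
```

Because the Pearson coefficient ignores constant offsets, two cumulated signals that share a region correlate strongly there even though their absolute values differ. With this simple decimation the estimated position is only resolved to within `step` bases, and the per-lag loop is written for clarity rather than speed.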