The Invariance of Spectral-Kolmogorov-Type Statistics for Estimating Genomic Similarity

Micah Thornton
2019 IEEE 49th International Symposium on Multiple-Valued Logic (ISMVL), published 2019-05-01. DOI: 10.1109/ISMVL.2019.00021 (https://doi.org/10.1109/ISMVL.2019.00021)

Abstract

Accurate and efficient comparison of genetic sequences is an important undertaking, with applications in medicine and in informing the hierarchical clustering of organisms. Genomic comparison matters both for full genomic sequences across individuals of the same or different species, as in phylogenies, and within organisms, as in pedigrees. Given the number of different genomes and their respective sizes, such comparisons are well known to be computationally intensive, which motivates the search for more efficient and accurate means of genomic comparison. This paper introduces a metric computed by comparing the empirical distributions of the observed k-mers among one or more genetic sequences. The metric is a Kolmogorov-Smirnov-like statistic, since it is the supremum of the differences between the empirical distribution functions. Specifically, genetic sequences are represented as quaternary (radix-4) encoded sequences, which allows the metric to be computed, and the metric is shown to produce similar clusterings when computed via spectral coefficients. Further, we investigate spectral methods, in particular the Walsh-Hadamard spectrum of the quaternary-encoded genetic sequence, and use the computed maximal spectral densities as a basis of comparison. The invariance of the Kolmogorov-Smirnov-like statistic when it is computed in the Walsh-Hadamard domain can enable faster comparison computations through the use of spectral properties. For example, the convolution of two sequences becomes a simple multiplication in the spectral domain.
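
The ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation, and several details are assumptions made only for illustration: the digit assignment A=0, C=1, G=2, T=3, the ordering of k-mers by their radix-4 integer value, and all function names. The first part computes a Kolmogorov-Smirnov-like distance between the empirical k-mer distributions of two sequences. The second part shows the standard Walsh-Hadamard property that dyadic (XOR-indexed) convolution becomes pointwise multiplication of spectra, which is one plausible reading of the abstract's final sentence.

```python
from collections import Counter

# Assumed radix-4 digit assignment; the abstract does not state the mapping.
RADIX4 = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_indices(seq, k):
    """Yield each k-mer of seq as an integer in [0, 4**k) (radix-4 value)."""
    digits = [RADIX4[base] for base in seq.upper()]  # KeyError on non-ACGT
    for i in range(len(digits) - k + 1):
        value = 0
        for d in digits[i:i + k]:
            value = value * 4 + d
        yield value


def empirical_cdf(seq, k):
    """Empirical distribution function over k-mer indices (length 4**k)."""
    counts = Counter(kmer_indices(seq, k))
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence is shorter than k")
    cdf, running = [], 0
    for index in range(4 ** k):
        running += counts.get(index, 0)
        cdf.append(running / total)
    return cdf


def ks_like_statistic(seq_a, seq_b, k):
    """Largest absolute difference between the two empirical CDFs."""
    cdf_a = empirical_cdf(seq_a, k)
    cdf_b = empirical_cdf(seq_b, k)
    return max(abs(a - b) for a, b in zip(cdf_a, cdf_b))


def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform (length: power of two)."""
    out = list(values)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for j in range(start, start + h):
                x, y = out[j], out[j + h]
                out[j], out[j + h] = x + y, x - y
        h *= 2
    return out


def dyadic_convolve(a, b):
    """Dyadic convolution: c[i] = sum over j of a[j] * b[i XOR j]."""
    n = len(a)
    return [sum(a[j] * b[i ^ j] for j in range(n)) for i in range(n)]


if __name__ == "__main__":
    print(ks_like_statistic("ACGTACGTAC", "ACGTTTTTAC", k=2))
    a, b = [1, 2, 3, 4], [0, 1, 0, 2]
    spectrum_product = [x * y for x, y in zip(fwht(a), fwht(b))]
    print(fwht(dyadic_convolve(a, b)) == spectrum_product)
```

Because 4**k is always a power of two, a length-4**k vector of k-mer counts can be passed to `fwht` directly. How the paper derives its statistic from the spectral coefficients is not specified in the abstract, so that step is left out of the sketch.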