Identification of SNPs in genomes using GRAMEP, an alignment-free method based on the Principle of Maximum Entropy

Matheus Henrique Pimenta-Zanon, André Yoshiaki Kashiwabara, André Luís Laforga Vanzela, Fabricio Martins Lopes
arXiv - QuanBio - Genomics. Published 2024-05-02. DOI: arxiv-2405.01715 (https://doi.org/arxiv-2405.01715)

Abstract

Advances in high-throughput sequencing technologies provide a large number of genomes to be analyzed, so computational methodologies play a crucial role in analyzing and extracting knowledge from the data generated. Investigating genomic mutations is critical because of their impact on chromosomal evolution, genetic disorders, and diseases. Sequence alignment is commonly used to analyze genomic variations; however, this approach can be computationally expensive and potentially arbitrary in scenarios with large datasets. Here, we present a novel method for identifying single nucleotide polymorphisms (SNPs) in DNA sequences from assembled genomes. This method uses the principle of maximum entropy to select the most informative k-mers specific to the variant under investigation. This informative k-mer set enables the detection of variant-specific mutations in comparison to a reference sequence. In addition, our method can classify novel sequences without organism-specific information. GRAMEP demonstrated high accuracy in both in silico simulations and analyses of real viral genomes, including Dengue, HIV, and SARS-CoV-2. Our approach maintained accurate SARS-CoV-2 variant identification at a lower computational cost than the gold-standard statistical tools. The source code for this proof-of-concept implementation is freely available at https://github.com/omatheuspimenta/GRAMEP.
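As a rough illustration of the idea described in the abstract, the Python sketch below counts k-mers in a set of variant sequences, discards those that also occur in a reference, and keeps the most frequent of the remainder using a maximum-entropy cut on the count histogram. It is not the GRAMEP implementation. The function names (count_kmers, max_entropy_threshold, informative_kmers), the Kapur-style thresholding rule, and the choice to drop k-mers present in the reference are all assumptions made for illustration only.

```python
from collections import Counter
from math import log


def count_kmers(sequences, k):
    """Count every overlapping k-mer across a list of DNA strings."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def max_entropy_threshold(values):
    """Kapur-style cut (an assumed stand-in for the paper's selection rule).

    Splits the distinct values into a low and a high group and returns the
    smallest value of the high group for the split that maximizes the sum
    of the two groups' Shannon entropies.
    """
    hist = Counter(values)
    levels = sorted(hist)
    total = sum(hist.values())
    probs = [hist[v] / total for v in levels]
    best_t, best_h = levels[0], float("-inf")
    for cut in range(1, len(levels)):
        low, high = probs[:cut], probs[cut:]
        p_low, p_high = sum(low), sum(high)
        h_low = -sum(p / p_low * log(p / p_low) for p in low)
        h_high = -sum(p / p_high * log(p / p_high) for p in high)
        if h_low + h_high > best_h:
            best_h, best_t = h_low + h_high, levels[cut]
    return best_t


def informative_kmers(variant_seqs, reference, k):
    """Return variant k-mers absent from the reference with high counts."""
    variant_counts = count_kmers(variant_seqs, k)
    ref_kmers = set(count_kmers([reference], k))
    # Hypothetical filter: keep only k-mers never seen in the reference.
    candidates = {m: c for m, c in variant_counts.items() if m not in ref_kmers}
    if not candidates:
        return set()
    threshold = max_entropy_threshold(list(candidates.values()))
    return {m for m, c in candidates.items() if c >= threshold}


if __name__ == "__main__":
    reference = "ACGTACGT"
    # Three variant genomes carry a T->A substitution at position 4.
    variants = ["ACGAACGT"] * 3 + ["ACGTACGT"]
    print(sorted(informative_kmers(variants, reference, k=3)))
```

Every k-mer returned here overlaps the substituted position, which is how a variant-specific k-mer set can point to a candidate SNP without aligning the sequences. A real tool would also need to handle reverse complements, ambiguous bases, and the mapping from k-mers back to genome coordinates, none of which this sketch attempts.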