Identification of SNPs in genomes using GRAMEP, an alignment-free method based on the Principle of Maximum Entropy

arXiv - QuanBio - Genomics, Pub Date: 2024-05-02, DOI: arxiv-2405.01715

Matheus Henrique Pimenta-Zanon, André Yoshiaki Kashiwabara, André Luís Laforga Vanzela, Fabricio Martins Lopes



Abstract

Advances in high-throughput sequencing technologies provide a large number of genomes to be analyzed, so computational methodologies play a crucial role in analyzing the generated data and extracting knowledge from it. Investigating genomic mutations is critical because of their impact on chromosomal evolution, genetic disorders, and diseases. Sequence alignment is commonly used to analyze genomic variation; however, this approach can be computationally expensive and potentially arbitrary on large datasets. Here, we present a novel method for identifying single nucleotide polymorphisms (SNPs) in DNA sequences from assembled genomes. This method uses the principle of maximum entropy to select the most informative k-mers specific to the variant under investigation. This informative k-mer set enables the detection of variant-specific mutations in comparison to a reference sequence. In addition, our method can classify novel sequences without organism-specific information. GRAMEP demonstrated high accuracy in both in silico simulations and analyses of real viral genomes, including Dengue, HIV, and SARS-CoV-2. Our approach maintained accurate SARS-CoV-2 variant identification at a lower computational cost than the gold-standard statistical tools. The source code for this proof-of-concept implementation is freely available at https://github.com/omatheuspimenta/GRAMEP.
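The abstract does not spell out how the principle of maximum entropy selects k-mers, so the following is only an illustrative sketch under stated assumptions, not the GRAMEP implementation. It assumes a Kapur-style maximum-entropy threshold over the histogram of k-mer frequencies, and then keeps the frequent k-mers that are absent from the reference. All function names (`count_kmers`, `max_entropy_threshold`, `informative_kmers`) are hypothetical.

```python
from collections import Counter
from math import log


def count_kmers(sequences, k):
    """Count every overlapping k-mer across a list of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def _entropy(probs):
    """Shannon entropy (in nats) of one partition, after renormalizing it."""
    weight = sum(probs)
    return -sum((p / weight) * log(p / weight) for p in probs if p > 0)


def max_entropy_threshold(frequencies):
    """Assumed Kapur-style rule: pick the frequency value that splits the
    histogram into two parts whose entropies sum to a maximum.
    Expects a non-empty list of frequencies."""
    hist = Counter(frequencies)
    values = sorted(hist)
    total = sum(hist.values())
    probs = [hist[v] / total for v in values]
    best_threshold, best_score = values[0], float("-inf")
    for cut in range(1, len(values)):
        score = _entropy(probs[:cut]) + _entropy(probs[cut:])
        if score > best_score:
            best_threshold, best_score = values[cut], score
    return best_threshold


def informative_kmers(variant_sequences, reference, k):
    """Keep k-mers at or above the threshold that never occur in the reference."""
    counts = count_kmers(variant_sequences, k)
    if not counts:  # every sequence was shorter than k
        return set()
    threshold = max_entropy_threshold(list(counts.values()))
    reference_kmers = set(count_kmers([reference], k))
    return {
        kmer
        for kmer, n in counts.items()
        if n >= threshold and kmer not in reference_kmers
    }


# Toy example: the variant differs from the reference at its last base.
print(informative_kmers(["ACGTT"], "ACGTA", 3))  # {'GTT'}
```

In this toy case every variant k-mer occurs once, so the threshold keeps all of them, and only `GTT` is missing from the reference. Locating the exact SNP position from such k-mers would need an additional mapping step that this sketch does not attempt.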

