Identification of SNPs in genomes using GRAMEP, an alignment-free method based on the Principle of Maximum Entropy
Matheus Henrique Pimenta-Zanon, André Yoshiaki Kashiwabara, André Luís Laforga Vanzela, Fabricio Martins Lopes
arXiv - QuanBio - Genomics, published 2024-05-02, identifier arxiv-2405.01715
Abstract
Advances in high-throughput sequencing technologies provide a large number of
genomes to be analyzed, so computational methodologies play a crucial role in
analyzing and extracting knowledge from the data generated. Investigating
genomic mutations is critical because of their impact on chromosomal evolution,
genetic disorders, and diseases. Sequence alignment is commonly used to
analyze genomic variations; however, this approach can be computationally
expensive and potentially arbitrary in scenarios with large datasets. Here, we
present a novel method for identifying single nucleotide polymorphisms (SNPs)
in DNA sequences from assembled genomes. This method uses the principle of
maximum entropy to select the most informative k-mers specific to the variant
under investigation. The use of this informative k-mer set enables the
detection of variant-specific mutations in comparison to a reference sequence.
In addition, our method offers the possibility of classifying novel sequences
with no need for organism-specific information. GRAMEP demonstrated high
accuracy in both in silico simulations and analyses of real viral genomes,
including Dengue, HIV, and SARS-CoV-2. Our approach maintained accurate
SARS-CoV-2 variant identification while demonstrating a lower computational
cost than the gold-standard statistical tools. The source code for this
proof-of-concept implementation is freely available at
https://github.com/omatheuspimenta/GRAMEP.
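
The idea of comparing variant k-mers against a reference without alignment can be illustrated with a minimal Python sketch. This is not the GRAMEP implementation: the abstract does not describe the maximum-entropy selection procedure, so that step is replaced here by a plain presence cutoff (`min_fraction`), and the function names, parameters, and toy sequences are all hypothetical. The sketch only shows how a single substitution produces k-mers that occur in the variant sequences but never in the reference.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping k-mer of seq, skipping k-mers with non-ACGT symbols."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= {"A", "C", "G", "T"}:
            yield kmer


def variant_exclusive_kmers(reference, variant_seqs, k, min_fraction=0.9):
    """Return k-mers found in at least min_fraction of the variant sequences
    and absent from the reference.

    Assumption: the presence cutoff stands in for the entropy-based
    selection of informative k-mers, which is not reproduced here.
    """
    ref_kmers = set(kmers(reference, k))
    presence = Counter()
    for seq in variant_seqs:
        # Count each k-mer at most once per sequence.
        presence.update(set(kmers(seq, k)))
    cutoff = min_fraction * len(variant_seqs)
    return {
        kmer
        for kmer, n in presence.items()
        if n >= cutoff and kmer not in ref_kmers
    }


# Toy data: the variant differs from the reference by one substitution (A -> T).
reference = "ACGTACGTAC"
variants = ["ACGTTCGTAC", "ACGTTCGTAC", "ACGTTCGTAC"]
print(sorted(variant_exclusive_kmers(reference, variants, k=3)))
# prints ['GTT', 'TCG', 'TTC']
```

In this toy case the three reported k-mers are exactly the ones that overlap the substituted position, which is why a set of variant-exclusive k-mers can point to a mutation site without aligning the sequences first.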