Sequence-based prioritization of i-Motif candidates in the human genome.
Veronica Remori, Michela Prest, Mauro Fasano
Frontiers in Bioinformatics, vol. 5, article 1657841. Published 2025-08-12. DOI: 10.3389/fbinf.2025.1657841
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12378704/pdf/
Introduction: i-Motifs (iMs) are cytosine-rich, four-stranded DNA structures with emerging roles in gene regulation and genome stability. Despite their biological relevance, genome-wide prediction of iM-forming sequences remains limited by low specificity and high false-positive rates, leading to considerable experimental burden.
Method: To address this, we developed a refined computational approach that prioritizes high-confidence iM candidates using a Position-Specific Similarity Matrix (PSSM) derived from multiple sequence alignments. The human reference genome (hg38) was scanned using a custom regular expression targeting cytosine-rich motifs, followed by scoring each sequence with the PSSM. Statistical significance was assessed via permutation testing, one-sided t-tests, Benjamini-Hochberg correction, and Z-scores.
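The scan, score, and test steps described in the Method paragraph can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' pipeline: the abstract does not give the actual regular expression, the alignment used to build the PSSM, or the permutation scheme, so `IM_PATTERN`, the pseudocount, the uniform background, and the shuffle-based null are all hypothetical choices.

```python
import math
import random
import re
from statistics import mean, pstdev

# Hypothetical pattern (not from the paper): four C-tracts of >= 3 cytosines
# separated by loops of 1-12 nt. The lookahead lets overlapping hits be found.
IM_PATTERN = re.compile(r"(?=(C{3,}(?:[ACGT]{1,12}?C{3,}){3}))")

def scan(sequence):
    """Yield (start, motif) for each putative C-rich motif in the sequence."""
    for m in IM_PATTERN.finditer(sequence.upper()):
        yield m.start(), m.group(1)

def build_pssm(aligned, pseudocount=1.0, background=0.25):
    """Build per-column log2-odds scores from equal-length aligned sequences."""
    pssm = []
    for col in range(len(aligned[0])):
        counts = {b: pseudocount for b in "ACGT"}
        for seq in aligned:
            if seq[col] in counts:  # gap characters are skipped
                counts[seq[col]] += 1
        total = sum(counts.values())
        pssm.append({b: math.log2(counts[b] / total / background) for b in "ACGT"})
    return pssm

def score(seq, pssm):
    """Best ungapped log-odds score of seq against the PSSM over all offsets."""
    width = len(pssm)
    if len(seq) < width:
        return float("-inf")
    return max(
        sum(pssm[i][seq[off + i]] for i in range(width))
        for off in range(len(seq) - width + 1)
    )

def permutation_z(seq, pssm, n_perm=200, rng=None):
    """Z-score and empirical one-sided p-value against shuffled copies of seq."""
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is reproducible
    observed = score(seq, pssm)
    letters = list(seq)
    null = []
    for _ in range(n_perm):
        rng.shuffle(letters)
        null.append(score("".join(letters), pssm))
    sd = pstdev(null)
    z = (observed - mean(null)) / sd if sd > 0 else 0.0
    p = (1 + sum(s >= observed for s in null)) / (n_perm + 1)
    return z, p

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted
```

In use, each hit from `scan` would be passed to `permutation_z`, and the resulting p-values collected and adjusted with `benjamini_hochberg` before applying a significance threshold. Shuffling preserves each candidate's base composition, so the null here tests positional arrangement rather than cytosine content; whether the paper's permutation test does the same is not stated in the abstract.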
Results: This pipeline identified 37,075 candidate sequences (15-46 nucleotides) with strong iM-forming potential. Validation against experimentally confirmed iMs and known G-quadruplexes (G4s) demonstrated significant differences in alignment scores and sequence similarity, confirming structural specificity. A random forest classifier trained on nucleotide features further supported the distinctiveness of the candidates, achieving high classification performance.
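To make "nucleotide features" concrete, the sketch below computes a few simple per-sequence descriptors of the kind a classifier could be trained on. The abstract does not list the features actually used, so the feature names and the choice of base fractions, length, and C-tract statistics are assumptions made for illustration only.

```python
import re
from collections import Counter

def nucleotide_features(seq):
    """Return simple composition and C-tract descriptors for one sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must be non-empty")
    n = len(seq)
    counts = Counter(seq)
    # Runs of two or more consecutive cytosines (hypothetical tract definition)
    tracts = [len(t) for t in re.findall(r"C{2,}", seq)]
    features = {f"frac_{b}": counts[b] / n for b in "ACGT"}
    features["length"] = n
    features["n_c_tracts"] = len(tracts)
    features["longest_c_tract"] = max(tracts, default=0)
    return features
```

A table of such dictionaries, one row per candidate or control sequence, could then be passed to a random forest implementation such as scikit-learn's `RandomForestClassifier`.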
Conclusion: This work presents a scalable and statistically robust method to enrich for biologically relevant iM sequences, providing a valuable resource for future experimental validation and the rational design of ligands targeting iMs to modulate gene expression in contexts such as cancer.