Sequence-based prioritization of i-Motif candidates in the human genome.

Impact factor: 3.9; JCR Q2 (Mathematical & Computational Biology)

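Frontiers in Bioinformatics, volume 5, article 1657841. Publication date: 2025-08-12; eCollection date: 2025-01-01. DOI: 10.3389/fbinf.2025.1657841
Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12378704/pdf/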

Veronica Remori, Michela Prest, Mauro Fasano


Citations: 0

Abstract

Introduction: i-Motifs (iMs) are cytosine-rich, four-stranded DNA structures with emerging roles in gene regulation and genome stability. Despite their biological relevance, genome-wide prediction of iM-forming sequences remains limited by low specificity and high false-positive rates, leading to considerable experimental burden.

Method: To address this, we developed a refined computational approach that prioritizes high-confidence iM candidates using a Position-Specific Similarity Matrix (PSSM) derived from multiple sequence alignments. The human reference genome (hg38) was scanned using a custom regular expression targeting cytosine-rich motifs, followed by scoring each sequence with the PSSM. Statistical significance was assessed via permutation testing, one-sided t-tests, Benjamini-Hochberg correction, and Z-scores.
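The scan-then-score idea in the Method paragraph can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration, because the abstract does not give these details: the regular expression `IM_PATTERN`, the log-odds form of the matrix, the pseudocount and uniform background, and the mononucleotide-shuffle null model. The one-sided t-tests mentioned in the abstract are not shown, and this is not the authors' implementation.

```python
import math
import random
import re
import statistics

# Hypothetical pattern: four runs of >=3 cytosines separated by 1-12 nt loops.
# The paper's actual regular expression is not stated in the abstract.
IM_PATTERN = re.compile(r"C{3,}(?:[ACGT]{1,12}?C{3,}){3}")

def find_candidates(seq):
    """Return (start, end, match) for non-overlapping C-rich hits on one strand."""
    return [(m.start(), m.end(), m.group()) for m in IM_PATTERN.finditer(seq.upper())]

def build_pssm(aligned, pseudocount=1.0, background=0.25):
    """Log-odds matrix from equal-length, ungapped aligned sequences (assumed form)."""
    pssm = []
    for i in range(len(aligned[0])):
        column = [s[i] for s in aligned]
        total = len(column) + 4 * pseudocount
        pssm.append({b: math.log2(((column.count(b) + pseudocount) / total) / background)
                     for b in "ACGT"})
    return pssm

def score(seq, pssm):
    """Best ungapped window score; unknown bases such as N contribute 0."""
    width = len(pssm)
    if len(seq) < width:
        return float("-inf")
    return max(sum(pssm[j].get(seq[i + j], 0.0) for j in range(width))
               for i in range(len(seq) - width + 1))

def permutation_z(seq, pssm, n_perm=1000, seed=0):
    """Z-score and empirical one-sided p-value against shuffled copies of seq."""
    rng = random.Random(seed)
    observed = score(seq, pssm)
    letters = list(seq)
    null = []
    for _ in range(n_perm):
        rng.shuffle(letters)
        null.append(score("".join(letters), pssm))
    mu, sd = statistics.mean(null), statistics.pstdev(null)
    z = (observed - mu) / sd if sd > 0 else 0.0
    p = (1 + sum(s >= observed for s in null)) / (n_perm + 1)
    return observed, z, p

def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value to the smallest
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

In this sketch a candidate would be kept when its adjusted p-value falls below a chosen threshold and its Z-score is high; the thresholds themselves are left open because the abstract does not state them. A genome-wide scan would also need to cover the reverse strand, which the sketch omits.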

Results: This pipeline identified 37,075 candidate sequences (15-46 nucleotides) with strong iM-forming potential. Validation against experimentally confirmed iMs and known G-quadruplexes (G4s) showed significant differences in alignment scores and sequence similarity, supporting structural specificity. A random forest classifier trained on nucleotide features further supported the distinctiveness of the candidates, achieving high classification performance.
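The abstract does not list which nucleotide features were given to the random forest classifier. As a hedged illustration only, the sketch below computes one plausible feature set (length, mono- and dinucleotide fractions, and the longest cytosine run). The function name `nucleotide_features` and the choice of features are assumptions, not the authors' feature definition.

```python
from collections import Counter
from itertools import product

# All 16 dinucleotides, in a fixed order so feature vectors are comparable.
DINUCS = ["".join(p) for p in product("ACGT", repeat=2)]

def nucleotide_features(seq):
    """Return a dict of simple composition features for one DNA sequence."""
    seq = seq.upper()
    n = len(seq)
    mono = Counter(seq)
    di = Counter(seq[i:i + 2] for i in range(n - 1))

    feats = {"length": n}
    for base in "ACGT":
        feats[f"frac_{base}"] = mono[base] / n if n else 0.0
    for dinuc in DINUCS:
        feats[f"frac_{dinuc}"] = di[dinuc] / (n - 1) if n > 1 else 0.0

    # Longest uninterrupted run of cytosines, a rough proxy for C-tract length.
    longest = run = 0
    for base in seq:
        run = run + 1 if base == "C" else 0
        longest = max(longest, run)
    feats["longest_C_run"] = longest
    return feats
```

Feature dictionaries like these can be converted to rows of a numeric matrix and passed to any standard classifier. The fixed key order keeps columns aligned across sequences.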

Conclusion: This work presents a scalable and statistically robust method to enrich for biologically relevant iM sequences, providing a valuable resource for future experimental validation and the rational design of ligands targeting iMs to modulate gene expression in contexts such as cancer.

