GMFOLD: Subgraph matching for high-throughput DNA-aptamer secondary structure classification and machine learning interpretability

IF 1.8, Zone 4 Mathematics, JCR Q2 BIOLOGY

Mathematical Biosciences Pub Date : 2025-06-27 DOI:10.1016/j.mbs.2025.109485

Paolo Climaco , Noelle M. Mitchell , Matthew J. Tyler , Kyungae Yang , Anne M. Andrews , Andrea L. Bertozzi

Mathematical Biosciences, volume 387, Article 109485. https://www.sciencedirect.com/science/article/pii/S0025556425001117

Citations: 0

Abstract

Aptamers are oligonucleotide receptors that bind to their targets with high affinity. Here, we consider aptamers composed of single-stranded DNA that undergo target-binding-induced conformational changes, giving rise to unique secondary and tertiary structures. Given a specific aptamer primary sequence, well-established computational tools (notably mfold) predict the secondary structure via free-energy minimization algorithms. While mfold generates secondary structures for individual sequences, there is a need for a high-throughput process whereby thousands of DNA structures can be predicted in real time for use in an interactive setting, particularly alongside aptamer selections that generate candidate pools too large to be interrogated experimentally. We developed a new Python code, GMfold, for high-throughput aptamer secondary structure determination. GMfold uses subgraph matching methods to group aptamer candidates by secondary structure similarities. We also improve an open-source code, SeqFold, to incorporate subgraph matching concepts. We represent each secondary structure as a lowest-energy bipartite subgraph matching of the DNA graph to itself. These new tools enable thousands of DNA sequences to be compared based on their secondary structures using machine-learning algorithms. This process is advantageous when analyzing sequences that arise from aptamer selections via systematic evolution of ligands by exponential enrichment (SELEX). This work is a building block for future machine-learning-informed DNA-aptamer selection processes to identify aptamers with improved target affinity and selectivity and to advance aptamer biosensors and therapeutics.
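To make the idea of a secondary structure as a matching on sequence positions more concrete, the following is a minimal Python sketch. It is not GMfold, SeqFold, or mfold: instead of a free-energy model it uses a simple Nussinov-style dynamic program that maximizes the number of Watson-Crick base pairs, and it groups sequences by a deliberately coarse signature (the number of hairpin loops). All function names (`fold`, `dot_bracket`, `hairpin_count`, `group_by_shape`) and the minimum loop length are assumptions made for illustration only.

```python
from collections import defaultdict

# Watson-Crick complements for DNA (assumption: no mismatch or wobble pairs).
PAIRS = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G")}
MIN_LOOP = 3  # assumed minimum number of unpaired bases inside a hairpin loop


def fold(seq):
    """Return a non-crossing matching, as a set of (i, j) index pairs,
    that maximizes the number of base pairs (a stand-in for minimum energy)."""
    n = len(seq)
    # best[i][j] holds the maximum number of pairs within seq[i..j].
    best = [[0] * n for _ in range(n)]
    for span in range(MIN_LOOP + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i + 1][j]  # case: position i stays unpaired
            for k in range(i + MIN_LOOP + 1, j + 1):
                if (seq[i], seq[k]) in PAIRS:  # case: i pairs with k
                    right = best[k + 1][j] if k < j else 0
                    score = max(score, 1 + best[i + 1][k - 1] + right)
            best[i][j] = score

    # Trace back through the table to recover one optimal matching.
    pairs, stack = set(), [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= MIN_LOOP:
            continue  # too short to contain any pair
        if best[i][j] == best[i + 1][j]:
            stack.append((i + 1, j))
            continue
        for k in range(i + MIN_LOOP + 1, j + 1):
            if (seq[i], seq[k]) in PAIRS:
                right = best[k + 1][j] if k < j else 0
                if best[i][j] == 1 + best[i + 1][k - 1] + right:
                    pairs.add((i, k))
                    stack.append((i + 1, k - 1))
                    if k < j:
                        stack.append((k + 1, j))
                    break
    return pairs


def dot_bracket(n, pairs):
    """Render a matching on n positions in dot-bracket notation."""
    chars = ["."] * n
    for i, j in pairs:
        chars[i], chars[j] = "(", ")"
    return "".join(chars)


def hairpin_count(pairs):
    """Count pairs that enclose only unpaired bases (hairpin-closing pairs)."""
    paired = {p for pair in pairs for p in pair}
    return sum(
        1 for i, j in pairs
        if not any(k in paired for k in range(i + 1, j))
    )


def group_by_shape(seqs):
    """Group sequences by hairpin count, a coarse structural signature."""
    groups = defaultdict(list)
    for s in seqs:
        groups[hairpin_count(fold(s))].append(s)
    return dict(groups)
```

For example, `dot_bracket(12, fold("GGGGAAAACCCC"))` gives `((((....))))`, a single stem closing a four-base hairpin loop. A realistic pipeline would replace the pair-count objective with a nearest-neighbor free-energy model and the hairpin count with a richer structural comparison, such as the subgraph matching the abstract describes.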


Source journal: Mathematical Biosciences (Biology)

CiteScore: 7.50

Self-citation rate: 2.30%

Review time: 18 days

About the journal: Mathematical Biosciences publishes work providing new concepts or new understanding of biological systems using mathematical models, or methodological articles likely to find application to multiple biological systems. Papers are expected to present a major research finding of broad significance for the biological sciences or mathematical biology. Mathematical Biosciences welcomes original research articles, letters, reviews and perspectives.