Benchmarking the Base Randomization Algorithm as a Possible Tool for the Initial Step of Generating a Virtual RNA Aptamers Library.

Impact factor: 3.1 | JCR Q3 | Biotechnology & Applied Microbiology

BioTech 14(3) | Pub Date: 2025-09-12 | DOI: 10.3390/biotech14030072

Kabelo P Mokgopa, Shina D Oloniiju, Kevin A Lobb, Tendamudzimu Tshiwawa



Abstract

While databases are emerging across various domains, from small molecules to genomics and proteins, aptamer databases remain scarce, if not entirely absent. Such databases could serve as a comprehensive resource for advancing research, innovation, and applications of aptamer technology across multiple fields, likely leading to improvements in healthcare, environmental monitoring, and biotechnology. Furthermore, establishing aptamer databases would facilitate molecular modelling and machine learning, opening the door to further advances in understanding and using aptamers. Against this backdrop, we present and benchmark the Base Randomization Algorithm (BRA) as a potential solution to the scarcity of aptamer databases. Through statistical analysis, we examine key factors such as minimum free energy (MFE), base composition, and base arrangement. Notably, sequences generated using the BRA exhibit a Gaussian distribution pattern. We also examine in detail how each base within a sequence is chosen using mathematical principles, ensuring that the sequences are valid and statistically optimized. Additionally, we explore how the length of the randomly generated sequences can affect the folding of their structures at both the secondary and tertiary levels. Based on composition analysis, we propose that the base mean of the dataset can be approximated as $\bar{x}_B \approx P_x \times N$ for a dataset of sequences with the same length, and as $\bar{x}_B \approx P_x \times M$ for a dataset with randomized lengths that follow a Gaussian distribution, where N is the mean and M is the median.
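
The composition approximation stated at the end of the abstract can be checked numerically with a minimal sketch. The abstract does not specify how the BRA selects bases, so the code below assumes the simplest case: each position is drawn independently with fixed per-base probabilities. Under that assumption the expected count of a base in a sequence of length N is its probability times N. The function names, the uniform probabilities, the sample size, and the length parameters (mean 40, standard deviation 5) are all illustrative choices, not values from the paper.

```python
import random
import statistics

BASES = "ACGU"

def random_rna(length, probs, rng):
    """Build one sequence by drawing each base independently.

    Independent per-position sampling is an assumption made for this
    sketch; it is not taken from the paper's description of the BRA.
    """
    weights = [probs[b] for b in BASES]
    return "".join(rng.choices(BASES, weights=weights, k=length))

def mean_base_count(seqs, base):
    """Average number of occurrences of `base` per sequence."""
    return statistics.fmean(s.count(base) for s in seqs)

rng = random.Random(0)  # fixed seed so the run is reproducible
probs = {"A": 0.25, "C": 0.25, "G": 0.25, "U": 0.25}  # illustrative values

# Case 1: every sequence has the same length N.
N = 40
fixed = [random_rna(N, probs, rng) for _ in range(5000)]

# Case 2: lengths drawn from a Gaussian, rounded to whole nucleotides.
lengths = [max(1, round(rng.gauss(40, 5))) for _ in range(5000)]
varied = [random_rna(n, probs, rng) for n in lengths]
M = statistics.median(lengths)

for base in BASES:
    print(
        base,
        "fixed:", round(mean_base_count(fixed, base), 3),
        "vs P*N =", probs[base] * N,
        "| varied:", round(mean_base_count(varied, base), 3),
        "vs P*M =", probs[base] * M,
    )
```

For a symmetric length distribution the mean and median nearly coincide, which is why the median works as a stand-in for the length in the second case. A skewed length distribution would not be expected to behave the same way.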


