Kabelo P Mokgopa, Shina D Oloniiju, Kevin A Lobb, Tendamudzimu Tshiwawa
{"title":"Benchmarking the Base Randomization Algorithm as a Possible Tool for the Initial Step of Generating a Virtual RNA Aptamers Library.","authors":"Kabelo P Mokgopa, Shina D Oloniiju, Kevin A Lobb, Tendamudzimu Tshiwawa","doi":"10.3390/biotech14030072","DOIUrl":null,"url":null,"abstract":"<p><p>While databases are emerging across various domains, from small molecules to genomics and proteins, aptamer databases remain scarce, if not entirely absent. Such databases could serve as a comprehensive resource for advancing research, innovation, and the applications of aptamer technology across multiple fields. This advancement would likely lead to improvements in healthcare, environmental monitoring, and biotechnology. Furthermore, the establishment of aptamer databases would facilitate molecular modelling and machine learning, opening doors to further advancements in understanding and utilizing aptamers. Against this backdrop, in this study, we present and benchmark the Base Randomization Algorithm (BRA) as a potential solution to the scarcity of aptamer databases. Through statistical analysis, we examine key factors such as minimum free energy (MFE), base compositions, and base arrangements. Notably, sequences generated using the BRA exhibit a Gaussian distribution pattern. We also examine the details of how each base within a sequence is chosen using mathematical principles, ensuring that the sequences are valid and optimized statistically. Additionally, we explore how the length of the randomized generated sequences can affect the folding of their structures at both the secondary and tertiary levels. Based on composition analysis, we propose that the base mean of the dataset can be approximated as x¯B≈Px × N, for dataset of sequences with the same length and x¯B≈Px × M, where M is the median and N the mean, for a dataset with randomized length that follows a Gaussian distribution.</p>","PeriodicalId":34490,"journal":{"name":"BioTech","volume":"14 3","pages":""},"PeriodicalIF":3.1000,"publicationDate":"2025-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12452754/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"BioTech","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3390/biotech14030072","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"BIOTECHNOLOGY & APPLIED MICROBIOLOGY","Score":null,"Total":0}
引用次数: 0
Abstract
While databases are emerging across various domains, from small molecules to genomics and proteins, aptamer databases remain scarce, if not entirely absent. Such databases could serve as a comprehensive resource for advancing research, innovation, and the applications of aptamer technology across multiple fields. This advancement would likely lead to improvements in healthcare, environmental monitoring, and biotechnology. Furthermore, the establishment of aptamer databases would facilitate molecular modelling and machine learning, opening doors to further advancements in understanding and utilizing aptamers. Against this backdrop, in this study, we present and benchmark the Base Randomization Algorithm (BRA) as a potential solution to the scarcity of aptamer databases. Through statistical analysis, we examine key factors such as minimum free energy (MFE), base compositions, and base arrangements. Notably, sequences generated using the BRA exhibit a Gaussian distribution pattern. We also examine the details of how each base within a sequence is chosen using mathematical principles, ensuring that the sequences are valid and optimized statistically. Additionally, we explore how the length of the randomized generated sequences can affect the folding of their structures at both the secondary and tertiary levels. Based on composition analysis, we propose that the base mean of the dataset can be approximated as x¯B≈Px × N, for dataset of sequences with the same length and x¯B≈Px × M, where M is the median and N the mean, for a dataset with randomized length that follows a Gaussian distribution.