Utilizing biological experimental data and molecular dynamics for the classification of mutational hotspots through machine learning.

Impact factor: 2.4 | JCR Q2 | Mathematical & Computational Biology

Bioinformatics Advances | Publication date: 2024-08-26 | eCollection date: 2024-01-01 | DOI: 10.1093/bioadv/vbae125

James G Davies, Georgina E Menzies

Volume 4, issue 1, article vbae125. Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11377099/pdf/

Citations: 0

Abstract

Motivation: Benzo[a]pyrene, a notorious DNA-damaging carcinogen, belongs to the family of polycyclic aromatic hydrocarbons commonly found in tobacco smoke. Surprisingly, nucleotide excision repair (NER) machinery exhibits inefficiency in recognizing specific bulky DNA adducts including Benzo[a]pyrene Diol-Epoxide (BPDE), a Benzo[a]pyrene metabolite. While sequence context is emerging as the leading factor linking the inadequate NER response to BPDE adducts, the precise structural attributes governing these disparities remain inadequately understood. We therefore combined the domains of molecular dynamics and machine learning to conduct a comprehensive assessment of helical distortion caused by BPDE-Guanine adducts in multiple gene contexts. Specifically, we implemented a dual approach involving a random forest classification-based analysis and subsequent feature selection to identify precise topological features that may distinguish adduct sites of variable repair capacity. Our models were trained using helical data extracted from duplexes representing both BPDE hotspot and nonhotspot sites within the TP53 gene, then applied to sites within TP53, cII, and lacZ genes.
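The classify-then-rank workflow described above can be illustrated with a minimal sketch. This is not the authors' code (which is linked under Availability and implementation); it assumes scikit-learn and NumPy are installed, uses synthetic data in place of Curves+/Canal output, and the feature names, class shift, and hyperparameters are all hypothetical choices made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score
from sklearn.model_selection import train_test_split

# Hypothetical feature names: stand-ins for helical descriptors, not the
# feature set used in the paper.
FEATURE_NAMES = ["twist", "roll", "tilt", "rise", "shift", "slide"]


def make_synthetic_frames(n_per_class=300, seed=0):
    """Return (X, y): synthetic 'frames' in which only `twist` differs by class."""
    rng = np.random.default_rng(seed)
    n = 2 * n_per_class
    X = rng.normal(size=(n, len(FEATURE_NAMES)))
    y = np.repeat([0, 1], n_per_class)  # 0 = nonhotspot, 1 = hotspot
    X[y == 1, 0] += 2.0  # assumed effect: shift the twist column for hotspots
    return X, y


def train_and_rank(X, y, seed=0):
    """Fit a random forest, score it on held-out data, and rank the features."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )
    model = RandomForestClassifier(n_estimators=200, random_state=seed)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    scores = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
    }
    # Impurity-based importances, sorted from most to least informative.
    ranking = sorted(
        zip(FEATURE_NAMES, model.feature_importances_),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return scores, ranking


if __name__ == "__main__":
    X, y = make_synthetic_frames()
    scores, ranking = train_and_rank(X, y)
    print(scores)
    for name, importance in ranking:
        print(f"{name:>6s}  {importance:.3f}")
```

Because only one synthetic column carries class signal, it should dominate the ranking; with real trajectory data, correlated helical parameters can share importance, so permutation importance may be a useful cross-check.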

Results: We show our optimized model consistently achieved exceptional performance, with accuracy, precision, and F1 scores exceeding 91%. Our feature selection approach uncovered that discernible variance in regional base pair rotation played a pivotal role in informing the decisions of our model. Notably, these disparities were highly conserved among TP53 and lacZ duplexes and appeared to be influenced by the regional GC content. As such, our findings suggest that there are indeed conserved topological features distinguishing hotspot and nonhotspot sites, highlighting regional GC content as a potential biomarker for mutation.
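As a small illustration of what "regional GC content" could mean in practice, the sketch below computes the G/C fraction in a window centred on a site of interest. The abstract does not define the region size, so the `flank` parameter, the function name, and the example sequence are assumptions made purely for illustration.

```python
def regional_gc_content(sequence, site, flank=5):
    """Fraction of G/C bases in the window [site - flank, site + flank].

    `site` is a 0-based index; the window is clipped at the sequence ends.
    """
    if not 0 <= site < len(sequence):
        raise IndexError("site is outside the sequence")
    start = max(0, site - flank)
    end = min(len(sequence), site + flank + 1)
    window = sequence[start:end].upper()
    gc_count = sum(base in "GC" for base in window)
    return gc_count / len(window)


if __name__ == "__main__":
    # Made-up 13-mer, not a real TP53, cII, or lacZ fragment.
    strand = "ATGCGCGGATATA"
    print(regional_gc_content(strand, site=6, flank=3))
```

A value like this could be computed per adduct site and compared against the helical features a classifier finds informative; the window width would need to be chosen to match the duplex length actually simulated.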

Availability and implementation: Code for comparing machine learning classifiers and evaluating their performance is available at https://github.com/jdavies24/ML-Classifier-Comparison, and code for analysing DNA structure with Curves+ and Canal using Random Forest is available at https://github.com/jdavies24/ML-classification-of-DNA-trajectories.




Source journal

Bioinformatics advances

CiteScore

1.60

Self-citation rate

0.00%

Publications