A sequence-based two-layer predictor for identifying enhancers and their strength through enhanced feature extraction

IF 0.7 · Zone 4 (Biology) · Q4 · MATHEMATICAL & COMPUTATIONAL BIOLOGY

Journal of Bioinformatics and Computational Biology · Pub Date: 2022-03-09 · DOI: 10.1142/S0219720022500056

Santhosh Amilpur, Raju Bhukya


Citations: 1

Abstract

Enhancers are short regulatory DNA fragments that are bound by proteins called activators. They are free-bound, distant elements that play a vital role in controlling gene expression. Identifying enhancers and their strength is challenging because of their dynamic nature. Although some machine learning methods exist to accelerate the identification process, their prediction accuracy and efficiency need further improvement. We therefore propose a two-layer prediction model with an enhanced feature extraction strategy that combines features from an improved position-specific amino acid propensity (PSTKNC) method with Enhanced Nucleic Acid Composition (ENAC) and Composition of k-spaced Nucleic Acid Pairs (CKSNAP). The feature sets from all three extraction approaches are concatenated and passed through a simple artificial neural network (ANN) that identifies enhancers in the first layer and their strength in the second layer. Experiments are conducted on the benchmark nine-cell-line chromatin dataset, and 10-fold cross-validation is used to evaluate the model's performance. The results show that the proposed model achieves an accuracy of 94.50% and a Matthews correlation coefficient (MCC) of 0.8903 in predicting enhancers, and that it also performs well on an independent test when compared with existing methods.
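The feature concatenation and the MCC metric mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It covers only ENAC and CKSNAP as they are commonly defined (sliding-window nucleotide frequencies, and frequencies of nucleotide pairs separated by a gap); the PSTKNC features and the neural network are omitted because the abstract does not specify them. The function names, the window size of 5, and the maximum gap of 2 are assumptions made for illustration.

```python
from itertools import product
from math import sqrt

NUCLEOTIDES = "ACGT"
# All 16 ordered nucleotide pairs: AA, AC, AG, AT, CA, ...
PAIRS = ["".join(p) for p in product(NUCLEOTIDES, repeat=2)]


def enac(seq, window=5):
    """ENAC: frequency of each nucleotide inside every sliding window.

    The window size is an assumed value, not taken from the paper.
    """
    features = []
    for start in range(len(seq) - window + 1):
        chunk = seq[start:start + window]
        features.extend(chunk.count(n) / window for n in NUCLEOTIDES)
    return features


def cksnap(seq, max_gap=2):
    """CKSNAP: frequency of nucleotide pairs separated by 0..max_gap positions.

    The maximum gap is an assumed value, not taken from the paper.
    """
    features = []
    for gap in range(max_gap + 1):
        counts = dict.fromkeys(PAIRS, 0)
        total = 0
        for i in range(len(seq) - gap - 1):
            pair = seq[i] + seq[i + gap + 1]
            if pair in counts:  # skip pairs containing ambiguous bases such as N
                counts[pair] += 1
                total += 1
        features.extend(counts[p] / total if total else 0.0 for p in PAIRS)
    return features


def combined_features(seq, window=5, max_gap=2):
    """Concatenate the two feature sets into one vector for a classifier."""
    return enac(seq, window) + cksnap(seq, max_gap)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

In a two-layer setup like the one described, a vector from `combined_features` would feed a first classifier (enhancer vs. non-enhancer), and sequences predicted as enhancers would feed a second classifier (strong vs. weak). `mcc` would then be computed per layer from the cross-validation confusion counts.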


Source journal

Journal of Bioinformatics and Computational Biology (MATHEMATICAL & COMPUTATIONAL BIOLOGY)

CiteScore: 2.10

Self-citation rate: 0.00%

About the journal: The Journal of Bioinformatics and Computational Biology aims to publish high-quality, original research articles, expository tutorial papers and review papers, as well as short, critical comments on technical issues associated with the analysis of cellular information. The research papers will be technical presentations of new assertions, discoveries and tools, intended for a narrower specialist community. The tutorials, reviews and critical commentary will be targeted at a broader readership of biologists who are interested in using computers but are not knowledgeable about scientific computing, and equally, computer scientists who have an interest in biology but are not familiar with its current thrusts or its language. Such carefully chosen tutorials and articles should greatly accelerate the rate of entry of these new creative scientists into the field.