Transcription Factor Bound Regions Prediction: Word2Vec Technique with Convolutional Neural Network

Rixin Chen, Ruoxi Dai, Mingye Wang
DOI: 10.4236/jilsa.2020.121001 (https://doi.org/10.4236/jilsa.2020.121001)
Journal: Journal of Intelligent Learning Systems and Applications
Published: 2020-01-01
Type: Journal Article
Citations: 1

Abstract

Genome-wide epigenomic datasets allow us to validate the biological function of motifs and to understand regulatory mechanisms more comprehensively. How different motifs determine whether transcription factors (TFs) can bind to DNA at a specific position is a critical research question. In this project, we apply computational techniques from Natural Language Processing (NLP) to predict Transcription Factor Bound Regions (TFBRs) given motif instances. Most existing deep-neural-network methods for motif prediction take one-hot-encoded base sequences as the input feature for identifying TFBRs, which leads to low resolution and only an indirect view of binding mechanisms. Moreover, the collective effect of motifs on binding sites is difficult to determine. In our pipeline, we apply the Word2Vec algorithm with motif names as input, and use a Convolutional Neural Network (CNN) to predict TFBRs as a binary classification task on the ENCODE dataset. We treat different types of motifs as separate "words" and their corresponding TFBR as the meaning of a "sentence". A "sentence" is simply a combination of these motifs, and all "sentences" together compose the whole "passage". For each binding site, we perform binary classification within different cell types to show how our model performs across binding sites and cell types. Each "word" has a corresponding high-dimensional vector, and the distances between vectors can be computed, so we can extract both the similarity between motifs and an explicit binding mechanism from our model. The CNN extracts features, through feature mapping and pooling, from the motif vectors produced by the Word2Vec algorithm, and reaches a peak accuracy of 87%.
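The "words" and "sentences" analogy above can be made concrete with a small sketch. The Python below is an illustration under stated assumptions, not the authors' implementation: the motif names, embedding size, window size, and filter count are hypothetical, random vectors stand in for trained Word2Vec embeddings, and the convolutional weights are untrained. It shows only the shape of the computation: one vector per motif name, cosine similarity between motif vectors, and a 1D convolution with max pooling feeding a sigmoid score for the bound/unbound decision.

```python
import math
import random

random.seed(0)

# Hypothetical toy corpus: each region is a "sentence" of motif names ("words").
# These names are illustrative and are not taken from the paper's ENCODE data.
sentences = [
    ["CTCF", "SP1", "YY1", "SP1"],
    ["GATA1", "TAL1", "SP1"],
    ["CTCF", "YY1"],
]

EMBED_DIM = 4    # assumed embedding size
KERNEL = 2       # assumed convolution window, in motifs
NUM_FILTERS = 3  # assumed number of convolution filters

# Stand-in for trained Word2Vec vectors: one random vector per motif name.
vocab = sorted({motif for s in sentences for motif in s})
embedding = {m: [random.uniform(-1, 1) for _ in range(EMBED_DIM)] for m in vocab}


def cosine_similarity(u, v):
    """Cosine of the angle between two motif vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def conv_max_pool(sentence, filters):
    """Slide each filter over windows of KERNEL consecutive motif vectors,
    apply ReLU, and keep the maximum activation per filter."""
    vectors = [embedding[m] for m in sentence]
    pooled = []
    for weights in filters:
        activations = []
        for start in range(len(vectors) - KERNEL + 1):
            # Flatten the window of motif vectors into one list of numbers.
            window = [x for vec in vectors[start:start + KERNEL] for x in vec]
            z = sum(w * x for w, x in zip(weights, window))
            activations.append(max(0.0, z))
        pooled.append(max(activations))
    return pooled


def predict_bound(sentence, filters, out_weights, bias=0.0):
    """Sigmoid over the pooled features: a score strictly between 0 and 1."""
    features = conv_max_pool(sentence, filters)
    z = sum(w * f for w, f in zip(out_weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Untrained, randomly initialised parameters (illustration only).
filters = [
    [random.uniform(-1, 1) for _ in range(KERNEL * EMBED_DIM)]
    for _ in range(NUM_FILTERS)
]
out_weights = [random.uniform(-1, 1) for _ in range(NUM_FILTERS)]

similarity = cosine_similarity(embedding["CTCF"], embedding["YY1"])
scores = [predict_bound(s, filters, out_weights) for s in sentences]
```

In a real pipeline the embeddings would come from a Word2Vec model trained on the motif "sentences", and the filters and output weights would be learned from labeled bound and unbound regions. The scores produced here carry no predictive meaning.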