Transcription Factor Bound Regions Prediction: Word2Vec Technique with Convolutional Neural Network

Journal of Intelligent Learning Systems and Applications | Pub Date: 2020-01-01 | DOI: 10.4236/jilsa.2020.121001

Rixin Chen, Ruoxi Dai, Mingye Wang


Citations: 1

Abstract



Genome-wide epigenomic datasets allow us to validate the biological function of motifs and to understand regulatory mechanisms more comprehensively. How different motifs determine whether transcription factors (TFs) can bind to DNA at a specific position is a critical research question. In this project, we apply computational techniques from Natural Language Processing (NLP) to predict Transcription Factor Bound Regions (TFBRs) given motif instances. Most existing motif prediction methods based on deep neural networks take one-hot-encoded base sequences as the input feature for TFBR identification, which gives low resolution and reveals binding mechanisms only indirectly. Moreover, the collective effect of motifs on binding sites is difficult to work out. In our pipeline, we apply the Word2Vec algorithm with motif names as input and use a Convolutional Neural Network (CNN) to predict TFBRs as a binary classification task, based on the ENCODE dataset. We treat different types of motifs as separate “words” and their corresponding TFBR as the meaning of a “sentence”. Each “sentence” is simply a combination of these motifs, and all “sentences” together make up the whole “passage”. For each binding site, we perform binary classification within different cell types to show how our model performs across binding sites and cell types. Each “word” has a corresponding high-dimensional vector, and the distances between vectors can be computed, so we can extract the similarity between motifs, as well as an explicit binding mechanism, from our model. The CNN extracts features from the motif vectors produced by Word2Vec through feature mapping and pooling, and reaches a peak accuracy of 87%.
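The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the motif names, the embedding size, the kernel width and every weight below are assumptions chosen for illustration. Randomly initialised vectors stand in for trained Word2Vec embeddings, and the convolution weights are untrained, so the printed probabilities carry no biological meaning. Cosine similarity is used as one common way to compare embedding vectors; the abstract does not say which distance the authors used.

```python
import math
import random

# Hypothetical motif "sentences": each region is the ordered list of
# motif names found in it (names are illustrative, not from the paper).
regions = [
    ["CTCF", "SP1", "YY1"],
    ["GATA1", "TAL1", "SP1", "KLF1"],
]

EMBED_DIM = 8       # assumed embedding size
KERNEL_WIDTH = 2    # assumed number of adjacent motifs seen by one filter
NUM_KERNELS = 4     # assumed number of convolution filters
rng = random.Random(0)

# Stand-in for trained Word2Vec vectors: one random vector per motif "word".
vocab = sorted({motif for region in regions for motif in region})
embedding = {m: [rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for m in vocab}


def cosine_similarity(u, v):
    """Cosine of the angle between two motif vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def conv1d_max_pool(seq, kernel, bias):
    """Slide one kernel over the motif sequence, apply ReLU, then global max pooling."""
    width = len(kernel)
    activations = []
    for start in range(len(seq) - width + 1):
        total = bias
        for k in range(width):
            total += sum(w * x for w, x in zip(kernel[k], seq[start + k]))
        activations.append(max(0.0, total))
    return max(activations)


def predict_bound_probability(motifs, kernels, biases, out_weights, out_bias):
    """Forward pass: embed motifs, convolve and pool, then apply a sigmoid output."""
    seq = [embedding[m] for m in motifs]
    features = [conv1d_max_pool(seq, k, b) for k, b in zip(kernels, biases)]
    logit = out_bias + sum(w * f for w, f in zip(out_weights, features))
    return 1.0 / (1.0 + math.exp(-logit))


# Untrained, randomly initialised weights (assumption: training is omitted here).
kernels = [
    [[rng.uniform(-0.5, 0.5) for _ in range(EMBED_DIM)] for _ in range(KERNEL_WIDTH)]
    for _ in range(NUM_KERNELS)
]
biases = [0.0] * NUM_KERNELS
out_weights = [rng.uniform(-0.5, 0.5) for _ in range(NUM_KERNELS)]
out_bias = 0.0

probabilities = [
    predict_bound_probability(r, kernels, biases, out_weights, out_bias)
    for r in regions
]

if __name__ == "__main__":
    print("similarity(CTCF, SP1):", cosine_similarity(embedding["CTCF"], embedding["SP1"]))
    for region, p in zip(regions, probabilities):
        print(region, "-> bound probability", round(p, 3))
```

In a real setting the embedding table would come from a Word2Vec model trained on motif sequences, and the convolution and output weights would be fitted against bound/unbound labels for each TF and cell type. Regions shorter than the kernel width would also need padding, which this sketch does not handle.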
