ENSPART: An Ensemble Framework Based on Data Partitioning for DNA Motif Analysis

2016 IEEE 16th International Conference on Bioinformatics and Bioengineering (BIBE). Pub Date: 2016-10-01. DOI: 10.1109/BIBE.2016.68

Nung Kion Lee, Allen Chieng Hoon Choong, Norshafrina Omar


Citations: 2

Abstract


ENSPART: An Ensemble Framework Based on Data Partitioning for DNA Motif Analysis

This paper proposes an ensemble approach based on data partitioning for large-scale DNA motif analysis. Motif prediction on genome-scale datasets is challenging because of its high time and space complexity. Existing ensemble approaches, while shown to improve performance, are applicable only to small datasets. Our approach, called ENSPART, first partitions the input dataset into non-overlapping subsets, which serve as input to multiple distinct motif prediction tools. It is assumed that the core motifs of a transcription factor protein exist in all data subsets. We employed seven motif prediction tools to obtain initial candidate motifs, which are merged according to their sequence content similarity. An alignment-free method is used to establish motif similarity. A novel motif merging method is proposed to merge similar motifs obtained by tools in different data partitions. Ten genome-wide ChIP datasets were collected for evaluation. We compared our approach with MEME-ChIP and obtained improved results for nine of the ten datasets in terms of Area Under Curve (AUC). For most datasets, the AUC improved by 5 to 10%. Our approach shows the promise of data-partitioning-based ensemble approaches for large-scale motif prediction.
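The partition-then-merge idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the partitioning scheme, the alignment-free similarity measure, or the merging rule, so everything below is an assumption made for illustration: a shuffled round-robin split, cosine similarity over k-mer counts as a stand-in alignment-free measure, and a greedy threshold-based grouping. The function names (`partition`, `kmer_profile`, `cosine`, `merge_motifs`) and the parameters `k` and `threshold` are hypothetical and are not taken from ENSPART. The step that runs the motif prediction tools on each subset is omitted.

```python
import random
from collections import Counter
from math import sqrt


def partition(sequences, n_parts, seed=0):
    """Shuffle, then split sequences into n_parts non-overlapping subsets."""
    shuffled = list(sequences)
    random.Random(seed).shuffle(shuffled)
    # Round-robin slicing: every sequence lands in exactly one subset.
    return [shuffled[i::n_parts] for i in range(n_parts)]


def kmer_profile(sites, k=3):
    """Count all k-mers across the binding-site strings of one motif."""
    counts = Counter()
    for site in sites:
        for i in range(len(site) - k + 1):
            counts[site[i:i + k]] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two k-mer count profiles (no alignment)."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def merge_motifs(motifs, threshold=0.7, k=3):
    """Greedily pool motifs whose k-mer profiles are similar.

    Each motif is a list of site strings. A motif joins the first group
    whose pooled profile reaches the threshold; otherwise it starts a
    new group. This is an assumed rule, not the paper's merging method.
    """
    groups = []  # each entry: (pooled_sites, pooled_profile)
    for sites in motifs:
        profile = kmer_profile(sites, k)
        for pooled_sites, pooled_profile in groups:
            if cosine(profile, pooled_profile) >= threshold:
                pooled_sites.extend(sites)
                pooled_profile.update(profile)  # Counter.update adds counts
                break
        else:
            groups.append((list(sites), profile))
    return [pooled_sites for pooled_sites, _ in groups]


if __name__ == "__main__":
    subsets = partition([f"seq{i}" for i in range(10)], n_parts=3)
    print([len(s) for s in subsets])

    # Toy candidate motifs, as if reported by tools on different partitions.
    candidates = [
        ["TGACTCA", "TGAGTCA"],
        ["TGACTCA", "TGACTCA"],
        ["GGGCGGG", "GGGCGGG"],
    ]
    print(len(merge_motifs(candidates)))
```

In this toy input the first two candidates share most of their 3-mers and are pooled, while the third shares none with them and forms its own group. A real pipeline would need a similarity measure and threshold validated against known motifs.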

