Nung Kion Lee, Allen Chieng Hoon Choong, Norshafrina Omar
2016 IEEE 16th International Conference on Bioinformatics and Bioengineering (BIBE), published 2016-10-01. DOI: 10.1109/BIBE.2016.68 (https://doi.org/10.1109/BIBE.2016.68)
ENSPART: An Ensemble Framework Based on Data Partitioning for DNA Motif Analysis
This paper proposes an ensemble approach based on data partitioning for large-scale DNA motif analysis. Motif prediction on genome-scale datasets is challenging because of its high time and space complexity. Existing ensemble approaches, although shown to improve performance, are applicable only to small datasets. Our approach, called ENSPART, first partitions the input dataset into non-overlapping subsets, which serve as input to multiple distinct motif prediction tools. It is assumed that the core motifs of a transcription factor protein exist in all data subsets. We employ seven motif prediction tools to obtain initial candidate motifs, which are then merged according to their sequence content similarity. An alignment-free method is used to establish motif similarity, and a novel motif merging method is proposed to merge similar motifs obtained by tools in different data partitions. Ten genome-wide ChIP datasets are collected for evaluation. We compare our approach with MEME-ChIP and obtain improved results on nine of the ten datasets in terms of Area Under Curve (AUC). For most datasets, AUC improves by 5% to 10%. These results show the promise of data-partitioning-based ensemble approaches for large-scale motif prediction.
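The partition-then-merge idea described above can be illustrated with a minimal sketch. The abstract does not specify how the data are partitioned, which alignment-free measure is used, or how motifs are merged, so everything here is an assumption made for illustration: random partitioning, cosine similarity over k-mer counts, a greedy pooling rule, the threshold value, and all function names are hypothetical and are not taken from ENSPART itself. The step of running the motif prediction tools on each subset is omitted.

```python
import random
from collections import Counter
from math import sqrt


def partition(sequences, n_parts, seed=0):
    """Shuffle and split sequences into n_parts non-overlapping subsets."""
    shuffled = list(sequences)
    random.Random(seed).shuffle(shuffled)
    # Striding by n_parts yields disjoint subsets that together cover the input.
    return [shuffled[i::n_parts] for i in range(n_parts)]


def kmer_profile(sites, k=3):
    """Count k-mers over a motif's binding sites (no alignment required)."""
    counts = Counter()
    for site in sites:
        for i in range(len(site) - k + 1):
            counts[site[i:i + k]] += 1
    return counts


def cosine(p, q):
    """Cosine similarity between two k-mer count profiles."""
    dot = sum(p[key] * q[key] for key in p.keys() & q.keys())
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0


def merge_motifs(motifs, threshold=0.5, k=3):
    """Greedily pool candidate motifs whose k-mer profiles are similar.

    Each motif is represented as a list of site strings. The threshold is an
    arbitrary illustrative value, not one reported by the authors.
    """
    merged = []
    for sites in motifs:
        for group in merged:
            if cosine(kmer_profile(group, k), kmer_profile(sites, k)) >= threshold:
                group.extend(sites)
                break
        else:
            # No existing group was similar enough, so start a new one.
            merged.append(list(sites))
    return merged
```

In this sketch, candidate motifs reported for different partitions would be passed to `merge_motifs` as lists of predicted sites, for example `merge_motifs([["TGACTCA", "TGAGTCA"], ["TGACTCA", "TGACTCA"], ["GGGCGGG", "GGGCGGG"]])`, which pools the first two candidates and keeps the third separate. Representing motifs as site lists rather than position weight matrices is a simplification chosen to keep the example short.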