Jing-doo Wang, Wen-Ling Chan, Charles C. N. Wang, Jan-Gowth Chang, J. Tsai
Title: Mining distinctive DNA patterns from the upstream of human coding&non-coding genes via class frequency distribution
Venue: 2016 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB)
Publication date: 2016-10-01
DOI: https://doi.org/10.1109/CIBCB.2016.7758114
Citation count: 3
Abstract
The upstream regions of genes are expected to contain many still-unknown regulatory regions that can increase or decrease the expression of specific genes. Mining distinctive patterns (regions) consists of two steps: extracting maximal repeats (patterns) from the upstream DNA sequences of human genes, and then retaining only the patterns whose class frequency distribution matches the one specified by domain experts. The class frequency distribution of a pattern is the set of frequencies with which that pattern appears in each class. Extracting maximal repeats while computing their class frequency distributions can be done with a scalable approach, based on previous work, that uses the MapReduce programming model. The experimental data consist of the 5,000 bp upstream DNA sequences of 49,267 human coding and non-coding genes. The genes are divided into four classes: “non-cancer related protein-coding gene”, “oncogene”, “tumor suppressor gene”, and “non-coding genes” (RNA). Experimental results show that 17 distinctive patterns were selected as core patterns: each is longer than 36 bp, appears in more than 3,000 genes, and occurs in all four classes. For a more specific observation, 22 distinctive patterns were selected that appear in at least 10 genes, are longer than 15 bp, and, most importantly, occur in only two classes, “oncogene” and “tumor suppressor gene”. It is attractive to extend this approach, based on the class frequency distribution of selected patterns, to mine other types of distinctive patterns, e.g. biomarkers, provided that in the future the targeted genomic sequences, containing “genotypes”, become available and each sequence is labeled precisely according to features, e.g. “phenotypes”, specified by domain experts.
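To make the notion of a class frequency distribution concrete, the following is a minimal Python sketch under stated assumptions, not the authors' implementation. It assumes that "frequency" means the number of genes in a class whose upstream sequence contains the pattern, it takes the candidate patterns as given instead of extracting maximal repeats, and it runs on a single machine rather than on MapReduce. The class labels, function names, parameters, and toy sequences are all hypothetical.

```python
# Hypothetical class labels standing in for the four classes in the abstract.
CLASSES = ("non_cancer_coding", "oncogene", "tumor_suppressor", "non_coding")


def class_frequency_distribution(sequences, patterns):
    """Count, per pattern and per class, how many sequences contain the pattern.

    sequences: list of (class_label, dna_string) pairs, one per gene.
    patterns:  candidate patterns (assumed given; maximal-repeat
               extraction is out of scope for this sketch).
    """
    dist = {p: {c: 0 for c in CLASSES} for p in patterns}
    for label, seq in sequences:
        for p in patterns:
            if p in seq:  # count each gene at most once per pattern
                dist[p][label] += 1
    return dist


def select_patterns(dist, min_length, min_genes, required, forbidden=()):
    """Keep patterns whose distribution matches an expert-style specification.

    A pattern is kept if it is longer than min_length, appears in at least
    min_genes genes overall, appears in every class in `required`, and
    appears in no class in `forbidden`.
    """
    selected = []
    for p, freq in dist.items():
        if len(p) <= min_length:
            continue
        if sum(freq.values()) < min_genes:
            continue
        if any(freq[c] == 0 for c in required):
            continue
        if any(freq[c] > 0 for c in forbidden):
            continue
        selected.append(p)
    return sorted(selected)


# Toy data (invented for illustration only).
toy_sequences = [
    ("oncogene", "ACGTACGTTT"),
    ("tumor_suppressor", "GGACGTACCC"),
    ("non_coding", "TTTTGGGGCC"),
    ("non_cancer_coding", "CCCCTTTTGG"),
]
toy_patterns = ["ACGTAC", "TTTT", "GG"]

toy_dist = class_frequency_distribution(toy_sequences, toy_patterns)

# Patterns seen only in the two cancer-related classes.
cancer_only = select_patterns(
    toy_dist,
    min_length=4,
    min_genes=2,
    required=("oncogene", "tumor_suppressor"),
    forbidden=("non_cancer_coding", "non_coding"),
)
print(cancer_only)
```

With thresholds in the spirit of the abstract, the second selection would correspond roughly to `min_length=15`, `min_genes=10`, and the same `required` and `forbidden` classes. The exact inclusive or exclusive boundaries used in the paper are not stated here, so those values are only an approximation.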