Extracting the Co-occurrences of DNA Maximal Repeats in both Human and Viruses
Authors: Jing-doo Wang, Yi-Chun Wang, Rouh-Mei Hu, J. Tsai
Venue: 2017 IEEE 17th International Conference on Bioinformatics and Bioengineering (BIBE)
Published: 2017-10-01
DOI: 10.1109/BIBE.2017.00-70 (https://doi.org/10.1109/BIBE.2017.00-70)
Citation count: 3
Abstract
This paper aims to extract significant DNA sequences that appear in both the human genome and viral genomes. To extract co-occurring DNA sequences that are as long as possible, the study adopts a scalable approach that is based on the Hadoop MapReduce programming model, extracts maximal repeats, and at the same time computes the class frequency distribution of these repeats. The human genome and all 4,388 viral genomes available in NCBI were downloaded on 14 January 2017. So that viral taxonomy can be taken into account in further observation, only the 2,712 viruses that have been assigned a genus name are selected from those 4,388. In the experiments, the taxonomic level “genus” serves as the unit (class) when comparing viruses with human. Experimental results show that the longest DNA sequence found in both human and viruses in this study is 463 base pairs (bp) long; it consists of tandem repeats of the form “(CTAACC)n” and appears in human chromosome 5 and in the virus “Human herpesvirus 6B”. Virologists may find it worthwhile to investigate why such a long DNA fragment exists in both human and that virus. More broadly, this study may offer a new direction for comparing genomic sequences across classes, one that can provide clues about whether a relationship exists between DNA maximal repeats with a biased class frequency distribution (genotypes) and the features of the classes (phenotypes).
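To make the two central notions concrete, here is a minimal in-memory Python sketch of maximal-repeat extraction with a per-class frequency count. It is not the paper's Hadoop MapReduce pipeline: it is a brute-force toy that only works on short strings, and it assumes the common definition of a maximal repeat (a substring that occurs at least twice and whose occurrences differ in both their left and right neighboring characters). The function names, the toy sequences, and the class labels are all hypothetical.

```python
from collections import Counter, defaultdict


def maximal_repeats(sequences, min_len=2):
    """Brute-force maximal repeats with a class frequency distribution.

    sequences: dict mapping a sequence name to (class_label, dna_string).
    Returns a dict mapping each maximal repeat to a Counter of
    class_label -> number of occurrences.
    """
    # substring -> list of (left neighbor, right neighbor, class label)
    occurrences = defaultdict(list)
    for name, (label, seq) in sequences.items():
        n = len(seq)
        for i in range(n):
            # A sequence boundary acts as a neighbor unique to that sequence.
            left = seq[i - 1] if i > 0 else ("start", name)
            for j in range(i + min_len, n + 1):
                right = seq[j] if j < n else ("end", name)
                occurrences[seq[i:j]].append((left, right, label))

    result = {}
    for sub, hits in occurrences.items():
        if len(hits) < 2:
            continue  # not a repeat
        lefts = {h[0] for h in hits}
        rights = {h[1] for h in hits}
        # Maximal: cannot be extended left or right and keep every occurrence.
        if len(lefts) > 1 and len(rights) > 1:
            result[sub] = Counter(h[2] for h in hits)
    return result


def shared_with_human(repeats, human_label="human"):
    """Keep repeats seen in the human class and in at least one other class."""
    return {
        sub: dist
        for sub, dist in repeats.items()
        if human_label in dist and len(dist) > 1
    }


# Hypothetical toy data: one human fragment and two viral fragments,
# with the viral class label standing in for a genus.
toy = {
    "human_fragment": ("human", "GGCTAACCCTAACCTT"),
    "virus_a": ("GenusA", "ACTAACCCTAACCA"),
    "virus_b": ("GenusB", "TTGACA"),
}
repeats = maximal_repeats(toy)
shared = shared_with_human(repeats)
longest = max(shared, key=len)
print(longest, dict(shared[longest]))
```

The brute-force enumeration is quadratic in sequence length per sequence, which is why a distributed approach such as the one the abstract describes is needed at genome scale; the sketch only illustrates what is being counted.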