A Comparison Study of Virus Classification by Genome Sequences
Jing-doo Wang
2011 IEEE 11th International Conference on Bioinformatics and Bioengineering, 2011-10-24
DOI: 10.1109/BIBE.2011.47 (https://doi.org/10.1109/BIBE.2011.47)
In this study, instead of traditional approaches to virus classification, we propose a novel approach based on the vector space model that classifies viruses using two types of genome sequences, DNA and CDS. For DNA sequences, the k-mer approach was adopted for pattern extraction, and the entropy of each pattern's frequency distribution among classes was used for pattern weighting. For CDS sequences, pattern extraction was instead based on identifying distinctive protein functions, which were formed by CDS clustering, and a weighting method similar to the $tf*idf$ approach was proposed for these protein functions. The experimental data were downloaded from NCBI and comprised $1,877$ selected viruses in 35 classes (virus families). The highest classification accuracies obtained with an SVM classifier were $94.7\%$ and $91.3\%$ for DNA and CDS sequences, respectively. This study not only proposes a novel approach to virus classification but also provides a new methodology for comparative genomic analysis.
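The abstract does not give the exact weighting formula, so the following is only a minimal sketch of how k-mer extraction with entropy-based weighting might look. The function names `kmer_counts` and `entropy_weights`, the normalization `1 - H/log(C)`, and the use of raw per-class counts are all assumptions made for illustration, not the paper's method.

```python
import math
from collections import Counter, defaultdict


def kmer_counts(seq, k):
    """Count overlapping k-mers in a DNA sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def entropy_weights(labeled_seqs, k):
    """Weight each k-mer by 1 - H/log(C).

    H is the entropy of the k-mer's frequency distribution over the
    C classes. This normalization is an assumption: a k-mer found in
    only one class gets weight 1, one spread evenly gets weight 0.
    """
    # Accumulate k-mer counts per class label.
    per_class = defaultdict(Counter)
    for label, seq in labeled_seqs:
        per_class[label].update(kmer_counts(seq, k))

    classes = list(per_class)
    max_h = math.log(len(classes)) if len(classes) > 1 else 1.0

    weights = {}
    for kmer in set().union(*per_class.values()):
        freqs = [per_class[c][kmer] for c in classes]
        total = sum(freqs)
        h = -sum(f / total * math.log(f / total) for f in freqs if f > 0)
        weights[kmer] = 1.0 - h / max_h
    return weights
```

A call such as `entropy_weights([("FamilyA", "AAAA"), ("FamilyB", "CCCC")], 2)` returns a dictionary mapping each k-mer to its weight, which could then scale the k-mer count vectors passed to a classifier. Because raw counts are used, larger classes dominate the distribution; normalizing counts per class first would be a reasonable variation.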