A feature extraction approach based on complex networks for genomic sequences recognition
Bruno Mendes Moro Conque, A. Kashiwabara, Fabricio M. Lopes
2016 9th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), 2016-10-01
DOI: 10.1109/CISP-BMEI.2016.7853010 (https://doi.org/10.1109/CISP-BMEI.2016.7853010)
Citations: 1
Abstract
The development of new genomic sequencing techniques generates a huge volume of biological data. In this context, it is important to develop new pattern recognition methods and improve their accuracy in order to support the analysis of this volume of data. In particular, a valuable source of information in genomic sequences is the organization of their nucleotides. This work presents an effective feature extraction approach for genomic sequences, which is based on mapping each genomic sequence to a complex network representation. The nodes of the network are defined by the nucleotides, dinucleotides or trinucleotides within the sequence, according to two parameters: Word Size (WS) and Step (ST). The edges are estimated by observing the respective adjacency among the nucleotides in the genomic sequence. Complex network measures are then extracted from each network in order to generate a feature vector for each genomic sequence. For each biological sequence, the entropy, the sum of entropy and its maximum value are also adopted. A dataset containing three classes of genomic sequences (coding, intergenic and TSS, i.e. transcription start sites) was adopted in order to evaluate the proposed approach. Random Forest achieved 91.2% accuracy, followed by J48 with 89.1% and SVM with 84.8%, without including any source of a priori information, i.e., considering only the genomic sequences. These results indicate the suitability, effectiveness and robustness of the proposed feature extraction approach for the classification of the adopted classes of genomic sequences.
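The mapping described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that words of length `word_size` are read every `step` positions, that a directed edge links each word to the word read next, and that entropy is the Shannon entropy of the edge weights. The function names and the particular measures returned are hypothetical, since the abstract does not list which network measures or entropy definitions the paper uses.

```python
from collections import Counter
from math import log2


def sequence_to_network(seq, word_size=3, step=1):
    """Map a sequence to a directed, weighted network (illustrative only).

    Nodes are words of length `word_size`, read every `step` positions.
    Assumption: an edge links each word to the word read immediately after it.
    """
    words = [seq[i:i + word_size]
             for i in range(0, len(seq) - word_size + 1, step)]
    edges = Counter(zip(words, words[1:]))  # (source, target) -> occurrence count
    return set(words), edges


def shannon_entropy(counts):
    """Shannon entropy, in bits, of a collection of non-negative counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)


def feature_vector(seq, word_size=3, step=1):
    """Collect a few simple network measures (hypothetical selection)."""
    nodes, edges = sequence_to_network(seq, word_size, step)
    # Number of distinct successors of each source word.
    out_degree = Counter(src for src, _ in edges)
    n = len(nodes)
    return {
        "nodes": n,
        "edges": len(edges),
        "mean_out_degree": len(edges) / n if n else 0.0,
        "max_out_degree": max(out_degree.values(), default=0),
        "edge_weight_entropy": shannon_entropy(edges.values()) if edges else 0.0,
    }


if __name__ == "__main__":
    # Dinucleotide words (word_size=2) read one position at a time (step=1).
    print(feature_vector("ACGTACGT", word_size=2, step=1))
```

Feature vectors of this kind, computed for several combinations of word size and step, could then be passed to a standard classifier such as Random Forest; that step is omitted here because the abstract gives no details of the classifier settings.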