Enhanced Determination of Gene Groups Based on Optimal Kernel PCA with Hierarchical Clustering Algorithm
Nwayyin Najat Mohammed, Chewan Jalal Mohammed
2021 55th Annual Conference on Information Sciences and Systems (CISS), 2021-03-24
DOI: 10.1109/CISS50987.2021.9400214 (https://doi.org/10.1109/CISS50987.2021.9400214)
Citations: 0
Abstract
Gene expression datasets are large and complex, and they are regarded as a rich source of valuable, informative genes associated with specific diseases. Identifying informative gene groups within them is therefore a challenging task. In this study, a kernel function is selected for kernel principal component analysis (KPCA) on two candidate gene expression datasets in order to reduce their dimensionality and extract their most important features. Gaussian and polynomial kernel functions are constructed, and the optimal one is chosen. The datasets are preprocessed prior to analysis. When applied to gene expression datasets, principal component analysis influences the performance of pattern detection algorithms. We combine optimal KPCA with hierarchical clustering to partition the gene expression datasets, and the proposed algorithm (KPCA-HC) achieves enhanced clustering performance. The adjusted Rand index (ARI) is used as the validity index to evaluate the proposed algorithm. The proposed optimal KPCA-HC algorithm yields high validity index values.
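The pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes scikit-learn, uses synthetic data from `make_blobs` in place of the two gene expression datasets, and picks the kernel with the higher ARI against known labels, a selection criterion the abstract does not specify. The helper `kpca_hc`, the kernel parameters (`gamma`, `degree`, `coef0`), the number of components, and Ward linkage are all illustrative choices.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.decomposition import KernelPCA
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler


def kpca_hc(X, n_clusters, kernel, n_components=5, **kernel_params):
    """Hypothetical helper: project X with kernel PCA, then cluster the projection."""
    projection = KernelPCA(
        n_components=n_components, kernel=kernel, **kernel_params
    ).fit_transform(X)
    # Ward linkage is an assumption; the abstract only says "hierarchical clustering".
    clustering = AgglomerativeClustering(n_clusters=n_clusters, linkage="ward")
    return clustering.fit_predict(projection)


# Synthetic stand-in for an expression matrix: 90 rows to cluster, 200 features.
X, y_true = make_blobs(
    n_samples=90, n_features=200, centers=3, cluster_std=2.0, random_state=0
)
# Preprocessing step (assumed here to be per-feature standardization).
X = StandardScaler().fit_transform(X)

# Candidate kernels named in the abstract; the parameter values are assumptions.
gamma = 1.0 / X.shape[1]
candidates = {
    "gaussian": {"kernel": "rbf", "gamma": gamma},
    "polynomial": {"kernel": "poly", "degree": 2, "gamma": gamma, "coef0": 1.0},
}

# Score each kernel by the ARI between the known labels and the KPCA-HC partition.
scores = {}
for name, params in candidates.items():
    labels = kpca_hc(X, n_clusters=3, **params)
    scores[name] = adjusted_rand_score(y_true, labels)

best_kernel = max(scores, key=scores.get)
for name, ari in scores.items():
    print(f"{name}: ARI = {ari:.3f}")
print("selected kernel:", best_kernel)
```

Selecting the kernel by ARI requires reference labels, which is only possible on benchmark datasets. On unlabeled data an internal validity index would have to be substituted.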