Abid Hasan, G. M. Maruf, Shareef, H. A. A. Mamun, Paul Kawn
Pearl: A Journal of Library and Information Science. DOI: 10.5958/J.2249-3212.1.2.2 (https://doi.org/10.5958/J.2249-3212.1.2.2)
Cancer Classification from Microarray Data using Gene Feature Ranking
A significant challenge in DNA (deoxyribonucleic acid) microarray analysis is that datasets contain a large number of features (genes) but only a small number of samples. When statistical methods are applied to microarray data, particular care is required to deal with problems such as the low classification accuracy of models, which is brought about by the small number of features that have predictive capability. To overcome these problems, proper approaches to data normalisation, feature reduction, and identification of the optimal set of genes are critical. In this paper, we apply the Gene Feature Ranking (GFR) [5] method to select genes with high trust values from high-dimensional cancer microarray datasets. Our contribution lies in the use of a different metric for calculating the trust values, one that is more domain-specific for cancer datasets. By choosing a pre-defined threshold based on the user's knowledge, only genes that show sufficient trustworthiness are retained for constructing the classification model. Through experimentation on three microarray datasets, namely Acute Lymphoblastic Leukemia (ALL), lymph node negative primary breast cancer, and High Grade Glioma, we confirm that the classification accuracy obtained with the genes selected by the modified GFR method is consistently higher than when the method is not used.
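The threshold-based selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the trust metric, so the code below substitutes an absolute signal-to-noise ratio between two classes purely as a stand-in; the function names `trust_scores` and `select_genes`, the toy data, and the threshold of 1.0 are all hypothetical and are not taken from the paper.

```python
from statistics import mean, stdev


def trust_scores(samples, labels):
    """Return one stand-in "trust value" per gene.

    ASSUMPTION: the score is the absolute signal-to-noise ratio between
    two classes. The paper's actual trust metric is not given in the
    abstract, so this is only a placeholder for it.
    """
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this sketch handles exactly two classes")
    n_genes = len(samples[0])
    scores = []
    for g in range(n_genes):
        # Expression values of gene g, split by class label.
        a = [row[g] for row, y in zip(samples, labels) if y == classes[0]]
        b = [row[g] for row, y in zip(samples, labels) if y == classes[1]]
        spread = stdev(a) + stdev(b)
        scores.append(abs(mean(a) - mean(b)) / spread if spread > 0 else 0.0)
    return scores


def select_genes(samples, labels, threshold):
    """Keep the indices of genes whose score reaches the user-chosen threshold."""
    scores = trust_scores(samples, labels)
    return [g for g, s in enumerate(scores) if s >= threshold]


# Toy data (hypothetical): 6 samples x 4 genes, two classes.
samples = [
    [5.1, 0.9, 2.0, 7.0],
    [4.9, 1.1, 2.1, 3.0],
    [5.0, 1.0, 1.9, 5.0],
    [1.0, 1.0, 2.0, 6.0],
    [1.2, 0.8, 2.1, 4.0],
    [0.9, 1.2, 1.9, 5.0],
]
labels = ["tumour", "tumour", "tumour", "normal", "normal", "normal"]

kept = select_genes(samples, labels, threshold=1.0)
# Reduced expression matrix containing only the retained genes.
reduced = [[row[g] for g in kept] for row in samples]
print(kept)
```

In this toy example only the first gene differs between the two classes, so it is the only one retained; the reduced matrix would then be passed to whatever classifier is being evaluated.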