Assessing the Significance of Groups in High-Dimensional Data
G. McLachlan. 2010 IEEE International Conference on Data Mining, 2010-12-13. https://doi.org/10.1109/ICDM.2010.171
We consider the problem of assessing the significance of groups in high-dimensional data. In supervised classification, where the group of origin is known for the training data, a guide to the degree of separation among the groups is given by the estimated error rate of a classifier formed to allocate a new observation to one of the groups. Even with labelled training data, the error rate has to be estimated with care, at least for high-dimensional data, to avoid an overly optimistic assessment due to selection bias. With unlabelled data, assessing whether the groups identified by a data mining or cluster analytic procedure are genuine can be quite challenging, particularly when the number of variables is large. We focus on a resampling approach to this problem, applied in conjunction with factor analytic models that generate the bootstrap samples under the null hypothesis for the number of groups. The proposed methods are to be demonstrated in application to some high-dimensional data sets from the bioinformatics literature.
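The selection bias mentioned for the supervised case can be illustrated with a small sketch. It assumes a nearest-centroid classifier, a mean-difference feature ranking, and leave-one-out cross-validation. None of these choices comes from the abstract, and all function names are hypothetical. The only point is the contrast between selecting features once on all labelled data and reselecting them inside each cross-validation fold.

```python
import random
import statistics


def top_features(X, y, k):
    """Return indices of the k features with the largest gap between class means."""
    scores = []
    for j in range(len(X[0])):
        m0 = statistics.fmean([r[j] for r, lab in zip(X, y) if lab == 0])
        m1 = statistics.fmean([r[j] for r, lab in zip(X, y) if lab == 1])
        scores.append((abs(m0 - m1), j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]


def predict(X, y, feats, x_new):
    """Nearest-centroid rule restricted to the selected features."""
    dists = {}
    for label in (0, 1):
        rows = [r for r, lab in zip(X, y) if lab == label]
        dists[label] = sum(
            (x_new[j] - statistics.fmean([r[j] for r in rows])) ** 2 for j in feats
        )
    return 0 if dists[0] <= dists[1] else 1


def loo_error(X, y, k, select_inside):
    """Leave-one-out error rate, with feature selection inside or outside the loop."""
    feats_all = top_features(X, y, k)  # uses every label, including the held-out one
    wrong = 0
    for i in range(len(X)):
        X_tr, y_tr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        feats = top_features(X_tr, y_tr, k) if select_inside else feats_all
        wrong += predict(X_tr, y_tr, feats, X[i]) != y[i]
    return wrong / len(X)


def noise_demo(n=20, p=200, k=5, seed=0):
    """Pure-noise data: the labels carry no information about the features."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [i % 2 for i in range(n)]
    return loo_error(X, y, k, False), loo_error(X, y, k, True)


if __name__ == "__main__":
    biased, honest = noise_demo()
    print(f"selection outside CV: {biased:.2f}, selection inside CV: {honest:.2f}")
```

On pure-noise data the estimate with selection inside the loop should sit near 0.5, while selection outside the loop typically reports a much lower error even though no real group structure exists.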
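The resampling idea for unlabelled data can be sketched as a parametric bootstrap of the likelihood ratio statistic for one group versus two. The abstract's factor analytic mixture models are not reproduced here. As a stated simplification, the sketch swaps in a univariate normal mixture with common variance fitted by EM. The function names, the starting values for EM, and the (exceedances + 1) / (B + 1) P-value convention are all assumptions made for illustration.

```python
import math
import random
import statistics


def normal_logpdf(x, mu, sd):
    return -0.5 * math.log(2 * math.pi) - math.log(sd) - 0.5 * ((x - mu) / sd) ** 2


def fit_one(data):
    """Maximum likelihood fit of a single normal: returns (loglik, mean, sd)."""
    mu, sd = statistics.fmean(data), statistics.pstdev(data)
    return sum(normal_logpdf(x, mu, sd) for x in data), mu, sd


def log_parts(x, pi1, mu1, mu2, sd):
    """Log of component 1's weighted density and log of the mixture density."""
    la = math.log(pi1) + normal_logpdf(x, mu1, sd)
    lb = math.log(1 - pi1) + normal_logpdf(x, mu2, sd)
    m = max(la, lb)
    return la, m + math.log(math.exp(la - m) + math.exp(lb - m))


def fit_two(data, n_iter=100):
    """EM for a two-component normal mixture with common variance: returns loglik."""
    srt, n = sorted(data), len(data)
    mu1, mu2 = statistics.fmean(srt[: n // 2]), statistics.fmean(srt[n // 2:])
    pi1, sd = 0.5, statistics.pstdev(data)
    for _ in range(n_iter):
        # E-step: posterior probability that each point came from component 1.
        resp = [math.exp(la - lmix)
                for la, lmix in (log_parts(x, pi1, mu1, mu2, sd) for x in data)]
        n1 = sum(resp)
        if n1 < 1e-8 or n - n1 < 1e-8:
            break  # one component has emptied; keep the last valid estimates
        # M-step: weighted means, pooled variance, mixing proportion.
        mu1 = sum(r * x for r, x in zip(resp, data)) / n1
        mu2 = sum((1 - r) * x for r, x in zip(resp, data)) / (n - n1)
        var = sum(r * (x - mu1) ** 2 + (1 - r) * (x - mu2) ** 2
                  for r, x in zip(resp, data)) / n
        pi1, sd = n1 / n, max(math.sqrt(var), 1e-8)
    return sum(log_parts(x, pi1, mu1, mu2, sd)[1] for x in data)


def lrt_stat(data):
    """-2 log(lambda) for one versus two components, clipped at zero."""
    return max(0.0, 2 * (fit_two(data) - fit_one(data)[0]))


def bootstrap_pvalue(data, n_boot=19, seed=0):
    """Parametric bootstrap P-value for H0: one group versus H1: two groups."""
    rng = random.Random(seed)
    _, mu, sd = fit_one(data)
    observed = lrt_stat(data)
    exceed = 0
    for _ in range(n_boot):
        sample = [rng.gauss(mu, sd) for _ in data]  # drawn from the fitted null model
        exceed += lrt_stat(sample) >= observed
    return observed, (exceed + 1) / (n_boot + 1)


if __name__ == "__main__":
    rng = random.Random(1)
    data = [rng.gauss(0, 1) for _ in range(40)] + [rng.gauss(8, 1) for _ in range(40)]
    stat, p = bootstrap_pvalue(data)
    print(f"-2 log lambda = {stat:.1f}, bootstrap P-value = {p:.2f}")
```

With `n_boot=19` the smallest attainable P-value is 1/20, so a larger number of replications would be needed for a finer assessment. The bootstrap is used because the usual chi-squared reference distribution for the likelihood ratio statistic does not hold when testing for the number of mixture components.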