Authors: Ahmed Najjar, Christian Gagné, D. Reinharz
Venue: 2014 IEEE Symposium on Computational Intelligence in Healthcare and e-health (CICARE)
Published: 2014-12-01
DOI: 10.1109/CICARE.2014.7007849 (https://doi.org/10.1109/CICARE.2014.7007849)
A novel mixed values k-prototypes algorithm with application to health care databases mining
The current availability of large datasets composed of heterogeneous objects stresses the importance of large-scale clustering of mixed, complex items. Several algorithms have been developed for mixed datasets composed of numerical and categorical variables, a well-known one being k-prototypes. This algorithm is efficient for clustering large datasets given its linear complexity. However, many fields handle more complex data, for example variable-size sets of categorical values mixed with numerical and categorical values, which the k-prototypes algorithm cannot process as is. We propose a variation of the k-prototypes clustering algorithm that handles these complex entities by using a bag-of-words representation for the multivalued categorical variables. We evaluate our approach on a real-world application, the clustering of administrative health care databases in Quebec, with results illustrating the good performance of our method.
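To make the idea concrete, here is a minimal Python sketch of how a variable-size set of categorical values could be turned into a bag-of-words vector and folded into a k-prototypes-style dissimilarity. The numeric and categorical terms follow the standard k-prototypes form (squared Euclidean distance plus a weighted count of mismatches). The third term, its weight `beta`, the squared Euclidean distance on count vectors, the record layout, and the function names are all assumptions made for illustration; the abstract does not specify the measure the authors actually use.

```python
from collections import Counter


def bag_of_words(items, vocabulary):
    """Map a variable-size collection of categorical values to a
    fixed-length count vector over `vocabulary`."""
    counts = Counter(items)
    return [counts[v] for v in vocabulary]


def mixed_dissimilarity(record, prototype, gamma=1.0, beta=1.0):
    """Dissimilarity between a record and a cluster prototype.

    Both arguments are dicts with three keys (hypothetical layout):
      "num": list of numeric values
      "cat": list of single-valued categorical attributes
      "bow": bag-of-words vector for the multivalued attribute
    """
    # Standard k-prototypes numeric term: squared Euclidean distance.
    num = sum((a - b) ** 2 for a, b in zip(record["num"], prototype["num"]))
    # Standard k-prototypes categorical term: number of mismatches.
    cat = sum(a != b for a, b in zip(record["cat"], prototype["cat"]))
    # Assumed term for the multivalued attribute: squared Euclidean
    # distance between bag-of-words vectors (not taken from the paper).
    bow = sum((a - b) ** 2 for a, b in zip(record["bow"], prototype["bow"]))
    return num + gamma * cat + beta * bow


# Hypothetical example: three possible codes, a record carrying two of them.
vocabulary = ["A", "B", "C"]
record = {
    "num": [1.0, 2.0],
    "cat": ["F", "urban"],
    "bow": bag_of_words(["A", "C"], vocabulary),
}
# A prototype whose bag-of-words part is a per-code mean over its cluster.
prototype = {
    "num": [1.0, 4.0],
    "cat": ["F", "rural"],
    "bow": [0.5, 0.0, 1.0],
}
distance = mixed_dissimilarity(record, prototype, gamma=0.5, beta=2.0)
```

In this sketch the bag-of-words part of a prototype is updated like a numeric attribute (a per-code mean over cluster members), which would keep each iteration linear in the number of records. Whether the paper updates prototypes this way is not stated in the abstract.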