A novel mixed values k-prototypes algorithm with application to health care databases mining

Ahmed Najjar, Christian Gagné, D. Reinharz
{"title":"A novel mixed values k-prototypes algorithm with application to health care databases mining","authors":"Ahmed Najjar, Christian Gagné, D. Reinharz","doi":"10.1109/CICARE.2014.7007849","DOIUrl":null,"url":null,"abstract":"The current availability of large datasets composed of heterogeneous objects stresses the importance of large-scale clustering of mixed complex items. Several algorithms have been developed for mixed datasets composed of numerical and categorical variables, a well-known algorithm being the k-prototypes. This algorithm is efficient for clustering large datasets given its linear complexity. However, many fields are handling more complex data, for example variable-size sets of categorical values mixed with numerical and categorical values, which cannot be processed as is by the k-prototypes algorithm. We are proposing a variation of the k-prototypes clustering algorithm that can handle these complex entities, by using a bag-of-words representation for the multivalued categorical variables. We evaluate our approach on a real-world application to the clustering of administrative health care databases in Quebec, with results illustrating the good performances of our method.","PeriodicalId":120730,"journal":{"name":"2014 IEEE Symposium on Computational Intelligence in Healthcare and e-health (CICARE)","volume":"47 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 IEEE Symposium on Computational Intelligence in Healthcare and e-health (CICARE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CICARE.2014.7007849","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 5

Abstract

The current availability of large datasets composed of heterogeneous objects stresses the importance of large-scale clustering of mixed complex items. Several algorithms have been developed for mixed datasets composed of numerical and categorical variables, a well-known algorithm being the k-prototypes. This algorithm is efficient for clustering large datasets given its linear complexity. However, many fields are handling more complex data, for example variable-size sets of categorical values mixed with numerical and categorical values, which cannot be processed as is by the k-prototypes algorithm. We are proposing a variation of the k-prototypes clustering algorithm that can handle these complex entities, by using a bag-of-words representation for the multivalued categorical variables. We evaluate our approach on a real-world application to the clustering of administrative health care databases in Quebec, with results illustrating the good performances of our method.
一种新的混合值k-原型算法及其在医疗保健数据库挖掘中的应用
目前由异构对象组成的大型数据集的可用性强调了混合复杂项目的大规模聚类的重要性。对于由数值变量和分类变量组成的混合数据集,已经开发了几种算法,其中一个著名的算法是k-原型。考虑到大数据集的线性复杂度,该算法对大数据集的聚类是有效的。然而,许多字段正在处理更复杂的数据,例如混合了数值和分类值的可变大小的分类值集,这不能像k-prototype算法那样处理。我们提出了k-原型聚类算法的一种变体,通过使用词袋表示多值分类变量,可以处理这些复杂的实体。我们在魁北克省行政卫生保健数据库聚类的实际应用中评估了我们的方法,结果表明我们的方法具有良好的性能。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信