Enhancing Similarity Measures for Binary Data in Clustering: The Role of Rare Events and Matching Absences

IF 2.1 | Zone 4, Chemistry | Q1 SOCIAL WORK
Tânia F. G. G. Cova, Alberto A. C. C. Pais
Journal of Chemometrics, volume 39, issue 9. Journal Article. Published 2025-09-04.
DOI: 10.1002/cem.70061
https://analyticalsciencejournals.onlinelibrary.wiley.com/doi/10.1002/cem.70061
Citations: 0

Abstract

Clustering of binary data is central to various applications, particularly in the fields of medical diagnostics, chemistry, and chemoinformatics. However, standard similarity measures often fail to capture the informative value of rare features and matching absences, treating all attributes as equally relevant. This can lead to suboptimal clustering, especially when informative patterns are hidden in low-frequency features. This study proposes a probability-weighted approach to measuring similarity, which gives more weight to rare features and accounts for the value of shared absences based on their occurrence probabilities. We analyze how this adjustment impacts clustering results, using visual comparisons and experiments on real datasets. The results show consistent gains in clustering precision and stability compared to standard measures. Our findings suggest that incorporating the rarity of features into similarity computation can offer a more reliable basis for clustering binary data, especially in domains where rare signals carry meaningful information.
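
The abstract does not give the exact weighting formula, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes an information-content weighting: a shared presence of feature j counts as -log(p_j) and a shared absence as -log(1 - p_j), where p_j is the feature's occurrence probability. The function names, the mismatch cost, and the example probabilities are all hypothetical choices made for illustration.

```python
import math


def feature_probabilities(rows):
    """Estimate P(feature = 1) per column, with add-one smoothing."""
    n = len(rows)
    n_features = len(rows[0])
    # Smoothing keeps every probability strictly between 0 and 1,
    # so both logarithms used below stay finite.
    return [(sum(row[j] for row in rows) + 1) / (n + 2) for j in range(n_features)]


def simple_matching(x, y):
    """Unweighted baseline: fraction of positions where x and y agree."""
    return sum(1 for a, b in zip(x, y) if a == b) / len(x)


def weighted_similarity(x, y, probs):
    """Illustrative probability-weighted similarity in [0, 1] (assumed form)."""
    matched = 0.0
    total = 0.0
    for a, b, p in zip(x, y, probs):
        w_presence = -math.log(p)        # large when a 1 is rare
        w_absence = -math.log(1.0 - p)   # large when a 0 is rare
        if a == 1 and b == 1:
            weight = w_presence
            matched += weight
        elif a == 0 and b == 0:
            weight = w_absence
            matched += weight
        else:
            # Assumed mismatch cost: mean of the two match weights.
            weight = 0.5 * (w_presence + w_absence)
        total += weight
    return matched / total


# Hypothetical frequencies: feature 0 is common, feature 1 is rare.
probs = [0.9, 0.1, 0.5]
pair_common = ([1, 0, 1], [1, 0, 0])  # agree on the expected values
pair_rare = ([0, 1, 1], [0, 1, 0])    # agree on the unexpected values

for x, y in (pair_common, pair_rare):
    print(simple_matching(x, y), round(weighted_similarity(x, y, probs), 3))
```

Both pairs agree on two of three positions, so the simple matching coefficient cannot tell them apart. Under the assumed weighting, the pair that shares a rare presence and a rare absence scores higher than the pair that agrees only on the expected values.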

Source journal
Journal of Chemometrics (Chemistry, Analytical Chemistry)
CiteScore: 5.20
Self-citation rate: 8.30%
Articles published: 78
Review time: 2 months
About the journal: The Journal of Chemometrics is devoted to the rapid publication of original scientific papers, reviews and short communications on fundamental and applied aspects of chemometrics. It also provides a forum for the exchange of information on meetings and other news relevant to the growing community of scientists who are interested in chemometrics and its applications. Short, critical review papers are a particularly important feature of the journal, in view of the multidisciplinary readership at which it is aimed.