Enhancing Similarity Measures for Binary Data in Clustering: The Role of Rare Events and Matching Absences

IF 2.1 · Zone 4 (Chemistry) · Q1 SOCIAL WORK

Journal of Chemometrics, Vol. 39, No. 9 · Pub Date: 2025-09-04 · DOI: 10.1002/cem.70061

Tânia F. G. G. Cova, Alberto A. C. C. Pais


Citations: 0

Abstract

Clustering of binary data is central to various applications, particularly in the fields of medical diagnostics, chemistry, and chemoinformatics. However, standard similarity measures often fail to capture the informative value of rare features and matching absences, treating all attributes as equally relevant. This can lead to suboptimal clustering, especially when informative patterns are hidden in low-frequency features. This study proposes a probability-weighted approach to measuring similarity, which gives more weight to rare features and accounts for the value of shared absences based on their occurrence probabilities. We analyze how this adjustment impacts clustering results, using visual comparisons and experiments on real datasets. The results show consistent gains in clustering precision and stability compared to standard measures. Our findings suggest that incorporating the rarity of features into similarity computation can offer a more reliable basis for clustering binary data, especially in domains where rare signals carry meaningful information.
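The abstract does not state the formula, so the following is only a minimal sketch of what a probability-weighted similarity for binary data could look like. Every specific choice here is an assumption made for illustration and not the authors' definition: the weights `-log(p)` for a shared presence and `-log(1 - p)` for a shared absence, the averaged mismatch penalty, the Laplace smoothing of the feature frequencies, and the function names `feature_probabilities`, `weighted_similarity`, and `simple_matching`. The unweighted simple matching coefficient is included as a baseline that treats all attributes as equally relevant.

```python
import numpy as np


def feature_probabilities(X):
    """Laplace-smoothed frequency of 1s per column, so no p is exactly 0 or 1."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    return (X.sum(axis=0) + 1.0) / (n + 2.0)


def simple_matching(X):
    """Baseline: fraction of features on which two rows agree, all weighted equally."""
    X = np.asarray(X, dtype=float)
    m = X.shape[1]
    return (X @ X.T + (1 - X) @ (1 - X).T) / m


def weighted_similarity(X):
    """Illustrative probability-weighted similarity (assumed form, not the paper's)."""
    X = np.asarray(X, dtype=float)
    p = feature_probabilities(X)
    w_pres = -np.log(p)             # shared 1s count more when the feature is rare
    w_abs = -np.log(1.0 - p)        # shared 0s count more when the feature is common
    w_mis = 0.5 * (w_pres + w_abs)  # assumed penalty for a disagreement

    both_present = (X * w_pres) @ X.T
    both_absent = ((1 - X) * w_abs) @ (1 - X).T
    mismatch = (X * w_mis) @ (1 - X).T + ((1 - X) * w_mis) @ X.T

    agree = both_present + both_absent
    return agree / (agree + mismatch)


# Toy data: column 0 is rare, column 1 is present everywhere, column 2 is balanced.
X = np.array([
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 0],
    [0, 1, 1],
    [0, 1, 0],
    [0, 1, 1],
])

sm = simple_matching(X)
ws = weighted_similarity(X)

# Rows 0 and 1 share the rare feature; rows 2 and 3 only share its absence.
print("simple matching:", sm[0, 1], sm[2, 3])
print("weighted:       ", ws[0, 1], ws[2, 3])
```

In this toy example both pairs agree on two of three features, so the baseline cannot tell them apart, while the weighted score ranks the pair that shares the rare feature higher. The resulting matrix could be turned into a distance (for example `1 - ws`) and passed to any clustering routine that accepts precomputed distances.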

Source journal

Journal of Chemometrics (Chemistry: Analytical Chemistry)

CiteScore: 5.20

Self-citation rate: 8.30%

Articles published: (not listed)

Review time: 2 months

About the journal: The Journal of Chemometrics is devoted to the rapid publication of original scientific papers, reviews and short communications on fundamental and applied aspects of chemometrics. It also provides a forum for the exchange of information on meetings and other news relevant to the growing community of scientists who are interested in chemometrics and its applications. Short, critical review papers are a particularly important feature of the journal, in view of the multidisciplinary readership at which it is aimed.