Double-Constrained Consensus Clustering with Application to Online Anti-Counterfeiting

IF 2.5 | Zone 4, Multidisciplinary Journals | Q2 CHEMISTRY, MULTIDISCIPLINARY

Applied Sciences-Basel Pub Date : 2023-09-06 DOI:10.3390/app131810050

Claudio Carpineto, Giovanni Romano

Citations: 0

Abstract

Semi-supervised consensus clustering is a promising strategy to compensate for the subjectivity of clustering and its sensitivity to design factors, and various techniques have recently been proposed to integrate domain knowledge and multiple clustering partitions. In this article, we present a new approach that makes double use of domain knowledge: to build the initial partitions and to combine them. In particular, we show how to model and integrate must-link and cannot-link constraints into the objective function of a generic consensus clustering (CC) framework that maximizes the similarity between the consensus partition and the input partitions, which have, in turn, been enriched with the same constraints. In addition, borrowing from the theory of functional dependencies, the integrated framework exploits the notions of deductive closure and minimal cover to take full advantage of the logical implication between constraints. Using standard UCI benchmarks, we found that the resulting algorithm, termed CCC (double-constrained consensus clustering), was more effective than plain CC at combining base-constrained partitions, with an average performance improvement of 5.54%. We then argue that CCC is especially well-suited for profiling counterfeit e-commerce websites, as constraints can be acquired by leveraging specific domain features, and demonstrate its potential for detecting affiliate marketing programs. Taken together, our experiments suggest that CCC makes the process of clustering more robust and able to withstand changes in clustering algorithms, datasets, and features, with a remarkable improvement in average performance.
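To make the notion of deductive closure concrete, the following is a minimal Python sketch of the standard inference rules for pairwise constraints: must-link is transitive, and a cannot-link between two items extends to everything must-linked to either of them. This is not the authors' implementation; the function name `constraint_closure` and the pair-based input format are assumptions made for illustration, and the abstract does not specify how the article computes the closure or the minimal cover.

```python
from itertools import combinations, product


def constraint_closure(must_link, cannot_link):
    """Return (ml_closure, cl_closure) as sets of frozenset pairs.

    Hypothetical helper: inputs are iterables of (item, item) pairs.
    Raises ValueError if a cannot-link contradicts the must-links.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Must-link is an equivalence relation: merge linked items.
    for a, b in must_link:
        parent[find(a)] = find(b)

    # Register items that appear only in cannot-link pairs.
    for a, b in cannot_link:
        find(a)
        find(b)

    # Group items by the representative of their must-link component.
    groups = {}
    for item in list(parent):
        groups.setdefault(find(item), set()).add(item)

    # Every pair inside a component is an implied must-link.
    ml_closure = {
        frozenset(pair)
        for members in groups.values()
        for pair in combinations(members, 2)
    }

    # A cannot-link between two components holds for all cross pairs.
    cl_closure = set()
    for a, b in cannot_link:
        ra, rb = find(a), find(b)
        if ra == rb:
            raise ValueError(f"inconsistent constraints on {a!r} and {b!r}")
        cl_closure |= {
            frozenset((x, y)) for x, y in product(groups[ra], groups[rb])
        }
    return ml_closure, cl_closure


# Usage: a-b and b-c must-link; c-d cannot-link.
ml, cl = constraint_closure([("a", "b"), ("b", "c")], [("c", "d")])
```

In this example the closure adds the implied must-link between `a` and `c`, and the implied cannot-links between `d` and each of `a` and `b`. A minimal cover would go in the opposite direction, keeping only a smallest set of constraints from which the same closure can be derived.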

Source journal

Applied Sciences-Basel: CHEMISTRY, MULTIDISCIPLINARY; MATERIALS SCIENCE, MULTIDISCIPLINARY

CiteScore: 5.30

Self-citation rate: 11.10%

Articles published: 10882

Journal description: Applied Sciences (ISSN 2076-3417) provides an advanced forum on all aspects of applied natural sciences. It publishes reviews, research papers and communications. Our aim is to encourage scientists to publish their experimental and theoretical results in as much detail as possible. There is no restriction on the length of the papers. The full experimental details must be provided so that the results can be reproduced. Electronic files and software regarding the full details of the calculation or experimental procedure, if unable to be published in a normal way, can be deposited as supplementary electronic material.