Optimal clustering from noisy binary feedback

IF 2.9 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Machine Learning Pub Date : 2024-03-22 DOI:10.1007/s10994-024-06532-z

{"title":"Optimal clustering from noisy binary feedback","authors":"","doi":"10.1007/s10994-024-06532-z","DOIUrl":null,"url":null,"abstract":"<h3>Abstract</h3> <p>We study the problem of clustering a set of items from binary user feedback. Such a problem arises in crowdsourcing platforms solving large-scale labeling tasks with minimal effort put on the users. For example, in some of the recent reCAPTCHA systems, users clicks (binary answers) can be used to efficiently label images. In our inference problem, items are grouped into initially unknown non-overlapping clusters. To recover these clusters, the learner sequentially presents to users a finite list of items together with a question with a binary answer selected from a fixed finite set. For each of these items, the user provides a noisy answer whose expectation is determined by the item cluster and the question and by an item-specific parameter characterizing the <em>hardness</em> of classifying the item. The objective is to devise an algorithm with a minimal cluster recovery error rate. We derive problem-specific information-theoretical lower bounds on the error rate satisfied by any algorithm, for both uniform and adaptive (list, question) selection strategies. For uniform selection, we present a simple algorithm built upon the K-means algorithm and whose performance almost matches the fundamental limits. For adaptive selection, we develop an adaptive algorithm that is inspired by the derivation of the information-theoretical error lower bounds, and in turn allocates the budget in an efficient way. The algorithm learns to select items hard to cluster and relevant questions more often. We compare the performance of our algorithms with or without the adaptive selection strategy numerically and illustrate the gain achieved by being adaptive.</p>","PeriodicalId":49900,"journal":{"name":"Machine Learning","volume":"24 1","pages":""},"PeriodicalIF":2.9000,"publicationDate":"2024-03-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Machine Learning","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s10994-024-06532-z","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

Abstract

We study the problem of clustering a set of items from binary user feedback. Such a problem arises in crowdsourcing platforms solving large-scale labeling tasks with minimal effort put on the users. For example, in some of the recent reCAPTCHA systems, users clicks (binary answers) can be used to efficiently label images. In our inference problem, items are grouped into initially unknown non-overlapping clusters. To recover these clusters, the learner sequentially presents to users a finite list of items together with a question with a binary answer selected from a fixed finite set. For each of these items, the user provides a noisy answer whose expectation is determined by the item cluster and the question and by an item-specific parameter characterizing the hardness of classifying the item. The objective is to devise an algorithm with a minimal cluster recovery error rate. We derive problem-specific information-theoretical lower bounds on the error rate satisfied by any algorithm, for both uniform and adaptive (list, question) selection strategies. For uniform selection, we present a simple algorithm built upon the K-means algorithm and whose performance almost matches the fundamental limits. For adaptive selection, we develop an adaptive algorithm that is inspired by the derivation of the information-theoretical error lower bounds, and in turn allocates the budget in an efficient way. The algorithm learns to select items hard to cluster and relevant questions more often. We compare the performance of our algorithms with or without the adaptive selection strategy numerically and illustrate the gain achieved by being adaptive.

查看原文本刊更多论文

从噪声二进制反馈中优化聚类

摘要我们研究了根据二进制用户反馈对一组项目进行聚类的问题。这种问题出现在众包平台上，用户只需付出最小的努力就能解决大规模的标记任务。例如，在最近的一些 reCAPTCHA 系统中，用户的点击（二进制答案）可以用来有效地标记图像。在我们的推理问题中，项目被归入最初未知的非重叠群组。为了恢复这些群集，学习者会依次向用户展示一个有限的项目列表，以及一个从固定的有限集合中选出的带有二进制答案的问题。对于每个项目，用户都会提供一个噪声答案，其期望值由项目群和问题以及表征项目分类难易程度的特定项目参数决定。我们的目标是设计一种具有最小群组恢复错误率的算法。我们针对统一和自适应（列表、问题）选择策略，推导出了任何算法所满足的错误率的特定问题信息理论下限。对于统一选择，我们提出了一种建立在 K-means 算法基础上的简单算法，其性能几乎与基本限制相匹配。对于自适应选择，我们开发了一种自适应算法，该算法受到信息论误差下限推导的启发，进而以一种有效的方式分配预算。该算法学会更频繁地选择难以聚类的项目和相关问题。我们用数字比较了有无自适应选择策略的算法性能，并说明了自适应所带来的收益。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Machine Learning 工程技术-计算机：人工智能

CiteScore

11.00

自引率

2.70%

发文量

162

审稿时长

3 months

期刊介绍： Machine Learning serves as a global platform dedicated to computational approaches in learning. The journal reports substantial findings on diverse learning methods applied to various problems, offering support through empirical studies, theoretical analysis, or connections to psychological phenomena. It demonstrates the application of learning methods to solve significant problems and aims to enhance the conduct of machine learning research with a focus on verifiable and replicable evidence in published papers.