Complexity analysis and practical resolution of the data classification problem with private characteristics

IF 4.6 | Zone 2, Computer Science | Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Complex & Intelligent Systems Pub Date : 2025-05-08 DOI:10.1007/s40747-025-01911-y

David Pantoja, Ismael Rodríguez, Fernando Rubio, Clara Segura


Citations: 0

Abstract

In this work we analyze the following problem: given the probability distribution of a population, question an unknown individual that is representative of the distribution so that our uncertainty about certain characteristics is significantly reduced, while our uncertainty about others, deemed private or sensitive, is not. The goal is thus to extract information relevant to a legitimate purpose while preserving the privacy of individuals, which is crucial for enabling non-intrusive selection processes in several areas. For instance, it is essential in the design of non-discriminatory personnel selection, promotion, and layoff processes in companies and institutions; in the retrieval of customer information that is relevant to the service provided by a company (and no more); and in certifications that do not reveal sensitive industrial information irrelevant to the certification itself, among others. Interactive questioning processes are constructed for this purpose, which requires generalizing the notion of decision trees to account for the amount of desired and undesired information retrieved in each branch of the plan. Our findings are both theoretical and practical: on the one hand, we prove the problem's NP-completeness by a reduction from the Set Cover problem; on the other hand, given this intractability, we provide heuristic methods that find reasonable solutions in affordable time. In particular, a greedy algorithm and two genetic algorithms are presented. Our experiments indicate that the best results are obtained using a genetic algorithm reinforced with a greedy strategy.
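To make the idea of trading desired against undesired information concrete, the following is a minimal sketch of one greedy selection step. It is not the algorithm from the paper: the scoring rule (information gain on the target minus a weighted gain on the private attribute), the `penalty` weight, the function names, and the toy population are all assumptions made for illustration, and each row of the population is assumed equally likely.

```python
from collections import Counter
from math import log2

def entropy(rows, attr):
    """Shannon entropy (in bits) of `attr` over equally likely rows."""
    counts = Counter(row[attr] for row in rows)
    total = len(rows)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def expected_entropy_after(rows, question, attr):
    """Expected entropy of `attr` once the answer to `question` is known."""
    groups = {}
    for row in rows:
        groups.setdefault(row[question], []).append(row)
    return sum(len(g) / len(rows) * entropy(g, attr) for g in groups.values())

def greedy_pick(rows, questions, target, private, penalty=1.0):
    """Hypothetical rule: maximize target gain minus penalty * private gain."""
    def score(q):
        gain_target = entropy(rows, target) - expected_entropy_after(rows, q, target)
        gain_private = entropy(rows, private) - expected_entropy_after(rows, q, private)
        return gain_target - penalty * gain_private
    return max(questions, key=score)

# Toy population (invented): "q1" and "q2" are questions that can be asked,
# "skill" is the characteristic we want to learn, "health" is private.
POPULATION = [
    {"q1": "yes", "q2": "yes", "skill": "high", "health": "a"},
    {"q1": "yes", "q2": "no",  "skill": "high", "health": "b"},
    {"q1": "no",  "q2": "yes", "skill": "low",  "health": "a"},
    {"q1": "no",  "q2": "no",  "skill": "low",  "health": "b"},
]

best = greedy_pick(POPULATION, ["q1", "q2"], target="skill", private="health")
print(best)
```

In this toy population, `q1` fully determines `skill` and reveals nothing about `health`, while `q2` does the opposite, so the sketch selects `q1`. A full questioning plan would repeat such a step on each answer branch, which is where the generalized decision tree described in the abstract comes in.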


Source journal

Complex & Intelligent Systems (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)

CiteScore: 9.60

Self-citation rate: 10.30%

Articles published: 297

About the journal: Complex & Intelligent Systems aims to provide a forum for presenting and discussing novel approaches, tools and techniques meant for attaining a cross-fertilization between the broad fields of complex systems, computational simulation, and intelligent analytics and visualization. The transdisciplinary research that the journal focuses on will expand the boundaries of our understanding by investigating the principles and processes that underlie many of the most profound problems facing society today.