David Pantoja, Ismael Rodríguez, Fernando Rubio, Clara Segura
{"title":"Complexity analysis and practical resolution of the data classification problem with private characteristics","authors":"David Pantoja, Ismael Rodríguez, Fernando Rubio, Clara Segura","doi":"10.1007/s40747-025-01911-y","DOIUrl":null,"url":null,"abstract":"<p>In this work we analyze the problem of, given the probability distribution of a population, questioning an unknown individual that is representative of the distribution so that our uncertainty about certain characteristics is significantly reduced—but the uncertainty about others, deemed private or sensitive, is not. Thus, the goal of the problem is extracting information being relevant to a legitimate purpose while preserving the privacy of individuals, which is crucial to enable non-intrusive selection processes in several areas. For instance, it is essential in the design of non-discriminatory personnel selection, promotion, and layoff processes in companies and institutions; in the retrieval of customer information being relevant to the service provided by a company (and no more); in certifications not revealing sensitive industrial information being irrelevant for the certification itself; etc. Interactive questioning processes are constructed for this purpose, which requires generalizing the notion of <i>decision trees</i> to account the amount of desired and undesired information retrieved for each branch of the plan. Our findings about this problem are both theoretical and practical: on the one hand, we prove its NP-completeness by a reduction from the Set Cover problem; and on the other hand, given this intractability, we provide heuristic solutions to find reasonable solutions in affordable time. In particular, a greedy algorithm and two genetic algorithms are presented. Our experiments indicate that the best results are obtained using a genetic algorithm reinforced with a greedy strategy.</p>","PeriodicalId":10524,"journal":{"name":"Complex & Intelligent Systems","volume":"35 1","pages":""},"PeriodicalIF":5.0000,"publicationDate":"2025-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Complex & Intelligent Systems","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s40747-025-01911-y","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0
Abstract
In this work we analyze the problem of, given the probability distribution of a population, questioning an unknown individual that is representative of the distribution so that our uncertainty about certain characteristics is significantly reduced—but the uncertainty about others, deemed private or sensitive, is not. Thus, the goal of the problem is extracting information being relevant to a legitimate purpose while preserving the privacy of individuals, which is crucial to enable non-intrusive selection processes in several areas. For instance, it is essential in the design of non-discriminatory personnel selection, promotion, and layoff processes in companies and institutions; in the retrieval of customer information being relevant to the service provided by a company (and no more); in certifications not revealing sensitive industrial information being irrelevant for the certification itself; etc. Interactive questioning processes are constructed for this purpose, which requires generalizing the notion of decision trees to account the amount of desired and undesired information retrieved for each branch of the plan. Our findings about this problem are both theoretical and practical: on the one hand, we prove its NP-completeness by a reduction from the Set Cover problem; and on the other hand, given this intractability, we provide heuristic solutions to find reasonable solutions in affordable time. In particular, a greedy algorithm and two genetic algorithms are presented. Our experiments indicate that the best results are obtained using a genetic algorithm reinforced with a greedy strategy.
期刊介绍:
Complex & Intelligent Systems aims to provide a forum for presenting and discussing novel approaches, tools and techniques meant for attaining a cross-fertilization between the broad fields of complex systems, computational simulation, and intelligent analytics and visualization. The transdisciplinary research that the journal focuses on will expand the boundaries of our understanding by investigating the principles and processes that underlie many of the most profound problems facing society today.