FAST: Feature Aware Similarity Thresholding for Weak Unlearning in Black-Box Generative Models

IEEE Transactions on Artificial Intelligence. Pub Date: 2024-11-15. DOI: 10.1109/TAI.2024.3499939

Subhodip Panda;A.P. Prathosh

Vol. 6, No. 4, pp. 885-895. https://ieeexplore.ieee.org/document/10754629/


Abstract

The heightened emphasis on regulating deep generative models, driven by escalating concerns about privacy and compliance with regulatory frameworks, underscores the need for precise control mechanisms over these models. This urgency is particularly evident when generative models produce outputs containing objectionable, offensive, or potentially injurious content. In response, machine unlearning has emerged to selectively forget specific knowledge or remove the influence of undesirable data subsets from pretrained models. However, modern machine unlearning approaches typically assume access to model parameters and architectural details during unlearning, which is not always feasible. In a multitude of downstream tasks, these models function as black-box systems with inaccessible pretrained parameters, architectures, and training data. In such scenarios, filtering undesired outputs becomes a practical alternative. Our proposed method, feature aware similarity thresholding (FAST), effectively suppresses undesired outputs by systematically encoding the representation of unwanted features in the latent space. We employ user-marked positive and negative samples to guide this process, leveraging the latent space's inherent capacity to capture these undesired representations. During inference, we use this identified representation in the latent space to compute projection similarity metrics with newly sampled latent vectors. Subsequently, we apply a threshold to exclude undesirable samples from the output.
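The filtering idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the undesired-feature representation is encoded, so the code below assumes, purely for illustration, that it is the unit vector from the mean of the positive latents to the mean of the negative latents, and that "projection similarity" is a scalar projection onto that vector. The function names, the synthetic Gaussian latents, and the threshold value are all hypothetical.

```python
import numpy as np


def feature_direction(pos_latents, neg_latents):
    """Assumed encoding: unit vector from the positive mean to the negative mean."""
    direction = neg_latents.mean(axis=0) - pos_latents.mean(axis=0)
    norm = np.linalg.norm(direction)
    if norm == 0.0:
        raise ValueError("positive and negative means coincide; no direction to estimate")
    return direction / norm


def projection_similarity(latents, direction, origin):
    """Scalar projection of each latent, measured from `origin`, onto `direction`."""
    return (latents - origin) @ direction


def filter_latents(latents, direction, origin, threshold):
    """Keep only latents whose projection onto the undesired direction is below `threshold`."""
    scores = projection_similarity(latents, direction, origin)
    keep = scores < threshold
    return latents[keep], keep


# Synthetic stand-in for user-marked samples: negatives are shifted along axis 0.
rng = np.random.default_rng(0)
dim = 8
undesired_axis = np.zeros(dim)
undesired_axis[0] = 1.0
pos = rng.normal(size=(200, dim))                          # marked as acceptable
neg = rng.normal(size=(200, dim)) + 3.0 * undesired_axis   # marked as undesired

direction = feature_direction(pos, neg)
origin = pos.mean(axis=0)
threshold = 1.5  # hypothetical value; in practice it would be tuned on marked samples

# At inference, newly sampled latents are screened before being passed to the generator.
new_latents = rng.normal(size=(1000, dim))
kept, keep_mask = filter_latents(new_latents, direction, origin, threshold)
print(f"kept {kept.shape[0]} of {new_latents.shape[0]} sampled latents")
```

Because only the sampled latent vectors are screened, a filter of this shape needs no access to the generator's parameters, architecture, or training data, which matches the black-box setting the abstract describes.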


Source journal

IEEE transactions on artificial intelligence

CiteScore

7.70

Self-citation rate

0.00%

Publication volume