Privacy preserving large language models: ChatGPT case study based vision and framework

IET Blockchain · Pub Date: 2024-11-17 · DOI: 10.1049/blc2.12091
Imdad Ullah, Najm Hassan, Sukhpal Singh Gill, Basem Suleiman, Tariq Ahamed Ahanger, Zawar Shah, Junaid Qadir, Salil S. Kanhere
{"title":"Privacy preserving large language models: ChatGPT case study based vision and framework","authors":"Imdad Ullah,&nbsp;Najm Hassan,&nbsp;Sukhpal Singh Gill,&nbsp;Basem Suleiman,&nbsp;Tariq Ahamed Ahanger,&nbsp;Zawar Shah,&nbsp;Junaid Qadir,&nbsp;Salil S. Kanhere","doi":"10.1049/blc2.12091","DOIUrl":null,"url":null,"abstract":"<p>The generative Artificial Intelligence (AI) tools based on Large Language Models (LLMs) use billions of parameters to extensively analyse large datasets and extract critical information such as context, specific details, identifying information, use this information in the training process, and generate responses for the requested queries. The extracted data also contain sensitive information, seriously threatening user privacy and reluctance to use such tools. This article proposes the conceptual model called PrivChatGPT, a privacy-preserving model for LLMs consisting of two main components, that is, preserving user privacy during the data curation/pre-processing and preserving private context and the private training process for large-scale data. To demonstrate the applicability of PrivChatGPT, it is shown how a private mechanism could be integrated into the existing model for training LLMs to protect user privacy; specifically, differential privacy and private training using Reinforcement Learning (RL) were employed. The privacy level probabilities are associated with the document contents, including the private contextual information, and with metadata, which is used to evaluate the disclosure probability loss for an individual's private information. The privacy loss is measured and the measure of uncertainty or randomness is evaluated using entropy once differential privacy is applied. It recursively evaluates the level of privacy guarantees and the uncertainty of public databases and resources during each update when new information is added for training purposes. To critically evaluate the use of differential privacy for private LLMs, other mechanisms were hypothetically compared such as Blockchain, private information retrieval, randomisation, obfuscation, anonymisation, and the use of Tor for various performance measures such as the model performance and accuracy, computational complexity, privacy vs. utility, training latency, vulnerability to attacks, and resource consumption. It is concluded that differential privacy, randomisation, and obfuscation can impact the training models' utility and performance; conversely, using Tor, Blockchain, and Private Information Retrieval (PIR) may introduce additional computational complexity and high training latency. It is believed that the proposed model could be used as a benchmark for privacy-preserving LLMs for generative AI tools.</p>","PeriodicalId":100650,"journal":{"name":"IET Blockchain","volume":"4 S1","pages":"706-724"},"PeriodicalIF":0.0000,"publicationDate":"2024-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/blc2.12091","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IET Blockchain","FirstCategoryId":"1085","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1049/blc2.12091","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Abstract

Generative Artificial Intelligence (AI) tools based on Large Language Models (LLMs) use billions of parameters to extensively analyse large datasets, extracting critical information such as context, specific details, and identifying information, using this information in the training process, and generating responses to user queries. The extracted data also contain sensitive information, which seriously threatens user privacy and makes users reluctant to adopt such tools. This article proposes PrivChatGPT, a conceptual privacy-preserving model for LLMs consisting of two main components: preserving user privacy during data curation/pre-processing, and preserving private context and a private training process for large-scale data. To demonstrate the applicability of PrivChatGPT, it is shown how a private mechanism could be integrated into an existing LLM training pipeline to protect user privacy; specifically, differential privacy and private training using Reinforcement Learning (RL) are employed. Privacy-level probabilities are associated with document contents, including private contextual information, and with metadata; these probabilities are used to evaluate the disclosure probability loss for an individual's private information. Once differential privacy is applied, the privacy loss is measured and the degree of uncertainty or randomness is evaluated using entropy. The model recursively evaluates the level of privacy guarantees and the uncertainty of public databases and resources at each update, as new information is added for training. To critically evaluate the use of differential privacy for private LLMs, it is hypothetically compared against other mechanisms such as Blockchain, Private Information Retrieval (PIR), randomisation, obfuscation, anonymisation, and the use of Tor, across performance measures including model performance and accuracy, computational complexity, privacy vs. utility, training latency, vulnerability to attacks, and resource consumption. It is concluded that differential privacy, randomisation, and obfuscation can impact the utility and performance of trained models, whereas Tor, Blockchain, and PIR may introduce additional computational complexity and high training latency. It is believed that the proposed model could serve as a benchmark for privacy-preserving LLMs in generative AI tools.
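For concreteness, the following minimal Python sketch (ours, not the authors' implementation; the per-document disclosure probabilities and the `epsilon` value are hypothetical) illustrates two building blocks the abstract combines: releasing a statistic under ε-differential privacy via the Laplace mechanism, and using Shannon entropy to quantify the uncertainty or randomness that the added noise introduces.

```python
# Illustrative sketch of two ideas from the abstract: (1) epsilon-differential
# privacy via the Laplace mechanism, and (2) Shannon entropy as a measure of
# the uncertainty/randomness introduced by the noise. All values are hypothetical.
import numpy as np

def laplace_mechanism(true_value: float, sensitivity: float, epsilon: float) -> float:
    """Release a value under epsilon-differential privacy by adding
    Laplace(sensitivity / epsilon) noise."""
    scale = sensitivity / epsilon
    return true_value + np.random.laplace(loc=0.0, scale=scale)

def shannon_entropy(probs: np.ndarray) -> float:
    """Shannon entropy (in bits) of a discrete distribution; higher entropy
    means more uncertainty about any individual's contribution."""
    probs = probs[probs > 0]
    return float(-np.sum(probs * np.log2(probs)))

# Hypothetical per-document disclosure probabilities, e.g. derived from private
# contextual information and metadata during data curation/pre-processing.
disclosure_probs = np.array([0.6, 0.25, 0.1, 0.05])
print(f"Entropy before noising: {shannon_entropy(disclosure_probs):.3f} bits")

# Privately release each probability (sensitivity 1.0 for a single record),
# then renormalise into a valid distribution and re-measure the entropy.
epsilon = 1.0  # smaller epsilon => stronger privacy, more noise
noised = np.array([laplace_mechanism(p, sensitivity=1.0, epsilon=epsilon)
                   for p in disclosure_probs])
noised = np.clip(noised, 1e-9, None)
noised /= noised.sum()
print(f"Entropy after noising:  {shannon_entropy(noised):.3f} bits")
```

With a smaller ε the Laplace noise grows, the noised distribution flattens, and the measured entropy tends to rise, which mirrors the privacy-vs-utility trade-off discussed in the abstract: stronger privacy guarantees yield less faithful statistics for training.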

