Privacy preserving large language models: ChatGPT case study based vision and framework

IET Blockchain Pub Date : 2024-11-17 DOI:10.1049/blc2.12091
Imdad Ullah, Najm Hassan, Sukhpal Singh Gill, Basem Suleiman, Tariq Ahamed Ahanger, Zawar Shah, Junaid Qadir, Salil S. Kanhere
{"title":"Privacy preserving large language models: ChatGPT case study based vision and framework","authors":"Imdad Ullah,&nbsp;Najm Hassan,&nbsp;Sukhpal Singh Gill,&nbsp;Basem Suleiman,&nbsp;Tariq Ahamed Ahanger,&nbsp;Zawar Shah,&nbsp;Junaid Qadir,&nbsp;Salil S. Kanhere","doi":"10.1049/blc2.12091","DOIUrl":null,"url":null,"abstract":"<p>The generative Artificial Intelligence (AI) tools based on Large Language Models (LLMs) use billions of parameters to extensively analyse large datasets and extract critical information such as context, specific details, identifying information, use this information in the training process, and generate responses for the requested queries. The extracted data also contain sensitive information, seriously threatening user privacy and reluctance to use such tools. This article proposes the conceptual model called PrivChatGPT, a privacy-preserving model for LLMs consisting of two main components, that is, preserving user privacy during the data curation/pre-processing and preserving private context and the private training process for large-scale data. To demonstrate the applicability of PrivChatGPT, it is shown how a private mechanism could be integrated into the existing model for training LLMs to protect user privacy; specifically, differential privacy and private training using Reinforcement Learning (RL) were employed. The privacy level probabilities are associated with the document contents, including the private contextual information, and with metadata, which is used to evaluate the disclosure probability loss for an individual's private information. The privacy loss is measured and the measure of uncertainty or randomness is evaluated using entropy once differential privacy is applied. It recursively evaluates the level of privacy guarantees and the uncertainty of public databases and resources during each update when new information is added for training purposes. To critically evaluate the use of differential privacy for private LLMs, other mechanisms were hypothetically compared such as Blockchain, private information retrieval, randomisation, obfuscation, anonymisation, and the use of Tor for various performance measures such as the model performance and accuracy, computational complexity, privacy vs. utility, training latency, vulnerability to attacks, and resource consumption. It is concluded that differential privacy, randomisation, and obfuscation can impact the training models' utility and performance; conversely, using Tor, Blockchain, and Private Information Retrieval (PIR) may introduce additional computational complexity and high training latency. It is believed that the proposed model could be used as a benchmark for privacy-preserving LLMs for generative AI tools.</p>","PeriodicalId":100650,"journal":{"name":"IET Blockchain","volume":"4 S1","pages":"706-724"},"PeriodicalIF":0.0000,"publicationDate":"2024-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/blc2.12091","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IET Blockchain","FirstCategoryId":"1085","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1049/blc2.12091","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

The generative Artificial Intelligence (AI) tools based on Large Language Models (LLMs) use billions of parameters to extensively analyse large datasets and extract critical information such as context, specific details, identifying information, use this information in the training process, and generate responses for the requested queries. The extracted data also contain sensitive information, seriously threatening user privacy and reluctance to use such tools. This article proposes the conceptual model called PrivChatGPT, a privacy-preserving model for LLMs consisting of two main components, that is, preserving user privacy during the data curation/pre-processing and preserving private context and the private training process for large-scale data. To demonstrate the applicability of PrivChatGPT, it is shown how a private mechanism could be integrated into the existing model for training LLMs to protect user privacy; specifically, differential privacy and private training using Reinforcement Learning (RL) were employed. The privacy level probabilities are associated with the document contents, including the private contextual information, and with metadata, which is used to evaluate the disclosure probability loss for an individual's private information. The privacy loss is measured and the measure of uncertainty or randomness is evaluated using entropy once differential privacy is applied. It recursively evaluates the level of privacy guarantees and the uncertainty of public databases and resources during each update when new information is added for training purposes. To critically evaluate the use of differential privacy for private LLMs, other mechanisms were hypothetically compared such as Blockchain, private information retrieval, randomisation, obfuscation, anonymisation, and the use of Tor for various performance measures such as the model performance and accuracy, computational complexity, privacy vs. utility, training latency, vulnerability to attacks, and resource consumption. It is concluded that differential privacy, randomisation, and obfuscation can impact the training models' utility and performance; conversely, using Tor, Blockchain, and Private Information Retrieval (PIR) may introduce additional computational complexity and high training latency. It is believed that the proposed model could be used as a benchmark for privacy-preserving LLMs for generative AI tools.

Abstract Image

求助全文
约1分钟内获得全文 求助全文
来源期刊
CiteScore
1.80
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信