Secure latent Dirichlet allocation.

Frontiers in Digital Health · IF 3.2 · Q1 (Health Care Sciences & Services)
Pub Date: 2025-07-24 · eCollection Date: 2025-01-01 · DOI: 10.3389/fdgth.2025.1610228
Thijs Veugen, Vincent Dunning, Michiel Marcus, Bart Kamphorst
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12328381/pdf/
Citations: 0

Abstract

Topic modelling refers to a popular family of techniques for discovering the hidden topics that occur in a collection of documents. These topics can, for example, be used to categorize documents or to label text for further processing. One popular topic-modelling technique is Latent Dirichlet Allocation (LDA). Topic-modelling scenarios often assume that the documents reside in a single, centralized dataset. Sometimes, however, the documents are held by different parties and contain privacy- or commercially sensitive information that cannot be shared. We present a novel, decentralized approach to training an LDA model securely without sharing any information about the content of the documents. We preserve the privacy of the individual parties using a combination of privacy-enhancing technologies. Alongside the secure LDA protocol, we introduce two new cryptographic building blocks of independent interest: a way to efficiently convert between secret-shared and homomorphically encrypted data, and a method to efficiently draw a random number from a finite set with secret weights. We show that our decentralized, privacy-preserving LDA solution achieves accuracy similar to that of an (insecure) centralized approach. With 1024-bit Paillier keys, a topic model with 5 topics and 3,000 words can be trained in around 16 hours. Furthermore, we show that the solution scales linearly in the total number of words and in the number of topics.
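The protocol relies on Paillier encryption, whose additive homomorphism lets parties aggregate values without decrypting them. As background (not the paper's protocol), the following is a minimal self-contained sketch of that homomorphic property; the toy 14-bit primes are for illustration only, whereas the paper uses 1024-bit keys:

```python
# Toy Paillier cryptosystem demonstrating the additive homomorphism
# used in secure-aggregation protocols. Illustration only: the key
# size here is far too small to be secure.
import random
from math import gcd

p, q = 10007, 10009                              # toy primes (insecure size)
n = p * q
nsq = n * n
g = n + 1                                        # standard generator choice
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)     # lcm(p-1, q-1)
mu = pow(lam, -1, n)                             # inverse of lam mod n (g = n+1 case)

def encrypt(m: int) -> int:
    """Encrypt plaintext m in Z_n with fresh randomness r."""
    while True:
        r = random.randrange(1, n)
        if gcd(r, n) == 1:
            break
    return (pow(g, m, nsq) * pow(r, n, nsq)) % nsq

def decrypt(c: int) -> int:
    """Decrypt via the L-function L(x) = (x - 1) // n."""
    return ((pow(c, lam, nsq) - 1) // n * mu) % n

# Additive homomorphism: multiplying ciphertexts adds plaintexts mod n.
c1, c2 = encrypt(5), encrypt(7)
assert decrypt((c1 * c2) % nsq) == 12
# Exponentiation by a public scalar multiplies the plaintext.
assert decrypt(pow(c1, 3, nsq)) == 15
```

Because addition on plaintexts maps to multiplication on ciphertexts, each party can contribute encrypted counts that are summed without any party seeing another's data, which is the basic mechanism secured protocols of this kind build upon.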


Source journal: Frontiers in Digital Health · CiteScore 4.20 · Self-citation rate 0.00% · Review time 13 weeks