Privacy-Preserving Deep Learning NLP Models for Cancer Registries

IF 5.1 | Zone 2, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS
Mohammed Alawad;Hong-Jun Yoon;Shang Gao;Brent Mumphrey;Xiao-Cheng Wu;Eric B. Durbin;Jong Cheol Jeong;Isaac Hands;David Rust;Linda Coyle;Lynne Penberthy;Georgia Tourassi
Journal: IEEE Transactions on Emerging Topics in Computing, vol. 9, no. 3, pp. 1219-1230
Publication type: Journal Article
Publication date: 2020-04-16
DOI: 10.1109/TETC.2020.2983404
URL: https://ieeexplore.ieee.org/document/9069186/
Citations: 19

Abstract

Population cancer registries can benefit from Deep Learning (DL) to automatically extract cancer characteristics from the high volume of unstructured pathology text reports they process annually. The success of DL in tackling this and other real-world problems depends on the availability of large labeled datasets for model training. Although collaboration among cancer registries is essential to fully exploit the promise of DL, privacy and confidentiality concerns are the main obstacles to data sharing across registries. Moreover, DL for natural language processing (NLP) requires sharing a vocabulary dictionary for the embedding layer, which may contain patient identifiers. Thus, even distributing trained models across cancer registries can violate privacy. In this article, we propose DL NLP model distribution via privacy-preserving transfer learning approaches that do not share sensitive data. These approaches are used to distribute a multitask convolutional neural network (MT-CNN) NLP model among cancer registries. The model is trained to extract six key cancer characteristics (tumor site, subsite, laterality, behavior, histology, and grade) from cancer pathology reports. Using 410,064 pathology documents from two cancer registries, we compare our proposed approach to conventional transfer learning without privacy preservation, to single-registry models, and to a model trained on centrally hosted data. The results show that transfer learning approaches, including data sharing and model distribution, significantly outperform the single-registry model. In addition, the best-performing privacy-preserving model distribution approach achieves average micro- and macro-F1 scores across all extraction tasks (0.823, 0.580) that are statistically indistinguishable from those of the centralized model (0.827, 0.585).
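The abstract reports both micro- and macro-averaged F1. As a minimal sketch of how the two averages differ, the Python below computes them for a single extraction task on made-up labels. The function name `f1_scores` and the toy labels are illustrative assumptions, not taken from the paper, and the paper's own figures are further averaged across its six tasks.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multiclass predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    # Micro-F1: pool the counts over all classes, then compute one F1.
    tp_sum, fp_sum, fn_sum = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro_denom = 2 * tp_sum + fp_sum + fn_sum
    micro = 2 * tp_sum / micro_denom if micro_denom else 0.0

    # Macro-F1: compute F1 per class, then take the unweighted mean.
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        per_class.append(2 * tp[label] / denom if denom else 0.0)
    macro = sum(per_class) / len(per_class) if per_class else 0.0
    return micro, macro


# Toy, hypothetical labels: one frequent class and one rare class.
y_true = ["breast"] * 8 + ["lung"] * 2
y_pred = ["breast"] * 8 + ["breast", "lung"]

micro, macro = f1_scores(y_true, y_pred)
print(f"micro-F1 = {micro:.3f}, macro-F1 = {macro:.3f}")
# prints: micro-F1 = 0.900, macro-F1 = 0.804
```

In this toy case a single error on the rare class lowers macro-F1 much more than micro-F1, because macro averaging weights every class equally regardless of frequency. That is one plausible reason macro scores can sit well below micro scores when a label set has many infrequent classes.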

Source journal
IEEE Transactions on Emerging Topics in Computing
Subject area: Computer Science (miscellaneous)
CiteScore: 12.10
Self-citation rate: 5.10%
Articles published: 113
About the journal: IEEE Transactions on Emerging Topics in Computing publishes papers on emerging aspects of computer science, computing technology, and computing applications not currently covered by other IEEE Computer Society Transactions. Some examples of emerging topics in computing include: IT for Green, Synthetic and organic computing structures and systems, Advanced analytics, Social/occupational computing, Location-based/client computer systems, Morphic computer design, Electronic game systems, and Health-care IT.