Automatic extraction of cancer registry reportable information from free-text pathology reports using multitask convolutional neural networks

M. Alawad, Shang Gao, John X. Qiu, Hong-Jun Yoon, J. B. Christian, Lynne Penberthy, B. Mumphrey, Xiao-Cheng Wu, Linda Coyle, G. Tourassi
Journal of the American Medical Informatics Association (JAMIA), published 2019-11-09
DOI: 10.1093/jamia/ocz153 (https://doi.org/10.1093/jamia/ocz153)
Citations: 53

Abstract

Objective: We implement 2 different multitask learning (MTL) techniques, hard parameter sharing and cross-stitch, to train a word-level convolutional neural network (CNN) specifically designed for automatic extraction of cancer data from unstructured text in pathology reports. We show the importance of learning related information extraction (IE) tasks leveraging shared representations across the tasks to achieve state-of-the-art performance in classification accuracy and computational efficiency.

Materials and Methods: Multitask CNN (MTCNN) attempts to tackle document information extraction by learning to extract multiple key cancer characteristics simultaneously. We trained our MTCNN to perform 5 information extraction tasks: (1) primary cancer site (65 classes), (2) laterality (4 classes), (3) behavior (3 classes), (4) histological type (63 classes), and (5) histological grade (5 classes). We evaluated the performance on a corpus of 95 231 pathology documents (71 223 unique tumors) obtained from the Louisiana Tumor Registry. We compared the performance of the MTCNN models against single-task CNN models and 2 traditional machine learning approaches, namely support vector machine (SVM) and random forest classifier (RFC).

Results: MTCNNs offered superior performance across all 5 tasks in terms of classification accuracy as compared with the other machine learning models. Based on retrospective evaluation, the hard parameter sharing and cross-stitch MTCNN models correctly classified 59.04% and 57.93% of the pathology reports, respectively, across all 5 tasks. The baseline models achieved 53.68% (CNN), 46.37% (RFC), and 36.75% (SVM). Based on prospective evaluation, the percentages of correctly classified cases across the 5 tasks were 60.11% (hard parameter sharing), 58.13% (cross-stitch), 51.30% (single-task CNN), 42.07% (RFC), and 35.16% (SVM). Moreover, hard parameter sharing MTCNNs outperformed the other models in computational efficiency by using about the same number of trainable parameters as a single-task CNN.

Conclusions: The hard parameter sharing MTCNN offers superior classification accuracy for automated coding support of pathology documents across a wide range of cancers and multiple information extraction tasks while keeping training and inference times similar to those of a single task–specific model.
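The efficiency claim for hard parameter sharing can be illustrated with simple parameter arithmetic. The sketch below is not the authors' implementation: it assumes a common word-level text CNN layout (one embedding table, parallel 1-D convolutions, max-pooling, and one softmax layer per task) in which the five tasks share everything except their softmax layers. Only the per-task class counts come from the abstract; the vocabulary size, embedding width, filter widths, and filter count are hypothetical placeholders. Cross-stitch sharing is not modeled here.

```python
# Class counts per task are taken from the abstract; every other size below
# is a hypothetical placeholder, not a value reported by the authors.
TASK_CLASSES = {
    "site": 65,
    "laterality": 4,
    "behavior": 3,
    "histology": 63,
    "grade": 5,
}

VOCAB_SIZE = 50_000        # assumed vocabulary size
EMBED_DIM = 300            # assumed word-embedding width
FILTER_WIDTHS = (3, 4, 5)  # assumed convolution window sizes, in words
NUM_FILTERS = 100          # assumed number of filters per window size


def encoder_params():
    """Parameters in the word embedding plus the parallel 1-D convolutions."""
    embedding = VOCAB_SIZE * EMBED_DIM
    # Each filter has (width * EMBED_DIM) weights and one bias.
    convs = sum((w * EMBED_DIM + 1) * NUM_FILTERS for w in FILTER_WIDTHS)
    return embedding + convs


def head_params(num_classes):
    """Parameters in one softmax layer over the concatenated pooled features."""
    features = NUM_FILTERS * len(FILTER_WIDTHS)
    return (features + 1) * num_classes


def single_task_params(task):
    """One independent CNN: its own encoder plus one task head."""
    return encoder_params() + head_params(TASK_CLASSES[task])


def hard_sharing_params():
    """One shared encoder plus a separate head for each of the five tasks."""
    return encoder_params() + sum(head_params(c) for c in TASK_CLASSES.values())


if __name__ == "__main__":
    shared = hard_sharing_params()
    largest_single = max(single_task_params(t) for t in TASK_CLASSES)
    separate = sum(single_task_params(t) for t in TASK_CLASSES)
    print(f"hard parameter sharing MTCNN: {shared:,}")
    print(f"largest single-task CNN:      {largest_single:,}")
    print(f"five single-task CNNs:        {separate:,}")
```

Under these assumed sizes the encoder dominates the parameter count, so adding four extra softmax layers changes the total only marginally, whereas training five separate single-task models repeats the encoder five times.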