Introduction to the Special Issue on Computational Methods for Biomedical NLP

M. Devarakonda, E. Voorhees
DOI: 10.1145/3492302
Journal: ACM Transactions on Computing for Healthcare (HEALTH)
Published: 2022-01-12
Citations: 0

Abstract

It is now well established that biomedical text requires methods targeted to the domain. Developments in deep learning and a series of successful shared challenges have contributed to steady progress in techniques for natural language processing of biomedical text. Contributing to this ongoing progress, and focusing particularly on computational methods, this special issue was created to encourage research into novel approaches for analyzing biomedical text. The six papers selected for the issue offer a diversity of novel methods that leverage biomedical text for research and clinical uses.

A well-established practice in pretraining deep learning models for biomedical applications has been to adopt a promising model already pretrained on a general-domain natural language corpus and then "add" further pretraining with biomedical corpora. In "Domain-specific language model pretraining for biomedical natural language processing", Gu et al. successfully challenge this approach. The authors conducted an experiment in which multiple standard benchmarks were used to compare a model pretrained entirely and only on a biomedical corpus with models pretrained using the "add-on" approach. The results showed an impressive improvement in favor of pretraining only with the biomedical corpus. The study provides an excellent data point in support of clarity in model training rather than accumulation.

Tariq et al. also find domain-aware tokenization and embeddings to be more effective in their paper "Bridging the Gap Between Structured and Free-form Radiology Reporting: A Case-study on Coronary CT Angiography". They compare a variety of models constructed to predict the severity of cardiovascular disease from the language used within free-text radiology reports. Models that used medical-domain-aware tokenization and word embeddings of the reports were consistently more effective than raw word-based models.
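Why a domain-aware vocabulary helps can be illustrated with a minimal sketch (not code from either paper): a greedy longest-match, WordPiece-style tokenizer fragments medical terms that are absent from a general-domain vocabulary, while a vocabulary built from biomedical text keeps them whole. Both vocabularies below are hypothetical toy examples.

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first segmentation, WordPiece-style.

    Non-initial pieces carry the conventional "##" continuation prefix.
    """
    pieces, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while end > start:
            piece = ("##" if start > 0 else "") + word[start:end]
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:            # no piece matches: unknown word
            return ["[UNK]"]
        pieces.append(cur)
        start = end
    return pieces

# Hypothetical vocabularies: the general one lacks whole medical terms.
general_vocab = {"my", "##o", "##card", "##ial", "in", "##far", "##ction"}
biomed_vocab  = {"myocardial", "infarction"}

print(wordpiece("myocardial", general_vocab))  # ['my', '##o', '##card', '##ial']
print(wordpiece("myocardial", biomed_vocab))   # ['myocardial']
```

A model over the general vocabulary must reassemble the meaning of "myocardial" from four fragments, whereas the biomedical vocabulary gives it a single, dedicated embedding.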
The better models are able to accurately predict disease severity under real-world conditions of diverse terminology from different radiologists and unbalanced class sizes.

Two papers address the problem of maintaining the privacy of clinical documents, though from widely different perspectives. De-identification is the most commonly used approach to eliminating PHI (Protected Health Information) from clinical documents before making the data available to NLP researchers. In "A Context-enhanced De-identification System", Kahyun et al. describe an improved de-identification technique for clinical records. Their context-enhanced de-identification system, called CEDI, uses attention mechanisms in a long short-term memory (LSTM) network to capture the appropriate context. This context allows the system to detect dependencies that cross sentence boundaries, an important feature since clinical reports often contain such dependencies. Nonetheless, accurate and broad-coverage de-identification of unstructured data remains challenging, and lack of trust in the de-identification process can be a serious limiting factor for data release.

In "Differentially Private Medical Texts Generation using Generative Neural Networks", Aziz et al. take a different approach to the privacy of clinical documents. They propose high-accuracy synthetic generation of clinical documents as a practical alternative. Using self-attention-based neural networks and differential privacy (i.e., the ability to control the level of privacy relative to the original document) in their method,
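The privacy knob that differential privacy provides can be shown with a minimal sketch, independent of the authors' generation model (all names and values here are hypothetical): the classic Laplace mechanism releases a statistic with noise scaled to sensitivity/ε, so a smaller ε (a stronger privacy guarantee) yields a noisier release.

```python
import math
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise via the inverse-CDF transform."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_release(true_value, epsilon, sensitivity=1.0, seed=None):
    """Release true_value with epsilon-differential privacy (Laplace mechanism)."""
    rng = random.Random(seed)
    return true_value + laplace_noise(sensitivity / epsilon, rng)

# Same seed, different epsilon: the weaker guarantee (epsilon=1.0) adds
# ten times less noise than the stronger one (epsilon=0.1).
loose = private_release(100.0, epsilon=1.0, seed=42)
tight = private_release(100.0, epsilon=0.1, seed=42)
```

In text generation the mechanism is applied to model training (e.g., to gradients) rather than to a single count, but the trade-off is the same: ε controls how much any one source document can influence the released output.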