Clinical document corpora-real ones, translated and synthetic substitutes, and assorted domain proxies: a survey of diversity in corpus design, with focus on German text data.

IF 3.4 Q2 HEALTH CARE SCIENCES & SERVICES

JAMIA Open. Pub Date: 2025-05-14. eCollection Date: 2025-06-01. DOI: 10.1093/jamiaopen/ooaf024

Udo Hahn



Abstract

Objective: We survey clinical document corpora, with a focus on German textual data. Due to rigid data privacy legislation in Germany, these resources are, with only a few exceptions, stored in protected clinical data spaces and locked away from clinic-external researchers. This situation stands in stark contrast to established workflows in natural language processing, where easy access to and reuse of (textual) data collections are common practice. Hence, alternative corpus designs have been examined to escape from data poverty. Besides machine translation of English clinical datasets and the generation of synthetic corpora with fictitious clinical content, several types of domain proxies have emerged as substitutes for real clinical documents. Common instances of close proxies are medical journal publications, therapy guidelines, drug labels, etc.; more distant proxies include medical content from social media channels or online encyclopedic medical articles.

Methods: We follow the PRISMA (Preferred Reporting Items for Systematic reviews and Meta-Analyses) guidelines for surveying the field of German-language clinical/medical corpora. Four bibliographic databases were searched: PubMed, ACL Anthology, Google Scholar, and the author's personal literature database.

Results: After PRISMA-conformant identification of 362 hits from the 4 bibliographic systems, the screening process yielded 78 relevant documents for inclusion in this review. Overall, they contained 92 different published versions of corpora, of which 71 were truly unique in terms of their underlying document sets. Of these, the majority were clinical corpora: 46 real ones, of which 32 were unique, 5 translated ones (3 unique), and 6 synthetic ones (3 unique). As to domain proxies, we identified 18 close ones (16 unique) and 17 distant ones (all of them unique).
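The category counts reported in the Results can be checked against the stated totals with a short tally. The following is a minimal sketch; the dictionary layout and variable names are illustrative assumptions, and only the numbers come from the abstract.

```python
# Counts taken from the Results paragraph:
# category -> (published corpus versions, unique underlying document sets)
counts = {
    "real clinical": (46, 32),
    "translated": (5, 3),
    "synthetic": (6, 3),
    "close proxy": (18, 16),
    "distant proxy": (17, 17),
}

# Sum over all five categories.
total_versions = sum(versions for versions, _ in counts.values())
total_unique = sum(unique for _, unique in counts.values())

# Clinical corpora = real + translated + synthetic (proxies excluded).
clinical = ("real clinical", "translated", "synthetic")
clinical_versions = sum(counts[name][0] for name in clinical)

print(f"published versions: {total_versions}")
print(f"unique document sets: {total_unique}")
print(f"clinical versions: {clinical_versions} of {total_versions}")
```

Under this reading, the five categories add up to the 92 published versions and 71 unique corpora reported, and the three clinical categories together account for more than half of the published versions.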

Discussion: There is a clear divide between the large number of non-accessible real clinical German-language corpora and their publicly accessible substitutes: translated or synthetic datasets, and close or more distant proxies. So, at first sight, the data bottleneck seems broken. Intuitively, however, differences in genre-specific writing style, lexical or terminological diction, and required medical background expertise in this typological space are also obvious. This raises the question of how valid alternative corpus designs really are. A systematic, empirically grounded yardstick for comparing real clinical corpora with the suggested substitutes and proxies is still missing.

Conclusion: The extreme sparsity of real clinical corpora in almost all non-Anglo-American countries worldwide, Germany in particular, has triggered an active search for the alternative, publicly accessible data resources laid out in this survey. However, the utility of these substitutes compared with real clinical corpora, and their semantic and genre-specific distance from them, are still under-researched, so their value remains to be properly assessed. Furthermore, corpus descriptions are often incomplete with respect to relevant descriptive attributes. This paper bundles these observations and proposes a template for a so-called corpus card to improve corpus documentation.


