GeMTeX's De-Identification in Action: Lessons Learned & Devil's Details.

Studies in Health Technology and Informatics. Pub Date: 2025-09-03. DOI: 10.3233/SHTI251406

Christina Lohr, Jakob Faller, Andrea Riedel, Hung Manh Nguyen, Markus Wolfien, Justin Hofenbitzer, Luise Modersohn, Jutta Romberg, Fabian Prasser, Jazia Omeirat, Yutong Wen, Oksana Galusch, Udo Hahn, Marvin Seiferling, Christoph Dieterich, Peter Klügl, Franz Matthies, Janina Kind, Martin Boeker, Markus Löffler, Frank Meineke

Volume 331, pages 274-282. Publication type: Journal Article.

Abstract

GeMTeX's De-Identification in Action: Lessons Learned & Devil's Details.

Introduction: In 2024, the GeMTeX project launched the largest ever de-identification campaign for German-language clinical reports, and, as a pilot study, published GraSCCoPHI, the first de-identified German-language gold standard corpus of synthetic discharge summaries.

Methods: GeMTeX's de-identification workflow is described here, including annotation tool management and pre-annotation experience, such as assembling and training annotation groups and the evolution of guidelines.

Results: We present the project's progress in the first year with respect to de-identification efforts and the challenges we faced during the rollout at six hospital sites in four German states. The refinement of the annotation guidelines became an ongoing process, often with unforeseen hurdles to overcome as we moved from testing to production. From our current internal interim corpus (9,000 documents with about 20 million tokens), we are publishing the first quantitative insights, such as the average amount of identifiable information per document, a list of confounding factors we did not anticipate at the beginning of the project, and three key lessons learned.
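As a rough illustration of the kind of per-document statistic mentioned above, the following minimal sketch derives the average document length from the two reported corpus figures and computes a mean annotation count per document. The annotation pairs, document IDs, and label names (`NAME`, `DATE`, `LOCATION`) are hypothetical toy values, not data or categories from the GeMTeX corpus, and the function names are assumptions made for this example.

```python
from collections import Counter
from statistics import mean

# Approximate figures reported in the abstract for the interim corpus.
DOCUMENTS = 9_000
TOKENS = 20_000_000


def tokens_per_document(tokens: int, documents: int) -> float:
    """Average number of tokens per document."""
    return tokens / documents


def mean_annotations_per_document(annotations, document_ids):
    """Mean number of annotations per document.

    annotations: iterable of (doc_id, label) pairs.
    document_ids: all documents, including those without any annotation,
    so that empty documents count as zero instead of being skipped.
    """
    counts = Counter(doc_id for doc_id, _ in annotations)
    return mean(counts.get(doc_id, 0) for doc_id in document_ids)


# Hypothetical toy data, for illustration only.
toy_docs = ["d1", "d2", "d3"]
toy_annotations = [
    ("d1", "NAME"),
    ("d1", "DATE"),
    ("d1", "LOCATION"),
    ("d2", "DATE"),
]

if __name__ == "__main__":
    print(round(tokens_per_document(TOKENS, DOCUMENTS)))
    print(mean_annotations_per_document(toy_annotations, toy_docs))
```

Passing the full list of document IDs separately matters here: averaging only over documents that have at least one annotation would overstate the amount of identifiable information per document.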

Conclusion: We note that the unforeseen hurdles follow a Pareto-like pattern: they are concentrated in fewer than 20% of the annotations.
