Christina Lohr, Jakob Faller, Andrea Riedel, Hung Manh Nguyen, Markus Wolfien, Justin Hofenbitzer, Luise Modersohn, Jutta Romberg, Fabian Prasser, Jazia Omeirat, Yutong Wen, Oksana Galusch, Udo Hahn, Marvin Seiferling, Christoph Dieterich, Peter Klügl, Franz Matthies, Janina Kind, Martin Boeker, Markus Löffler, Frank Meineke
Studies in health technology and informatics, vol. 331, pp. 274-282. Journal Article, published 2025-09-03. DOI: 10.3233/SHTI251406 (https://doi.org/10.3233/SHTI251406)
GeMTeX's De-Identification in Action: Lessons Learned & Devil's Details.
Introduction: In 2024, the GeMTeX project launched the largest ever de-identification campaign for German-language clinical reports, and, as a pilot study, published GraSCCoPHI, the first de-identified German-language gold standard corpus of synthetic discharge summaries.
Methods: We describe GeMTeX's de-identification workflow, including annotation tool management and pre-annotation experience, such as assembling and training annotation groups and the evolution of guidelines.
Results: We present the project's progress in the first year with respect to de-identification efforts and the challenges we faced during the rollout at six hospital sites in four German states. The refinement of the annotation guidelines became an ongoing process, often with unforeseen hurdles to overcome as we moved from testing to production. From our current internal interim corpus (9,000 documents with about 20 million tokens), we are publishing the first quantitative insights, such as the average amount of identifiable information per document, a list of confounding factors we did not anticipate at the beginning of the project, and three key lessons learned.
Conclusion: We note that the unforeseen hurdles follow the Pareto principle: they fall within a set of less than 20% of the annotations.