Catnip for MedCAT: Optimizing the Input for Automated SNOMED CT Mapping of Clinical Variables.

Studies in health technology and informatics, 331:142-152. Pub Date: 2025-09-03. DOI: 10.3233/SHTI251390

Julia Gehrmann, Asme Dogan, Lea Hagelschuer, Lars Quakulinski, Anne Koy, Oya Beyan



Abstract

Introduction: Mapping local medical data assets to international data standards such as the medical ontology SNOMED CT fosters data harmonization and, thereby, global progress in medical research. Because the high resource requirements of manual SNOMED CT mapping often make it impractical, automated mapping tools such as MedCAT have been developed. We investigated how the formulation of study variable names (VNs) influences the efficacy and accuracy of the SNOMED CT concepts identified by MedCAT.

Methods: We extracted 763 VNs from the GEPESTIM database hosted locally in REDCap and created three VN versions using different REDCap metadata items for MedCAT-based SNOMED CT mapping. A fourth VN version was created manually. The mapping was evaluated based on the number and quality of the identified SNOMED CT concepts, with concept accuracy assessed by blinded manual scoring.
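The construction of VN versions from REDCap metadata can be illustrated with a minimal sketch. The abstract does not state which metadata items were combined, so the choice of `field_label`, `section_header`, and `field_note` (standard REDCap data dictionary columns) and the version names `basic`, `medium`, and `rich` are assumptions made purely for illustration. The annotator is passed in as a callable so that the sketch does not depend on any particular MedCAT call.

```python
import re


def build_vn_versions(field):
    """Return three candidate VN strings of increasing expressiveness.

    `field` is a dict keyed by REDCap data dictionary column names.
    Which columns the study actually combined is an assumption here.
    """
    def clean(text):
        # Strip HTML tags that labels may contain, then collapse whitespace.
        text = re.sub(r"<[^>]+>", " ", text or "")
        return re.sub(r"\s+", " ", text).strip()

    label = clean(field.get("field_label"))
    header = clean(field.get("section_header"))
    note = clean(field.get("field_note"))

    return {
        "basic": label,
        "medium": " ".join(p for p in (header, label) if p),
        "rich": " ".join(p for p in (header, label, note) if p),
    }


def map_versions(versions, annotate):
    """Apply an annotator callable (text -> list of concept ids) to each version."""
    return {name: annotate(text) for name, text in versions.items()}
```

In practice, `annotate` could be a thin wrapper around a loaded MedCAT model that returns the SNOMED CT concept identifiers found in the text. Comparing the outputs per version then shows how much each additional metadata item adds or distorts.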

Results: Increasing the expressiveness of VNs by adding more metadata items led to more SNOMED CT concepts being mapped, but also introduced mismatches, particularly when the added metadata contained misleading terms. The best overall mapping performance was achieved with the manually specified VNs, while a basic VN version with minimal extra information from the metadata yielded similarly good results.

Conclusion: Our study identified key challenges in using MedCAT for automatically mapping study variables to SNOMED CT concepts. To improve accuracy, we recommend refining VNs by reducing misleading terms and iteratively improving VN phrasing for an optimal mapping outcome. Furthermore, it appears reasonable to always conduct a final manual review of the mapping outcome, especially for critical variables and for VNs containing negations or abbreviations.
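The recommendation to manually review VNs containing negations or abbreviations could be supported by a simple pre-filter. The following is a hypothetical heuristic, not the authors' method: the cue-word list and the rule that treats short all-uppercase tokens as abbreviations are assumptions and would need tuning for a real data dictionary.

```python
import re

# Assumed English negation cues; extend for other languages or phrasing.
NEGATION_CUES = {"no", "not", "none", "without", "never", "absent"}


def needs_manual_review(vn):
    """Flag a variable name that contains a negation cue or a likely abbreviation."""
    tokens = re.findall(r"[A-Za-z]+", vn)
    has_negation = any(t.lower() in NEGATION_CUES for t in tokens)
    # Treat all-uppercase tokens of 2 to 5 letters as likely abbreviations.
    has_abbreviation = any(2 <= len(t) <= 5 and t.isupper() for t in tokens)
    return has_negation or has_abbreviation
```

Flagged VNs would go to a reviewer regardless of the mapping result. Unflagged VNs could still be spot-checked.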


