A new AI-assisted data standard accelerates interoperability in biomedical research.

medRxiv: the preprint server for health sciences. Pub Date: 2024-10-17. DOI: 10.1101/2024.10.17.24315618

Rodney Alan Long, Shannon Ballard, Syed Shah, Owen Bianchi, Lietsel Jones, Mathew J Koretsky, Nicole Kuznetsov, Elise Marsan, Bryant Jen, Phillip Chiang, Abhradeep Mukherjee, Cornelis Blauwendraat, Hampton Leonard, Dan Vitale, Kristin Levine, Sara Bandres-Ciga, Paige Jarreau, Patrick Brannely, Caroline Pantazis, Laurel Screven, Kate Andersh, Alifiya Kapasi, John F Crary, David Gutman, Brittany N Dugger, Sarah Biber, Tim Hohman, Faraz Faghri, Michael Griswold, Lana Sargent, Kendall van Keuren-Jensen, Andrew B Singleton, Yang Fann, Mike A Nalls, Hirotaka Iwaki

Abstract

In this paper, we leveraged large language models (LLMs) to accelerate data wrangling and automate labor-intensive aspects of data discovery and harmonization. This work promotes interoperability standards and enhances data discovery, facilitating AI-readiness in biomedical science, with the generation of Common Data Elements (CDEs) as the key to harmonizing multiple datasets. Thirty-one studies, various ontologies, and medical coding systems served as source material for creating CDEs; the available metadata and context were sent as an API request to 4th-generation OpenAI GPT models to populate each metadata field. A human-in-the-loop (HITL) approach was used to assess the quality and accuracy of the generated CDEs. To regulate CDE generation, we employed Elasticsearch and HITL review to avoid creating duplicate CDEs and instead added the candidates as potential aliases for existing CDEs. The generated CDEs are the foundation for assessing the interoperability potential of datasets: we determine how many dataset column headers can be correctly mapped to CDEs and quantify compliance with permissible values and data types. Subject matter experts reviewed the generated CDEs and determined that 94.0% of generated metadata fields did not require manual revision. Data tables from the Alzheimer's Disease Neuroimaging Initiative (ADNI) and the Global Parkinson's Genetics Program (GP2) were used as test cases for interoperability assessments. Column headers from all test cases were successfully mapped to generated CDEs at a rate of 32.4% via Elasticsearch. The interoperability score, a metric for dataset compatibility with CDEs and other connected datasets based on criteria such as data field completeness and compliance with common harmonization standards, averaged 53.8 out of 100 for the test cases. With this project, we aim to automate the most tedious aspects of data harmonization, enhancing efficiency and scalability in biomedical research while lowering the activation energy for federated research.
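
The header-mapping and scoring steps described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the scoring formula or the Elasticsearch query, so the `CDE` record, the exact normalized name/alias match (a stand-in for the Elasticsearch lookup), and the unweighted mean of mapping rate, completeness, and permissible-value compliance are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CDE:
    """Hypothetical, simplified Common Data Element record."""
    name: str
    aliases: set[str] = field(default_factory=set)
    # None means the CDE places no restriction on values.
    permissible_values: Optional[set[str]] = None


def normalize(label: str) -> str:
    """Lowercase and drop non-alphanumerics so 'Age ' and 'AGE' compare equal."""
    return "".join(ch for ch in label.lower() if ch.isalnum())


def map_headers(headers, cdes):
    """Map each column header to a CDE by normalized name or alias.

    Assumption: exact matching stands in for the Elasticsearch lookup,
    whose query details are not given in the abstract.
    """
    index = {}
    for cde in cdes:
        for label in {cde.name, *cde.aliases}:
            index[normalize(label)] = cde
    return {header: index.get(normalize(header)) for header in headers}


def interoperability_score(table, cdes):
    """Illustrative 0-100 score (assumed formula, not the paper's).

    Unweighted mean of: fraction of headers mapped to a CDE, fraction of
    non-empty cells, and fraction of checked cells with a permissible value.
    """
    mapping = map_headers(table.keys(), cdes)
    mapped = {h: c for h, c in mapping.items() if c is not None}
    mapping_rate = len(mapped) / len(mapping) if mapping else 0.0

    total = filled = checked = compliant = 0
    for header, values in table.items():
        cde = mapped.get(header)
        for value in values:
            total += 1
            if value is None or value == "":
                continue  # empty cell: counts against completeness only
            filled += 1
            if cde is not None and cde.permissible_values is not None:
                checked += 1
                if value in cde.permissible_values:
                    compliant += 1

    completeness = filled / total if total else 0.0
    compliance = compliant / checked if checked else 1.0
    return round(100 * (mapping_rate + completeness + compliance) / 3, 1)


# Toy data, invented for this example.
cdes = [
    CDE("sex", aliases={"gender"}, permissible_values={"male", "female"}),
    CDE("age_at_baseline", aliases={"age"}),
]
table = {
    "Gender": ["male", "female", "M", ""],
    "AGE": ["71", "65", None, "80"],
    "site_id": ["a", "b", "c", "d"],
}
score = interoperability_score(table, cdes)
```

In this toy table two of three headers map to a CDE, ten of twelve cells are filled, and two of three checked values are permissible. A production system would presumably replace `map_headers` with a fuzzy search and weight the criteria differently.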
