How good is your synthetic data? SynthRO, a dashboard to evaluate and benchmark synthetic tabular data.

IF 3.3 | Zone 3, Medicine | Q2, Medical Informatics

BMC Medical Informatics and Decision Making Pub Date : 2025-02-18 DOI:10.1186/s12911-024-02731-9

Gabriele Santangelo, Giovanna Nicora, Riccardo Bellazzi, Arianna Dagliati



Abstract

Background: The exponential growth in patient data collection by healthcare providers, governments, and private industries is yielding large and diverse datasets that offer new insights into critical medical questions. Leveraging extensive computational resources, Machine Learning and Artificial Intelligence are increasingly used to address health-related issues, such as predicting outcomes from Electronic Health Records and detecting patterns in multi-omics data. Despite the proliferation of medical devices based on Artificial Intelligence, data accessibility for research is limited due to privacy concerns. Efforts to de-identify data have struggled to remain effective, particularly with large datasets. As an alternative, synthetic data that replicate the main statistical properties of real patient data have been proposed. However, the lack of standardized evaluation metrics complicates the selection of appropriate synthetic data generation methods. Effective evaluation of synthetic data must consider resemblance, utility, and privacy, tailored to the specific application. Although metrics are available, benchmarking efforts remain limited, necessitating further research in this area.

Results: We present SynthRO (Synthetic data Rank and Order), a user-friendly tool for benchmarking synthetic tabular health data across various contexts. SynthRO offers accessible quality evaluation metrics and automated benchmarking, helping users determine the most suitable synthetic data models for specific use cases by prioritizing metrics and providing consistent quantitative scores. Our dashboard is divided into three main sections: (1) a Loading Data section, where users can locally upload real and synthetic datasets; (2) an Evaluation section, in which several quality assessments are performed by computing different metrics and measures; and (3) a Benchmarking section, where users can globally compare synthetic datasets based on the quality evaluation.
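The abstract does not say how SynthRO combines its metrics, so the following is only a minimal sketch of one way "prioritizing metrics" could turn per-dataset scores into a ranking. It assumes each synthetic dataset already has one score per dimension (resemblance, utility, privacy) scaled to [0, 1] with higher meaning better. The function name `rank_synthetic_datasets`, the dataset names, the scores, and the weights are all hypothetical and do not come from the paper or the tool.

```python
from typing import Dict, List, Tuple


def rank_synthetic_datasets(
    scores: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
) -> List[Tuple[str, float]]:
    """Rank datasets by a weighted mean of their metric scores.

    scores  maps dataset name -> {metric name: score in [0, 1], higher is better}.
    weights maps metric name -> non-negative priority chosen by the user.
    This is an illustrative aggregation, not SynthRO's documented method.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("at least one metric must have a positive weight")

    combined = {
        name: sum(weights[m] * metrics[m] for m in weights) / total_weight
        for name, metrics in scores.items()
    }
    # Best dataset first.
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores for two synthetic datasets (made-up numbers).
example_scores = {
    "model_a": {"resemblance": 0.9, "utility": 0.8, "privacy": 0.5},
    "model_b": {"resemblance": 0.7, "utility": 0.7, "privacy": 0.9},
}

# A privacy-first use case versus a fidelity-first use case.
privacy_first = {"resemblance": 1.0, "utility": 1.0, "privacy": 3.0}
fidelity_first = {"resemblance": 2.0, "utility": 2.0, "privacy": 1.0}

print(rank_synthetic_datasets(example_scores, privacy_first))
print(rank_synthetic_datasets(example_scores, fidelity_first))
```

With these made-up numbers the preferred dataset changes with the weights, which illustrates why a benchmark tied to a specific use case can reach a different conclusion than a single fixed score.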

Conclusions: Synthetic data mitigate concerns about privacy and data accessibility, yet lack standardized evaluation metrics. SynthRO provides an accessible dashboard that helps users select suitable synthetic data models, and it supports various use cases in healthcare, such as enhancing prognostic scores and enabling federated learning. SynthRO's accessible GUI and modular structure facilitate effective data evaluation, promoting reliability and fairness. Future developments will include temporal data evaluation, further broadening its applicability.


Source journal

BMC Medical Informatics and Decision Making (Medicine: Medical Informatics)

CiteScore: 7.20
Self-citation rate: 5.70%
Articles published: 297
Review time: 1 month

About the journal: BMC Medical Informatics and Decision Making is an open access journal publishing original peer-reviewed research articles in relation to the design, development, implementation, use, and evaluation of health information technologies and decision-making for human health.