{"title":"PhenoQC:基因组研究中表型数据质量控制的集成工具包","authors":"Jorge Miguel Silva, José Luis Oliveira","doi":"10.1016/j.imu.2025.101693","DOIUrl":null,"url":null,"abstract":"<div><h3>Background:</h3><div>Large-scale genomic research requires robust, consistent phenotypic datasets for meaningful genotype–phenotype correlations. However, diverse collection protocols, incomplete entries, and heterogeneous terminologies frequently compromise data quality and slows downstream analysis.</div></div><div><h3>Methodology:</h3><div>To address these issues, we present PhenoQC, a high-throughput, configuration-driven toolkit that unifies schema validation, ontology-based semantic alignment, and missing-data imputation in a single workflow. Its modular architecture leverages chunk-based parallelism to handle large datasets, while customizable schemas enforce structural and type constraints. PhenoQC applies user-defined and state-of-the-art machine learning-based imputation and performs multi-ontology mapping with fuzzy matching to harmonize phenotype text. It also quantifies potential imputation-induced distributional shifts by reporting standardized mean difference, variance ratio, and Kolmogorov–Smirnov statistics for numeric variables, and population stability index and Cramér’s <span><math><mi>V</mi></math></span> for categorical variables, with user-configurable thresholds. The toolkit provides command-line and graphical interfaces for seamless integration into automated pipelines and interactive curation environments.</div></div><div><h3>Results:</h3><div>We benchmarked PhenoQC on synthetic datasets with up to 100,000 records and it demonstrated near-linear scalability and full recovery of artificially missing numeric values.Moreover, PhenoQC’s ontology alignment achieved over 97% accuracy under textual corruption. Finally, using two real clinical datasets, PhenoQC successfully imputed missing values, enforced schema compliance, and flagged data anomalies without significant overhead.</div></div><div><h3>Conclusions:</h3><div>PhenoQC saves manual curation time and ensures consistent, analysis-ready phenotypic data through its streamlined system. Its adaptable design adjusts to evolving ontologies and domain-specific rules, empowering researchers to conduct more reliable studies.</div></div>","PeriodicalId":13953,"journal":{"name":"Informatics in Medicine Unlocked","volume":"58 ","pages":"Article 101693"},"PeriodicalIF":0.0000,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"PhenoQC: An integrated toolkit for quality control of phenotypic data in genomic research\",\"authors\":\"Jorge Miguel Silva, José Luis Oliveira\",\"doi\":\"10.1016/j.imu.2025.101693\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><h3>Background:</h3><div>Large-scale genomic research requires robust, consistent phenotypic datasets for meaningful genotype–phenotype correlations. However, diverse collection protocols, incomplete entries, and heterogeneous terminologies frequently compromise data quality and slows downstream analysis.</div></div><div><h3>Methodology:</h3><div>To address these issues, we present PhenoQC, a high-throughput, configuration-driven toolkit that unifies schema validation, ontology-based semantic alignment, and missing-data imputation in a single workflow. 
Its modular architecture leverages chunk-based parallelism to handle large datasets, while customizable schemas enforce structural and type constraints. PhenoQC applies user-defined and state-of-the-art machine learning-based imputation and performs multi-ontology mapping with fuzzy matching to harmonize phenotype text. It also quantifies potential imputation-induced distributional shifts by reporting standardized mean difference, variance ratio, and Kolmogorov–Smirnov statistics for numeric variables, and population stability index and Cramér’s <span><math><mi>V</mi></math></span> for categorical variables, with user-configurable thresholds. The toolkit provides command-line and graphical interfaces for seamless integration into automated pipelines and interactive curation environments.</div></div><div><h3>Results:</h3><div>We benchmarked PhenoQC on synthetic datasets with up to 100,000 records and it demonstrated near-linear scalability and full recovery of artificially missing numeric values.Moreover, PhenoQC’s ontology alignment achieved over 97% accuracy under textual corruption. Finally, using two real clinical datasets, PhenoQC successfully imputed missing values, enforced schema compliance, and flagged data anomalies without significant overhead.</div></div><div><h3>Conclusions:</h3><div>PhenoQC saves manual curation time and ensures consistent, analysis-ready phenotypic data through its streamlined system. Its adaptable design adjusts to evolving ontologies and domain-specific rules, empowering researchers to conduct more reliable studies.</div></div>\",\"PeriodicalId\":13953,\"journal\":{\"name\":\"Informatics in Medicine Unlocked\",\"volume\":\"58 \",\"pages\":\"Article 101693\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2025-01-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Informatics in Medicine Unlocked\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S2352914825000826\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"Medicine\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Informatics in Medicine Unlocked","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2352914825000826","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"Medicine","Score":null,"Total":0}
PhenoQC: An integrated toolkit for quality control of phenotypic data in genomic research
Background:
Large-scale genomic research requires robust, consistent phenotypic datasets for meaningful genotype–phenotype correlations. However, diverse collection protocols, incomplete entries, and heterogeneous terminologies frequently compromise data quality and slow downstream analysis.
Methodology:
To address these issues, we present PhenoQC, a high-throughput, configuration-driven toolkit that unifies schema validation, ontology-based semantic alignment, and missing-data imputation in a single workflow. Its modular architecture leverages chunk-based parallelism to handle large datasets, while customizable schemas enforce structural and type constraints. PhenoQC applies user-defined and state-of-the-art machine learning-based imputation and performs multi-ontology mapping with fuzzy matching to harmonize phenotype text. It also quantifies potential imputation-induced distributional shifts by reporting standardized mean difference, variance ratio, and Kolmogorov–Smirnov statistics for numeric variables, and the population stability index and Cramér’s V for categorical variables, with user-configurable thresholds. The toolkit provides command-line and graphical interfaces for seamless integration into automated pipelines and interactive curation environments.
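The shift metrics listed above are standard statistics. The following minimal Python sketch (not PhenoQC’s actual API, which the abstract does not show) illustrates how a pre- versus post-imputation comparison of one numeric and one categorical column might be computed and then flagged against user-configurable thresholds:

    # Minimal sketch of imputation-shift metrics: SMD, variance ratio, and
    # Kolmogorov-Smirnov statistic for numeric columns; PSI and Cramer's V
    # for categorical columns. Column and function names are illustrative.
    import numpy as np
    import pandas as pd
    from scipy import stats

    def numeric_shift(before: pd.Series, after: pd.Series) -> dict:
        b = before.dropna().to_numpy(float)
        a = after.dropna().to_numpy(float)
        pooled_sd = np.sqrt((b.var(ddof=1) + a.var(ddof=1)) / 2)
        return {
            "smd": abs(a.mean() - b.mean()) / pooled_sd if pooled_sd > 0 else 0.0,
            "variance_ratio": a.var(ddof=1) / b.var(ddof=1) if b.var(ddof=1) > 0 else np.inf,
            "ks_stat": stats.ks_2samp(b, a).statistic,
        }

    def categorical_shift(before: pd.Series, after: pd.Series) -> dict:
        cats = sorted(set(before.dropna()) | set(after.dropna()))
        p = before.value_counts(normalize=True).reindex(cats, fill_value=0).to_numpy()
        q = after.value_counts(normalize=True).reindex(cats, fill_value=0).to_numpy()
        eps = 1e-6  # avoid log(0) for categories absent from one sample
        psi = float(np.sum((q - p) * np.log((q + eps) / (p + eps))))
        # Cramer's V from a 2 x k contingency table (sample label vs. category)
        table = np.vstack([
            before.value_counts().reindex(cats, fill_value=0),
            after.value_counts().reindex(cats, fill_value=0),
        ])
        if len(cats) < 2:
            v = 0.0
        else:
            chi2 = stats.chi2_contingency(table)[0]
            v = float(np.sqrt(chi2 / (table.sum() * (min(table.shape) - 1))))
        return {"psi": psi, "cramers_v": v}

    # Flag a column when any metric exceeds its (user-configurable) threshold:
    # flags = {k: v for k, v in numeric_shift(raw_col, imputed_col).items()
    #          if v > thresholds.get(k, np.inf)}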
Results:
We benchmarked PhenoQC on synthetic datasets with up to 100,000 records; it demonstrated near-linear scalability and fully recovered artificially missing numeric values. Moreover, PhenoQC’s ontology alignment achieved over 97% accuracy under textual corruption. Finally, on two real clinical datasets, PhenoQC successfully imputed missing values, enforced schema compliance, and flagged data anomalies without significant overhead.
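The abstract does not detail the benchmark protocol; the sketch below shows one plausible masking-and-recovery check of the kind reported, with hypothetical column names, a scaled-down record count, and a KNN imputer standing in for PhenoQC’s machine learning-based imputation:

    # Hide a fraction of known numeric values, impute them, and measure recovery.
    import numpy as np
    import pandas as pd
    from sklearn.impute import KNNImputer

    rng = np.random.default_rng(42)
    n = 10_000  # scaled down from the 100,000-record benchmark for a quick run
    height = rng.normal(170, 10, n)
    weight = 0.9 * (height - 170) + 75 + rng.normal(0, 3, n)  # correlated column
    df = pd.DataFrame({"height_cm": height, "weight_kg": weight})

    # Mask 10% of one column and keep the ground truth aside.
    mask = rng.random(n) < 0.10
    truth = df.loc[mask, "weight_kg"].copy()
    df.loc[mask, "weight_kg"] = np.nan

    # Impute the artificially missing entries.
    imputed = pd.DataFrame(
        KNNImputer(n_neighbors=5).fit_transform(df),
        columns=df.columns, index=df.index,
    )

    # Recovery error on the masked entries only.
    mae = (imputed.loc[mask, "weight_kg"] - truth).abs().mean()
    print(f"masked entries: {mask.sum()}  MAE on recovered values: {mae:.3f}")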
Conclusions:
PhenoQC reduces manual curation time and delivers consistent, analysis-ready phenotypic data through a streamlined workflow. Its adaptable design accommodates evolving ontologies and domain-specific rules, enabling researchers to conduct more reliable studies.
Journal Introduction:
Informatics in Medicine Unlocked (IMU) is an international gold open access journal covering a broad spectrum of topics within medical informatics, including (but not limited to) papers focusing on imaging, pathology, teledermatology, public health, ophthalmological, nursing and translational medicine informatics. The full papers that are published in the journal are accessible to all who visit the website.