PhenoQC: An integrated toolkit for quality control of phenotypic data in genomic research

JCR quartile: Q1 (Medicine)
Jorge Miguel Silva, José Luis Oliveira
Journal: Informatics in Medicine Unlocked, volume 58, Article 101693
Publication date: 2025-01-01
Publication type: Journal Article
DOI: 10.1016/j.imu.2025.101693
URL: https://www.sciencedirect.com/science/article/pii/S2352914825000826
Citation count: 0

Abstract


Background:

Large-scale genomic research requires robust, consistent phenotypic datasets for meaningful genotype–phenotype correlations. However, diverse collection protocols, incomplete entries, and heterogeneous terminologies frequently compromise data quality and slow downstream analysis.

Methodology:

To address these issues, we present PhenoQC, a high-throughput, configuration-driven toolkit that unifies schema validation, ontology-based semantic alignment, and missing-data imputation in a single workflow. Its modular architecture leverages chunk-based parallelism to handle large datasets, while customizable schemas enforce structural and type constraints. PhenoQC applies user-defined and state-of-the-art machine learning-based imputation and performs multi-ontology mapping with fuzzy matching to harmonize phenotype text. It also quantifies potential imputation-induced distributional shifts by reporting standardized mean difference, variance ratio, and Kolmogorov–Smirnov statistics for numeric variables, and population stability index and Cramér’s V for categorical variables, with user-configurable thresholds. The toolkit provides command-line and graphical interfaces for seamless integration into automated pipelines and interactive curation environments.
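The distributional-shift diagnostics named above can be illustrated with a minimal sketch. This is not PhenoQC's implementation: the function names, the pooled-standard-deviation form of the standardized mean difference, the per-category form of the population stability index, and the 2 x k contingency table used for Cramér's V are all assumptions chosen for illustration. Each function compares a variable before and after imputation using only the Python standard library.

```python
import math
from bisect import bisect_right
from collections import Counter
from statistics import mean, variance


def standardized_mean_difference(before, after):
    """Difference in means divided by the pooled standard deviation (assumed form)."""
    pooled_sd = math.sqrt((variance(before) + variance(after)) / 2)
    return (mean(after) - mean(before)) / pooled_sd if pooled_sd else 0.0


def variance_ratio(before, after):
    """Sample variance after imputation divided by sample variance before."""
    return variance(after) / variance(before)


def ks_statistic(before, after):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(before), sorted(after)
    gap = 0.0
    for x in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def population_stability_index(before, after, eps=1e-6):
    """PSI over category proportions; eps guards against empty categories."""
    p, q = Counter(before), Counter(after)
    psi = 0.0
    for cat in set(p) | set(q):
        pi = max(p[cat] / len(before), eps)
        qi = max(q[cat] / len(after), eps)
        psi += (qi - pi) * math.log(qi / pi)
    return psi


def cramers_v(before, after):
    """Cramér's V for the 2 x k table of (before/after) by category (assumed layout)."""
    p, q = Counter(before), Counter(after)
    cats = sorted(set(p) | set(q))
    n = len(before) + len(after)
    chi2 = 0.0
    for counts, row_total in ((p, len(before)), (q, len(after))):
        for cat in cats:
            expected = row_total * (p[cat] + q[cat]) / n
            chi2 += (counts[cat] - expected) ** 2 / expected
    k = min(2, len(cats)) - 1  # min(rows - 1, columns - 1) with two rows
    return math.sqrt(chi2 / (n * k)) if k else 0.0


if __name__ == "__main__":
    # Hypothetical numeric variable: observed values, then the same values
    # plus two entries filled in by mean imputation.
    observed = [4.1, 5.0, 5.6, 6.2, 7.3]
    imputed = observed + [mean(observed)] * 2
    print("SMD:", standardized_mean_difference(observed, imputed))
    print("Variance ratio:", variance_ratio(observed, imputed))
    print("KS:", ks_statistic(observed, imputed))
```

A user-configurable threshold, as the abstract describes, would then amount to comparing each returned value against a limit set in the configuration; the appropriate limits are study-specific and are not stated in the abstract.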

Results:

We benchmarked PhenoQC on synthetic datasets with up to 100,000 records, where it demonstrated near-linear scalability and full recovery of artificially missing numeric values. Moreover, PhenoQC’s ontology alignment achieved over 97% accuracy under textual corruption. Finally, using two real clinical datasets, PhenoQC successfully imputed missing values, enforced schema compliance, and flagged data anomalies without significant overhead.
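To make the idea of matching corrupted phenotype text concrete, the following is a minimal sketch of fuzzy label matching using Python's standard `difflib` module. It is not PhenoQC's matcher: the vocabulary, the `TERM:` identifiers, the `best_match` helper, and the 0.8 threshold are hypothetical placeholders, and the abstract does not say which similarity measure the toolkit uses.

```python
from difflib import SequenceMatcher

# Hypothetical mini-vocabulary; the IDs are placeholders, not real ontology identifiers.
VOCABULARY = {
    "TERM:0001": "Seizure",
    "TERM:0002": "Hypertension",
    "TERM:0003": "Headache",
}


def best_match(text, vocabulary, threshold=0.8):
    """Return (term_id, label, score) for the closest label, or None below threshold."""
    query = text.strip().lower()
    scored = [
        (SequenceMatcher(None, query, label.lower()).ratio(), term_id, label)
        for term_id, label in vocabulary.items()
    ]
    score, term_id, label = max(scored)
    return (term_id, label, score) if score >= threshold else None


if __name__ == "__main__":
    # A transposition typo still resolves to the intended label,
    # while unrelated text falls below the threshold and is left unmapped.
    print(best_match("Hypertensoin", VOCABULARY))
    print(best_match("xyzzy", VOCABULARY))
```

Returning `None` below the threshold rather than the nearest label keeps uncertain mappings visible for manual curation instead of silently assigning a wrong term.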

Conclusions:

PhenoQC saves manual curation time and ensures consistent, analysis-ready phenotypic data through its streamlined system. Its adaptable design adjusts to evolving ontologies and domain-specific rules, empowering researchers to conduct more reliable studies.
Source journal
Informatics in Medicine Unlocked (Medicine, Health Informatics)
CiteScore: 9.50
Self-citation rate: 0.00%
Articles published: 282
Review time: 39 days
Journal description: Informatics in Medicine Unlocked (IMU) is an international gold open access journal covering a broad spectrum of topics within medical informatics, including (but not limited to) papers focusing on imaging, pathology, teledermatology, public health, ophthalmological, nursing and translational medicine informatics. The full papers that are published in the journal are accessible to all who visit the website.