PHA4GE quality control contextual data tags: standardized annotations for sharing public health sequence datasets with known quality issues to facilitate testing and training.

Impact factor: 4; Zone 2 (Biology); JCR Q1 (Genetics & Heredity)
Emma J Griffiths, Inês Mendes, Finlay Maguire, Jennifer L Guthrie, Bryan A Wee, Sarah Schmedes, Kathryn Holt, Chanchal Yadav, Rhiannon Cameron, Charlotte Barclay, Damion Dooley, Duncan MacCannell, Leonid Chindelevitch, Ilene Karsch-Mizrachi, Zahra Waheed, Lee Katz, Robert Petit III, Mugdha Dave, Paul Oluniyi, Muhammad Ibtisam Nasar, Amogelang Raphenya, William W L Hsiao, Ruth E Timme
Journal: Microbial Genomics, volume 10, issue 6. Publication date: 2024-06-01. Publication type: Journal Article.
DOI: 10.1099/mgen.0.001260 (https://doi.org/10.1099/mgen.0.001260)
Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11261899/pdf/
Citations: 0

Abstract

As public health laboratories expand their genomic sequencing and bioinformatics capacity for the surveillance of different pathogens, they must carry out robust validation, training, and optimization of wet- and dry-lab procedures. Achieving these goals for algorithms, pipelines and instruments often requires that lower-quality datasets be made available for analysis and comparison alongside those of higher quality. This range of data quality in reference sets can complicate the sharing of sub-optimal datasets that are vital for the community and for the reproducibility of assays. Sharing useful but sub-optimal datasets requires careful annotation and documentation of known issues, so that the data can be interpreted appropriately, are not mistaken for better-quality information, and (together with their derivatives) are easily identifiable in repositories. Unfortunately, there are currently no standardized attributes or mechanisms for tagging poor-quality datasets, or datasets generated for a specific purpose, to maximize their utility, searchability, accessibility and reuse. The Public Health Alliance for Genomic Epidemiology (PHA4GE) is an international community of scientists from public health, industry and academia focused on improving the reproducibility, interoperability, portability, and openness of public health bioinformatic software, skills, tools and data. To address the challenges of sharing lower-quality datasets, PHA4GE has developed a set of standardized contextual data tags, namely fields and terms, that can be included in public repository submissions to flag pathogen sequence data with known quality issues, thereby increasing their discoverability.
The contextual data tags were developed through consultations with the community, including input from the International Nucleotide Sequence Database Collaboration (INSDC), and have been standardized using ontologies, which are community-based resources for defining the tag properties and the relationships between them. The standardized tags are agnostic to the organism and the sequencing technique used, and thus can be applied to data generated from any pathogen using an array of sequencing techniques. The tags can also be applied to synthetic (lab-created) data. The list of standardized tags is maintained by PHA4GE and can be found at https://github.com/pha4ge/contextual_data_QC_tags. Definitions, ontology IDs, examples of use and a JSON representation are provided. The PHA4GE QC tags were tested, and are now implemented, by the FDA's GenomeTrakr laboratory network as part of its routine submission process for SARS-CoV-2 wastewater surveillance. We hope that these simple, standardized tags will help improve communication regarding quality control in public repositories, in addition to making datasets of variable quality more easily identifiable. Suggestions for additional tags can be submitted to PHA4GE via the New Term Request Form in the GitHub repository. By providing a mechanism for feedback and suggestions, we also expect that the tags will evolve with the needs of the community.
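
As a rough illustration of how such tags could make datasets with known issues easy to find, the Python sketch below separates flagged from unflagged records in a small JSON list. The field names (`qc_determination`, `qc_issues`), the sample records and the helper `split_by_qc_flag` are hypothetical placeholders chosen for this example; they are not the official PHA4GE fields or terms, which are defined in the repository linked above.

```python
import json

# Hypothetical submission records. The field names ("qc_determination",
# "qc_issues") and their values are illustrative placeholders, NOT the
# official PHA4GE tags; consult the PHA4GE repository for the real ones.
RECORDS_JSON = """
[
  {"sample_id": "S1", "qc_determination": "pass", "qc_issues": []},
  {"sample_id": "S2", "qc_determination": "fail", "qc_issues": ["low coverage"]},
  {"sample_id": "S3"}
]
"""


def split_by_qc_flag(records):
    """Return (flagged, unflagged) lists of records.

    A record counts as flagged when it carries a non-empty list under the
    hypothetical "qc_issues" key; records without the key are unflagged.
    """
    flagged, unflagged = [], []
    for record in records:
        if record.get("qc_issues"):
            flagged.append(record)
        else:
            unflagged.append(record)
    return flagged, unflagged


records = json.loads(RECORDS_JSON)
flagged, unflagged = split_by_qc_flag(records)
flagged_ids = [r["sample_id"] for r in flagged]
unflagged_ids = [r["sample_id"] for r in unflagged]
print("Flagged for known quality issues:", flagged_ids)
print("No quality issues recorded:", unflagged_ids)
```

In this sketch a missing tag is treated the same as an empty one, which is only one possible design choice; a real workflow might instead distinguish "not assessed" from "assessed with no issues".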

Source journal
Microbial Genomics (Medicine-Epidemiology)
CiteScore: 6.60
Self-citation rate: 2.60%
Articles published: 153
Review time: 12 weeks
About the journal: Microbial Genomics (MGen) is a fully open access, mandatory open data and peer-reviewed journal publishing high-profile original research on archaea, bacteria, microbial eukaryotes and viruses.