Karen O'Connor, Davy Weissenbacher, Amir Elyaderani, Ebbing Lautenbach, Matthew Scotch, Graciela Gonzalez-Hernandez
JMIR Research Protocols, vol. 14, e58567. Published 2025-04-22. DOI: 10.2196/58567 (https://doi.org/10.2196/58567). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12056431/pdf/
Trial registration: OSF Registries osf.io/wrh95; https://doi.org/10.17605/OSF.IO/WRH95.
International Registered Report Identifier (IRRID): DERR1-10.2196/58567.
Citations: 0
Abstract
Patient-Related Metadata Reported in Sequencing Studies of SARS-CoV-2: Protocol for a Scoping Review and Bibliometric Analysis.
Background: There has been an unprecedented effort to sequence the SARS-CoV-2 virus and examine its molecular evolution. This has been facilitated by publicly accessible databases such as GISAID (Global Initiative on Sharing All Influenza Data) and GenBank, which collectively hold millions of SARS-CoV-2 sequence records. Genomic epidemiology, however, seeks to go beyond phylogenetic analysis (the study of evolutionary relationships among biological entities) by linking genetic information to patient characteristics and disease outcomes, enabling a comprehensive understanding of transmission dynamics and disease impact. While these repositories include fields for patient-related metadata for a given sequence, these demographic and clinical details are rarely included. The extent and quality of patient-related metadata in published sequencing studies remain unexplored.
Objective: Our review aims to quantitatively assess the extent and quality of patient-related metadata in papers reporting original whole genome sequencing of the SARS-CoV-2 virus and to analyze publication patterns using bibliometric analysis. We will also evaluate the efficacy and reliability of a machine learning classifier in identifying relevant papers for inclusion in the scoping review.
Methods: The National Institutes of Health's LitCovid collection will be used for the automated classification of papers that report depositing SARS-CoV-2 sequences in public repositories, while an independent search will be conducted in MEDLINE and PubMed Central for validation. Data extraction will be conducted using Covidence (Veritas Health Innovation Ltd). The extracted data will be synthesized and summarized to quantify the availability of patient metadata in the published literature of SARS-CoV-2 sequencing studies. For the bibliometric analysis, relevant data points, such as author affiliations, citation metrics, author keywords, and Medical Subject Headings terms, will be extracted.
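The validation step described above can be illustrated with a minimal sketch. It assumes that classifier performance is summarized with precision, recall, and F1 against a manually confirmed reference set; the abstract does not name the metrics the authors will use, and the function name, class name, and paper identifiers below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassifierScores:
    precision: float
    recall: float
    f1: float


def score_classifier(predicted: set[str], relevant: set[str]) -> ClassifierScores:
    """Compare classifier-flagged paper IDs with manually confirmed relevant IDs."""
    true_positives = len(predicted & relevant)
    # Guard against empty sets so the function never divides by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return ClassifierScores(precision, recall, f1)


# Hypothetical identifiers for illustration only (not data from the study).
flagged_by_classifier = {"P1", "P2", "P3", "P4"}
confirmed_by_screening = {"P2", "P3", "P4", "P5", "P6"}

scores = score_classifier(flagged_by_classifier, confirmed_by_screening)
print(f"precision={scores.precision:.2f} recall={scores.recall:.2f} f1={scores.f1:.2f}")
```

In this sketch, recall reflects how many relevant papers found by the independent search the classifier also flagged, while precision reflects how many flagged papers survive screening.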
Results: This study is expected to be completed in early 2025. Our classification model has been developed, and we have classified publications in LitCovid published through February 2023. As of September 2024, papers published through August 2024 are being prepared for processing. Screening of validated papers from the classifier is underway. Direct literature searches and screening of the results began in October 2024. We will summarize and narratively describe our findings using tables, graphs, and charts where applicable.
Conclusions: This scoping review will report findings on the extent and types of patient-related metadata reported in genomic viral sequencing studies of SARS-CoV-2, identify gaps in the reporting of patient metadata, and make recommendations for improving the quality and consistency of reporting in this area. The bibliometric analysis will uncover trends and patterns in the reporting of patient-related metadata, including differences in reporting based on study types or geographic regions. The insights gained from this study may help improve the quality and consistency of reporting patient metadata, enhancing the utility of sequence metadata and facilitating future research on infectious diseases.