Differences in GenBank and RefSeq annotations may affect genomics data interpretation for Pseudomonas putida KT2440.
Guilherme Marcelino Viana de Siqueira, Thomas Eng, Aindrila Mukhopadhyay, María-Eugenia Guazzaroni
mSphere, e0039125. Published 2025-10-02. DOI: 10.1128/msphere.00391-25 (https://doi.org/10.1128/msphere.00391-25)
Abstract
Annotations of genomic features are cornerstone data that support routine workflows in conventional omics analyses in Pseudomonas putida KT2440 and other organisms. The GenBank and the RefSeq versions of the annotated KT2440 genome are two popular resources widely cited in the literature; yet, they originate from distinct prediction pipelines and possess potentially different biological information that is often overlooked. In this study, we systematically compared the features present in these resources and found that approximately 16% of the total of KT2440 open reading frames (ORFs) show differences in their predicted genomic positions across GenBank and RefSeq, despite sharing equivalent locus tag codes. Furthermore, we show that these discrepancies can affect the results of high-throughput analyses by processing a collection of RNAseq expression data sets utilizing both annotations. Our findings provide a comprehensive overview of the current state of available resources for genomics research in P. putida KT2440 and highlight a rarely addressed yet widespread potential pitfall in the literature on this organism, with possible implications for other prokaryotes.
IMPORTANCE
Genome annotation databases often rely on different statistical models for their function predictions and inherently carry biases propagated into studies using them. This work provides a quantitative assessment of two popular annotation resources for the model bacterium Pseudomonas putida KT2440 and their influence on data interpretation. As large-scale omics data sets are commonly used to inform experimental decisions, our results aim to promote awareness of the caveats associated with these computational resources and foster reproducibility and transparency in P. putida research.
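The kind of comparison the abstract describes, matching features by locus tag and flagging those whose coordinates disagree, can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the `Feature` type, the `compare_annotations` helper, the `TAG_…` identifiers, and all coordinates are hypothetical placeholders, and a real analysis would first parse the GenBank and RefSeq annotation files into such tables.

```python
from typing import Dict, List, NamedTuple


class Feature(NamedTuple):
    """Genomic position of one annotated ORF (hypothetical minimal record)."""
    start: int
    end: int
    strand: str


def compare_annotations(
    genbank: Dict[str, Feature], refseq: Dict[str, Feature]
) -> List[str]:
    """Return locus tags present in both annotations whose positions differ."""
    shared = sorted(genbank.keys() & refseq.keys())
    return [tag for tag in shared if genbank[tag] != refseq[tag]]


# Toy, made-up coordinates; these are NOT real KT2440 features.
genbank = {
    "TAG_0001": Feature(100, 400, "+"),
    "TAG_0002": Feature(500, 900, "-"),
    "TAG_0003": Feature(1000, 1300, "+"),  # only in the first annotation
}
refseq = {
    "TAG_0001": Feature(100, 400, "+"),    # identical position
    "TAG_0002": Feature(500, 930, "-"),    # same tag, different end coordinate
    "TAG_0004": Feature(1500, 1800, "+"),  # only in the second annotation
}

discordant = compare_annotations(genbank, refseq)
shared_tags = genbank.keys() & refseq.keys()
fraction = len(discordant) / len(shared_tags)

print(f"Discordant locus tags: {discordant}")
print(f"Fraction of shared tags with differing positions: {fraction:.2f}")
```

Restricting the comparison to shared locus tags is a deliberate choice in this sketch: it isolates the case highlighted in the abstract, where the same identifier points to different genomic positions, from the separate question of features that exist in only one annotation.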
About the journal:
mSphere™ is a multi-disciplinary open-access journal that will focus on rapid publication of fundamental contributions to our understanding of microbiology. Its scope will reflect the immense range of fields within the microbial sciences, creating new opportunities for researchers to share findings that are transforming our understanding of human health and disease, ecosystems, neuroscience, agriculture, energy production, climate change, evolution, biogeochemical cycling, and food and drug production. Submissions will be encouraged of all high-quality work that makes fundamental contributions to our understanding of microbiology. mSphere™ will provide streamlined decisions, while carrying on ASM's tradition for rigorous peer review.