Julian A Paganini, Jesse J Kerkvliet, Oscar Jordan, Gijs Teunis, Nienke L Plantinga, Rodrigo Meneses, Rob J L Willems, Sergio Arredondo-Alonso, Anita C Schürch
{"title":"gplasCC: classification and recovery of plasmids from short-read sequencing data for any bacterial species.","authors":"Julian A Paganini, Jesse J Kerkvliet, Oscar Jordan, Gijs Teunis, Nienke L Plantinga, Rodrigo Meneses, Rob J L Willems, Sergio Arredondo-Alonso, Anita C Schürch","doi":"10.1093/nargab/lqag028","DOIUrl":"https://doi.org/10.1093/nargab/lqag028","url":null,"abstract":"<p><p>Plasmids play a pivotal role in the spread of antibiotic resistance genes. Accurately reconstructing plasmids often requires long-read sequencing, but bacterial genomic data in publicly accessible repositories have historically been derived from short-read sequencing technology. We recently presented an approach for recovering <i>Escherichia coli</i> antimicrobial resistance plasmids using Illumina short reads. This method consisted of combining a robust binary classification tool named plasmidEC with gplas2, which is a tool that makes use of features of the assembly graph to bin predicted plasmid contigs into individual plasmids. Here, we developed plasmidCC, an upgrade from plasmidEC, capable of classifying plasmid contigs using Centrifuge databases. We have developed seven plasmidCC databases in addition to the database for <i>E. coli</i>: six species-specific models (<i>Acinetobacter baumannii, Enterococcus faecium, Enterococcus faecalis, Klebsiella pneumoniae, Staphylococcus aureus</i>, and <i>Salmonella enterica</i>) and one species-independent model for less frequently studied bacterial species. We combined these models with gplasCC to recover plasmids from >100 bacterial species. 
This approach allows comprehensive analysis of the wealth of bacterial short-read sequencing data available in public repositories and advances our understanding of microbial plasmids.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"8 1","pages":"lqag028"},"PeriodicalIF":2.8,"publicationDate":"2026-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12988325/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147469562","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimizing genomic language models for promoter prediction: a comparative study of tokenization and cross-species learning.","authors":"Eyal Hadad, Noia Kogman, Lina Golan, Anva Avraham, Reut Ben-Hamo, Zhi Wei, Lior Rokach, Isana Veksler-Lublinsky","doi":"10.1093/nargab/lqag025","DOIUrl":"https://doi.org/10.1093/nargab/lqag025","url":null,"abstract":"<p><p>Large Language Models (LLMs) are increasingly applied to genomic tasks, yet core challenges remain concerning tokenization, evaluation, and data scarcity. This study focuses on promoter classification and systematically evaluates four tokenization methods: non-overlapping 6-mer, overlapping 6-mer, Byte Pair Encoding (BPE), and WordPiece (WPC). We show that the commonly used k-mer approach, specifically the non-overlapping variant, outperforms BPE and WPC across eight organisms, challenging assumptions derived from natural language processing. To ensure robustness, we evaluated performance under two distinct negative data strategies: positive-promoter-shuffled and random-non-promoter-fragments. Using a positional SHAP framework, we demonstrate that the model learns biologically plausible positional patterns rather than exploiting artifacts from these negative data generation processes. Furthermore, evolutionary-informed transfer learning experiments and external validation on an unseen organism reveal that training on phylogenetically related species significantly improves performance, particularly in low-data regimes. 
These findings underscore the significant impact of tokenization and negative data design, providing practical guidance for refining genomic classifiers.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"8 1","pages":"lqag025"},"PeriodicalIF":2.8,"publicationDate":"2026-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12980338/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147469484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lorenzo Sola, Davide Bagordo, Simone Carpanzano, Mariangela Santorsola, Francesco Lescai
{"title":"Eduomics: a Nextflow pipeline to simulate -omics data for education.","authors":"Lorenzo Sola, Davide Bagordo, Simone Carpanzano, Mariangela Santorsola, Francesco Lescai","doi":"10.1093/nargab/lqag029","DOIUrl":"https://doi.org/10.1093/nargab/lqag029","url":null,"abstract":"<p><p>Bridging the gap between learning algorithms and biological interpretation remains the central challenge of bioinformatics education. To move students beyond just learning code and acquire higher-order knowledge, educators face a complexity overload when attempting to design realistic learning experiences. Simulating realistic data forces educators to master different tools and dependencies. Existing simulators are built for benchmarking and lack the narrative context essential for teaching interpretation; this limitation prevents a shift from problem-solving to what we believe should be an immersive 'storyline-based learning'. Eduomics facilitates massive scaling: educators can generate hundreds of unique, validated datasets for assessments or tutoring in a matter of hours. 
By removing barriers to adoption and embedding raw data within a rich clinical context, eduomics offers an accessible, scalable solution to truly innovate bioinformatics education.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"8 1","pages":"lqag029"},"PeriodicalIF":2.8,"publicationDate":"2026-03-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12972896/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147436260","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MycoMobilome: a community-focused non-redundant database of transposable element consensus sequences for the fungal kingdom.","authors":"Tobias Baril, Daniel Croll","doi":"10.1093/nargab/lqag026","DOIUrl":"10.1093/nargab/lqag026","url":null,"abstract":"<p><p>Transposable elements (TEs) are found in nearly all eukaryotic genomes. Despite significant advances in the sequencing of genomes, TE resources remain sparse, leading to a lack of traceability, reproducibility, and duplication of effort when annotating TEs. Here, we focus on the fungal kingdom and present MycoMobilome, a database of TE consensus sequences computationally curated using a set of 4309 genomes covering all major clades. The initial database contains 586 441 consensus sequences after filtering to remove putative host genes and low-quality consensus sequences. We provide a consistent naming convention to surface information on the confidence in the classification, including potential conflicting open reading frame functions, along with metadata to enable evaluation of TEs of interest and to determine whether further curation work is required on a case-by-case basis. Finally, we provide guidelines for community contributions and encourage researchers to deposit new or curated sequences, which will be incorporated into future MycoMobilome releases.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"8 1","pages":"lqag026"},"PeriodicalIF":2.8,"publicationDate":"2026-03-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12961425/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147378835","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sequence-based modeling of low-affinity transcription factor-DNA binding through deep learning.","authors":"Yingfei Wang, Jinsen Li, Tsu-Pei Chiu, Beibei Xin, Remo Rohs","doi":"10.1093/nargab/lqag027","DOIUrl":"10.1093/nargab/lqag027","url":null,"abstract":"<p><p>Multiple layers of molecular determinants and mechanisms affect binding specificity between transcription factors (TFs) and DNA. DNA sequence-based deep learning models using convolutional neural networks (CNNs) and self-attention (SA) transformers have improved modeling accuracy and advanced our understanding of TF-DNA binding specificity through network interpretation. However, the systematic evaluation of various strategies for handling DNA sequence orientations in deep learning models-and their interpretation-remains underexplored, especially in the context of learning low-affinity binding site specificity. Using SELEX-seq data for eight Exd-Hox heterodimers in <i>Drosophila</i>, we compared canonical models with data augmentation and reverse-complement weight-sharing models. We found that reverse-complement weight-sharing CNN models and SA models trained with augmented data with reverse complements outperformed other approaches in modeling binding specificity. In this work, we evaluated several interpretation methods, including Gradient*input, DeconvNet, DeepLIFT, and <i>in silico</i> mutagenesis (ISM). Compared to other interpretation methods, ISM was less sensitive to model hyperparameter settings. In this work, we identified Exd-Ubx binding at low-affinity sites and suggested possible biophysical mechanisms. 
The findings of this study will be relevant for studying the functional role of low-affinity TF binding in gene regulatory mechanisms with possible implications on TF-DNA binding specificity guided protein design.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"8 1","pages":"lqag027"},"PeriodicalIF":2.8,"publicationDate":"2026-03-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12961433/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147378808","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Maahil Arshad, Matthew Uchmanowicz, Vanshika Rana, Brett Trost, Stephen W Scherer, Muhammad Arshad Rafiq
{"title":"Mapping the inter- and intra-genic codon-usage landscape in <i>Homo sapiens</i>.","authors":"Maahil Arshad, Matthew Uchmanowicz, Vanshika Rana, Brett Trost, Stephen W Scherer, Muhammad Arshad Rafiq","doi":"10.1093/nargab/lqag024","DOIUrl":"10.1093/nargab/lqag024","url":null,"abstract":"<p><p>Although the genetic code is degenerate, codon selection is nonrandom and reflects significant functional constraints. Codon-usage bias (CUB) acts as a layer of post-transcriptional regulation, influencing messenger RNA (mRNA) stability, translation kinetics, and co-translational protein folding. While CUB is well-characterized in unicellular organisms, its regulatory scope and functional consequences in humans remain complex and less defined. Our study offers a comprehensive evaluation of human codon usage. We report that genes exhibiting the strongest codon bias are enriched in high-stoichiometry biological processes, such as skin development and oxygen/carbon dioxide transport, and harbor significantly fewer synonymous variants than expected (ρ = -0.24, <i>P</i> < 2.2 × 10<sup>-16</sup>). Furthermore, we find that codon optimization is structurally distinct: it is significantly more pronounced in structured protein domains compared to intrinsically disordered regions (IDRs) (Cliff's Δ = 0.26, <i>P</i> < 2.2 × 10<sup>-16</sup>). Consistent with translational selection, the most frequently used codons are supported by higher transfer RNA (tRNA) gene copy numbers (ρ = 0.49, <i>P</i> < 6.4 × 10<sup>-4</sup>). Finally, by correcting for GC3 content, we reveal that the apparent correlation between the effective number of codons and adaptation indices (CAI/tAI) vanishes, allowing us to disentangle mutational pressure from translational selection. Collectively, our findings position CUB as a central, evolutionarily conserved regulator of translation and protein folding in humans. 
Our results provide a comprehensive and integrated view of intergenic and intragenic CUB in humans, reinforcing the biological relevance of synonymous codon choice in shaping translational dynamics and protein biogenesis. This provides a refined framework for interpreting synonymous variation and guiding functional genomics.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"8 1","pages":"lqag024"},"PeriodicalIF":2.8,"publicationDate":"2026-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12954173/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147356728","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Recurrence plot reconstruction reveals chromosomal reorganization before territory formation.","authors":"Yuki Kitanishi, Hiroki Sugishita, Yukiko Gotoh, Yoshito Hirata","doi":"10.1093/nargab/lqag023","DOIUrl":"10.1093/nargab/lqag023","url":null,"abstract":"<p><p>Chromatin conformation capture methods such as Hi-C have improved understanding of nuclear architecture. However, reconstruction from single-cell Hi-C (scHi-C) data is challenging due to limited DNA contacts per cell. We have previously developed the recurrence plot-based reconstruction (RPR) method for reconstructing three-dimensional (3D) genomic structure from Hi-C data even from low-coverage DNA contact information. Here we used the RPR method to analyze scHi-C data derived from early-stage F<sub>1</sub> hybrid embryos as a proof-of-concept for understanding of global chromosomal architecture. We found that paternal and maternal chromosomes become gradually intermingled from the 1-cell to the 64-cell stage, and that discrete chromosome territories are largely established between 8-cell and 64-cell stages. We also observed Rabl-like polarization of chromosomes from the 2- to 8-cell stage, which was mostly dissolved by the 64-cell stage. We also noted transient rod-like extension and parallel chromosome alignment at the 4-cell stage. These findings indicate dynamic chromosomal reorganization before territory formation. 
RPR and scHi-C together capture 3D chromosomal architecture of individual cells during early embryogenesis.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"8 1","pages":"lqag023"},"PeriodicalIF":2.8,"publicationDate":"2026-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12932954/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147310717","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Risa Okubo, Takashi Morikura, Yusuke Hiki, Yuta Tokuoka, Tetsuya J Kobayashi, Takahiro G Yamada, Akira Funahashi
{"title":"Mixing features of transcription factors and genes enable accurate prediction of gene regulation relationships for unknown transcription factors.","authors":"Risa Okubo, Takashi Morikura, Yusuke Hiki, Yuta Tokuoka, Tetsuya J Kobayashi, Takahiro G Yamada, Akira Funahashi","doi":"10.1093/nargab/lqag022","DOIUrl":"10.1093/nargab/lqag022","url":null,"abstract":"<p><p>Identifying regulatory relationships between transcription factors (TFs) and genes is essential to understand diverse biological phenomena related to gene expression. Recently, deep learning-based models to predict TFs that bind to genes from nucleotide sequences of the target genes have been developed, yet these models are trained to predict known TFs only. Here, we developed a deep learning model, GReNIMJA (Gene Regulatory Network Inference by Mixing and Jointing features of Amino acid and nucleotide sequences), to predict gene regulation even by unknown TFs. Our model is designed to mix the features of the TF amino acid sequences and nucleotide sequences of the target genes using a 2D Long Short-Term Memory architecture and to perform binary classification with the aim of determining the presence or absence of a regulatory relationship. By explicitly modeling interactions between TFs and genes, our model can predict gene regulation for unknown TFs. The accuracy of our model in predicting regulatory relationships was 84.4% for known TFs (higher than those of conventional models) and 68.5% for unknown TFs; the latter is an unsolved task for conventional deep learning-based models. 
We expect our model to advance identification of unknown gene regulatory networks and contribute to the understanding of diverse biological phenomena.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"8 1","pages":"lqag022"},"PeriodicalIF":2.8,"publicationDate":"2026-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12954442/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147356711","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Burak Yelmen, Maris Alver, Merve Nur Güler, Flora Jay, Lili Milani
{"title":"Interpreting artificial neural networks to detect genome-wide association signals for complex traits.","authors":"Burak Yelmen, Maris Alver, Merve Nur Güler, Flora Jay, Lili Milani","doi":"10.1093/nargab/lqag019","DOIUrl":"10.1093/nargab/lqag019","url":null,"abstract":"<p><p>Investigating the genetic architecture of complex diseases is challenging due to the multifactorial interplay of genomic and environmental influences. Although GWAS have identified thousands of variants for multiple complex traits, conventional statistical approaches can be limited by simplified assumptions such as linearity and lack of epistasis. In this work, we trained artificial neural networks using genome-wide genotype data to predict simulated and real complex traits. We extracted feature importance scores via different post hoc interpretability methods to identify potentially associated locus/loci (PAL) for the target phenotype and devised an approach for estimating <i>P</i>-values for the detected PAL. Simulations demonstrated that associated loci can be detected with good precision using strict selection criteria. By applying our approach to the schizophrenia cohort in the Estonian Biobank, we detected multiple loci not identified by linear methods. There was significant concordance between PAL and loci previously associated with schizophrenia and bipolar disorder, with enrichment analyses of genes within the identified PAL predominantly highlighting terms related to brain morphology and function. 
With advancements in model optimization and uncertainty quantification, artificial neural networks have the potential to enhance the identification of genomic loci associated with complex diseases, offering a more comprehensive approach for GWAS.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"8 1","pages":"lqag019"},"PeriodicalIF":2.8,"publicationDate":"2026-02-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12964191/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147378780","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ragini Mishra, Nahid Akhtar, Jorge Samuel Leon Magdeleno, Abdul Rajjak Shaikh, Manik Prabhu Narsing Rao, Neeta Raj Sharma, Luigi Cavallo, Mohit Chawla
{"title":"Development of a vaccine construct against <i>Pneumocystis jirovecii</i> pneumonia using computational tools.","authors":"Ragini Mishra, Nahid Akhtar, Jorge Samuel Leon Magdeleno, Abdul Rajjak Shaikh, Manik Prabhu Narsing Rao, Neeta Raj Sharma, Luigi Cavallo, Mohit Chawla","doi":"10.1093/nargab/lqaf199","DOIUrl":"10.1093/nargab/lqaf199","url":null,"abstract":"<p><p><i>Pneumocystis jirovecii</i> poses a significant threat to immunocompromised individuals, necessitating the development of an effective vaccine. This study employs an immunoinformatics approach to design a promising vaccine candidate against <i>P. jirovecii</i>. Utilizing various computational tools, the study identified potential antigenic epitopes capable of eliciting immune responses within the <i>P. jirovecii</i> major surface glycoprotein C. The chosen epitopes were evaluated using computational tools for their allergenicity, interferon-γ and interleukin activation ability, and toxicity, ensuring the selection of immunogenic and safe candidates. These analyses led to the selection of 10 epitopes, which were then linked with adjuvants to model a potential vaccine candidate. Molecular docking and molecular dynamics simulations were performed in a solvent environment to investigate the binding interactions between the vaccine candidate and toll-like receptors, along with calculations of thermodynamic properties. Finally, <i>in silico</i> immune simulations were performed to analyze the immunogenic potential of the vaccine candidate. Future prospects include <i>in vitro</i> and <i>in vivo</i> validation of the vaccine candidate and the exploration of novel adjuvants to enhance its immunogenicity. This study contributes to the ongoing efforts to develop a preventive solution against <i>P. 
jirovecii</i> infections, addressing a critical gap in the protection of immunocompromised individuals.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"7 4","pages":"lqaf199"},"PeriodicalIF":2.8,"publicationDate":"2025-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12754782/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145890500","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}