{"title":"Consistent features observed in structural probing data of eukaryotic RNAs.","authors":"Kazuteru Yamamura, Kiyoshi Asai, Junichi Iwakiri","doi":"10.1093/nargab/lqaf001","DOIUrl":"10.1093/nargab/lqaf001","url":null,"abstract":"<p><p>Understanding RNA structure is crucial for elucidating its regulatory mechanisms. With the recent commercialization of messenger RNA vaccines, the profound impact of RNA structure on stability and translation efficiency has become increasingly evident, underscoring the importance of understanding RNA structure. Chemical probing of RNA has emerged as a powerful technique for investigating RNA structure in living cells. This approach utilizes chemical probes that selectively react with accessible regions of RNA, and by measuring reactivity, the openness and potential of RNA for protein binding or base pairing can be inferred. Extensive experimental data generated using RNA chemical probing have significantly contributed to our understanding of RNA structure in cells. However, it is crucial to acknowledge potential biases in chemical probing data to ensure an accurate interpretation. In this study, we comprehensively analyzed transcriptome-scale RNA chemical probing data in eukaryotes and report common features. 
Notably, in all experiments, the number of bases modified in probing was small, the bases showing the top 10% reactivity well reflected the known secondary structure, bases with high reactivity were more likely to be exposed to solvent and low reactivity did not reflect solvent exposure, which is important information for the analysis of RNA chemical probing data.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"7 1","pages":"lqaf001"},"PeriodicalIF":4.0,"publicationDate":"2025-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11780854/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143068391","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"scHiGex: predicting single-cell gene expression based on single-cell Hi-C data.","authors":"Bishal Shrestha, Andrew Jordan Siciliano, Hao Zhu, Tong Liu, Zheng Wang","doi":"10.1093/nargab/lqaf002","DOIUrl":"10.1093/nargab/lqaf002","url":null,"abstract":"<p><p>A novel biochemistry experiment named HiRES has been developed to capture both the chromosomal conformations and gene expression levels of individual single cells simultaneously. Nevertheless, when compared to the extensive volume of single-cell Hi-C data generated from individual cells, the number of datasets produced from this experiment remains limited in the scientific community. Hence, there is a requirement for a computational tool that can forecast the levels of gene expression in individual cells using single-cell Hi-C data from the same cells. We trained a graph transformer called scHiGex that accurately and effectively predicts gene expression levels based on single-cell Hi-C data. We conducted a benchmark of scHiGex that demonstrated notable performance on the predictions with an average absolute error of 0.07. Furthermore, the predicted levels of gene expression led to precise categorizations (adjusted Rand index score 1) of cells into distinct cell types, demonstrating that our model effectively captured the heterogeneity between individual cell types. 
scHiGex is freely available at https://github.com/zwang-bioinformatics/scHiGex.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"7 1","pages":"lqaf002"},"PeriodicalIF":4.0,"publicationDate":"2025-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11770341/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143053403","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"stana: an R package for metagenotyping analysis and interactive application based on clinical data.","authors":"Noriaki Sato, Kotoe Katayama, Daichi Miyaoka, Miho Uematsu, Ayumu Saito, Kosuke Fujimoto, Satoshi Uematsu, Seiya Imoto","doi":"10.1093/nargab/lqae191","DOIUrl":"10.1093/nargab/lqae191","url":null,"abstract":"<p><p>Metagenotyping of metagenomic data has recently attracted increasing attention as it resolves intraspecies diversity by identifying single nucleotide variants. Furthermore, gene copy number analysis within species provides a deeper understanding of metabolic functions in microbial communities. However, a platform for examining metagenotyping results based on relevant grouping data is lacking. Here, we have developed the R package, stana, for the processing and analysis of metagenotyping results. The package consists of modules for preprocessing, statistical analysis, functional analysis and visualization. An interactive analysis environment for exploring the metagenotyping results was also developed and publicly released with over 1000 publicly available metagenome samples related to human diseases. Three examples exploring the relationship between the metagenotypes of the gut microbiome and human diseases are presented: end-stage renal disease, Crohn's disease and Parkinson's disease. The results suggest that stana facilitated the confirmation of the original study's findings and the generation of a new hypothesis. 
The GitHub repository for the package is available at https://github.com/noriakis/stana.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"7 1","pages":"lqae191"},"PeriodicalIF":4.0,"publicationDate":"2025-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11707543/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142956066","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Whole-genome automated assembly pipeline for <i>Chlamydia trachomatis</i> strains from reference, <i>in vitro</i> and clinical samples using the integrated CtGAP pipeline.","authors":"Olusola Olagoke, Ammar Aziz, Lucile H Zhu, Timothy D Read, Deborah Dean","doi":"10.1093/nargab/lqae187","DOIUrl":"10.1093/nargab/lqae187","url":null,"abstract":"<p><p>Whole genome sequencing (WGS) is pivotal for the molecular characterization of <i>Chlamydia trachomatis</i> (<i>Ct</i>), the leading bacterial cause of sexually transmitted infections and infectious blindness worldwide. <i>Ct</i> WGS can inform epidemiologic, public health and outbreak investigations of these human-restricted pathogens. However, challenges persist in generating high-quality genomes for downstream analyses given its obligate intracellular nature and difficulty with <i>in vitro</i> propagation. No single tool exists for the entirety of <i>Ct</i> genome assembly, necessitating the adaptation of multiple programs with varying success. Compounding this issue is the absence of reliable <i>Ct</i> reference strain genomes. We, therefore, developed CtGAP (<i>Chlamydia trachomatis</i> Genome Assembly Pipeline) as an integrated 'one-stop-shop' pipeline for assembly and characterization of <i>Ct</i> genome sequencing data from various sources including isolates, <i>in vitro</i> samples, clinical swabs and urine. CtGAP, written in Snakemake, enables read quality statistics output, adapter and quality trimming, host read removal, <i>de novo</i> and reference-guided assembly, contig scaffolding, selective <i>omp</i>A, multi-locus-sequence and plasmid typing, phylogenetic tree construction, and recombinant genome identification. Twenty <i>Ct</i> reference genomes were also generated. 
Successfully validated on a diverse collection of 363 samples containing <i>Ct</i>, CtGAP represents a novel pipeline requiring minimal bioinformatics expertise with easy adaptation for use with other bacterial species.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"7 1","pages":"lqae187"},"PeriodicalIF":4.0,"publicationDate":"2025-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11704784/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142956068","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PSAURON: a tool for assessing protein annotation across a broad range of species.","authors":"Markus J Sommer, Aleksey V Zimin, Steven L Salzberg","doi":"10.1093/nargab/lqae189","DOIUrl":"10.1093/nargab/lqae189","url":null,"abstract":"<p><p>Evaluating the accuracy of protein-coding sequences in genome annotations is a challenging problem for which there is no broadly applicable solution. In this manuscript, we introduce PSAURON (Protein Sequence Assessment Using a Reference ORF Network), a novel software tool developed to help assess the quality of protein-coding gene annotations. Utilizing a machine learning model trained on a diverse dataset from over 1000 plant and animal genomes, PSAURON assigns a score to coding DNA or protein sequence that reflects the likelihood that the sequence is a genuine protein-coding region. PSAURON scores can be used for genome-wide protein annotation assessment as well as the rapid identification of potentially spurious annotated proteins. Validation against established benchmarks demonstrates PSAURON's effectiveness and correlation with recognized measures of protein quality, highlighting its potential use as a widely applicable method to evaluate precision in gene annotation. PSAURON is open source and freely available at https://github.com/salzberg-lab/PSAURON.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"7 1","pages":"lqae189"},"PeriodicalIF":4.0,"publicationDate":"2025-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11704789/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142956063","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SProtFP: a machine learning-based method for functional classification of small ORFs in prokaryotes.","authors":"Akshay Khanduja, Debasisa Mohanty","doi":"10.1093/nargab/lqae186","DOIUrl":"10.1093/nargab/lqae186","url":null,"abstract":"<p><p>Small proteins (≤100 amino acids) play important roles across all life forms, ranging from unicellular bacteria to higher organisms. In this study, we have developed SProtFP which is a machine learning-based method for functional annotation of prokaryotic small proteins into selected functional categories. SProtFP uses independent artificial neural networks (ANNs) trained using a combination of physicochemical descriptors for classifying small proteins into antitoxin type 2, bacteriocin, DNA-binding, metal-binding, ribosomal protein, RNA-binding, type 1 toxin and type 2 toxin proteins. We have also trained a model for identification of small open reading frame (smORF)-encoded antimicrobial peptides (AMPs). Comprehensive benchmarking of SProtFP revealed an average area under the receiver operator curve (ROC-AUC) of 0.92 during 10-fold cross-validation and an ROC-AUC of 0.94 and 0.93 on held-out balanced and imbalanced test sets. Utilizing our method to annotate bacterial isolates from the human gut microbiome, we could identify thousands of remote homologs of known small protein families and assign putative functions to uncharacterized proteins. This highlights the utility of SProtFP for large-scale functional annotation of microbiome datasets, especially in cases where sequence homology is low. 
SProtFP is freely available at http://www.nii.ac.in/sprotfp.html and can be combined with genome annotation tools such as ProsmORF-pred to uncover the functional repertoire of novel small proteins in bacteria.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"7 1","pages":"lqae186"},"PeriodicalIF":4.0,"publicationDate":"2025-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11704790/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142956065","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Specifying cellular context of transcription factor regulons for exploring context-specific gene regulation programs.","authors":"Mariia Minaeva, Júlia Domingo, Philipp Rentzsch, Tuuli Lappalainen","doi":"10.1093/nargab/lqae178","DOIUrl":"10.1093/nargab/lqae178","url":null,"abstract":"<p><p>Understanding the role of transcription and transcription factors (TFs) in cellular identity and disease, such as cancer, is essential. However, comprehensive data resources for cell line-specific TF-to-target gene annotations are currently limited. To address this, we employed a straightforward method to define regulons that capture the cell-specific aspects of TF binding and transcript expression levels. By integrating cellular transcriptome and TF binding data, we generated regulons for 40 common cell lines comprising both proximal and distal cell line-specific regulatory events. Through systematic benchmarking involving TF knockout experiments, we demonstrated performance on par with state-of-the-art methods, with our method being easily applicable to other cell types of interest. We present case studies using three cancer single-cell datasets to showcase the utility of these cell-type-specific regulons in exploring transcriptional dysregulation. 
In summary, this study provides a valuable pipeline and a resource for systematically exploring cell line-specific transcriptional regulations, emphasizing the utility of network analysis in deciphering disease mechanisms.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"7 1","pages":"lqae178"},"PeriodicalIF":4.0,"publicationDate":"2025-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11704787/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142956064","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"vClean: assessing virus sequence contamination in viral genomes.","authors":"Ryota Wagatsuma, Yohei Nishikawa, Masahito Hosokawa, Haruko Takeyama","doi":"10.1093/nargab/lqae185","DOIUrl":"10.1093/nargab/lqae185","url":null,"abstract":"<p><p>Recent advancements in viral metagenomics and single-virus genomics have improved our ability to obtain the draft genomes of environmental viruses. However, these methods can introduce virus sequence contaminations into viral genomes when short, fragmented partial sequences are present in the assembled contigs. These contaminations can lead to incorrect analyses; however, practical detection tools are lacking. In this study, we introduce vClean, a novel automated tool that detects contaminations in viral genomes. By applying machine learning to the nucleotide sequence features and gene patterns of the input viral genome, vClean could identify contaminations. Specifically, for tailed double-stranded DNA phages, we attempted accurate predictions by defining single-copy-like genes and counting their duplications. We evaluated the performance of vClean using simulated datasets derived from complete reference genomes, achieving a binary accuracy of 0.932. When vClean was applied to 4693 genomes of medium or higher quality derived from public ocean metagenomic data, 1604 genomes (34.2%) were identified as contaminated. We also demonstrated that vClean can detect contamination in single-virus genome data obtained from river water. 
vClean provides a new benchmark for quality control of environmental viral genomes and has the potential to become an essential tool for environmental viral genome analysis.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"7 1","pages":"lqae185"},"PeriodicalIF":4.0,"publicationDate":"2025-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11704788/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142956067","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ProPr54 web server: predicting σ<sup>54</sup> promoters and regulon with a hybrid convolutional and recurrent deep neural network.","authors":"Tristan Achterberg, Anne de Jong","doi":"10.1093/nargab/lqae188","DOIUrl":"10.1093/nargab/lqae188","url":null,"abstract":"<p><p>σ<sup>54</sup> serves as an unconventional sigma factor with a distinct mechanism of transcription initiation, which depends on the involvement of a transcription activator. This unique sigma factor σ<sup>54</sup> is indispensable for orchestrating the transcription of genes crucial to nitrogen regulation, flagella biosynthesis, motility, chemotaxis and various other essential cellular processes. Currently, no comprehensive tools are available to determine σ<sup>54</sup> promoters and regulon in bacterial genomes. Here, we report a σ<sup>54</sup> promoter prediction method ProPr54, based on a convolutional neural network trained on a set of 446 validated σ<sup>54</sup> binding sites derived from 33 bacterial species. Model performance was tested and compared with respect to bacterial intergenic regions, demonstrating robust applicability. ProPr54 exhibits high performance when tested on various bacterial species, highly surpassing other available σ<sup>54</sup> regulon identification methods. Furthermore, analysis on bacterial genomes, which have no experimentally validated σ<sup>54</sup> binding sites, demonstrates the generalization of the model. ProPr54 is the first reliable <i>in</i> <i>silico</i> method for predicting σ<sup>54</sup> binding sites, making it a valuable tool to support experimental studies on σ<sup>54</sup>. In conclusion, ProPr54 offers a reliable, broadly applicable tool for predicting σ<sup>54</sup> promoters and regulon genes in bacterial genome sequences. 
A web server is freely accessible at http://propr54.molgenrug.nl.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"7 1","pages":"lqae188"},"PeriodicalIF":4.0,"publicationDate":"2025-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11704786/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142956062","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Long-read structural and epigenetic profiling of a kidney tumor-matched sample with nanopore sequencing and optical genome mapping.","authors":"Sapir Margalit, Zuzana Tulpová, Tahir Detinis Zur, Yael Michaeli, Jasline Deek, Gil Nifker, Rita Haldar, Yehudit Gnatek, Dorit Omer, Benjamin Dekel, Hagit Baris Feldman, Assaf Grunwald, Yuval Ebenstein","doi":"10.1093/nargab/lqae190","DOIUrl":"10.1093/nargab/lqae190","url":null,"abstract":"<p><p>Carcinogenesis often involves significant alterations in the cancer genome, marked by large structural variants (SVs) and copy number variations (CNVs) that are difficult to capture with short-read sequencing. Traditionally, cytogenetic techniques are applied to detect such aberrations, but they are limited in resolution and do not cover features smaller than several hundred kilobases. Optical genome mapping (OGM) and nanopore sequencing [Oxford Nanopore Technologies (ONT)] bridge this resolution gap and offer enhanced performance for cytogenetic applications. Additionally, both methods can capture epigenetic information as they profile native, individual DNA molecules. We compared the effectiveness of the two methods in characterizing the structural, copy number and epigenetic landscape of a clear cell renal cell carcinoma tumor. Both methods provided comparable results for basic karyotyping and CNVs, but differed in their ability to detect SVs of different sizes and types. ONT outperformed OGM in detecting small SVs, while OGM excelled in detecting larger SVs, including translocations. Differences were also observed among various ONT SV callers. Additionally, both methods provided insights into the tumor's methylome and hydroxymethylome. While ONT was superior in methylation calling, hydroxymethylation reports can be further optimized. 
Our findings underscore the importance of carefully selecting the most appropriate platform based on specific research questions.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"7 1","pages":"lqae190"},"PeriodicalIF":4.0,"publicationDate":"2025-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11704781/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142955985","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}