{"title":"SProtFP: a machine learning-based method for functional classification of small ORFs in prokaryotes.","authors":"Akshay Khanduja, Debasisa Mohanty","doi":"10.1093/nargab/lqae186","DOIUrl":"10.1093/nargab/lqae186","url":null,"abstract":"<p><p>Small proteins (≤100 amino acids) play important roles across all life forms, ranging from unicellular bacteria to higher organisms. In this study, we have developed SProtFP which is a machine learning-based method for functional annotation of prokaryotic small proteins into selected functional categories. SProtFP uses independent artificial neural networks (ANNs) trained using a combination of physicochemical descriptors for classifying small proteins into antitoxin type 2, bacteriocin, DNA-binding, metal-binding, ribosomal protein, RNA-binding, type 1 toxin and type 2 toxin proteins. We have also trained a model for identification of small open reading frame (smORF)-encoded antimicrobial peptides (AMPs). Comprehensive benchmarking of SProtFP revealed an average area under the receiver operator curve (ROC-AUC) of 0.92 during 10-fold cross-validation and an ROC-AUC of 0.94 and 0.93 on held-out balanced and imbalanced test sets. Utilizing our method to annotate bacterial isolates from the human gut microbiome, we could identify thousands of remote homologs of known small protein families and assign putative functions to uncharacterized proteins. This highlights the utility of SProtFP for large-scale functional annotation of microbiome datasets, especially in cases where sequence homology is low. SProtFP is freely available at http://www.nii.ac.in/sprotfp.html and can be combined with genome annotation tools such as ProsmORF-pred to uncover the functional repertoire of novel small proteins in bacteria.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"7 1","pages":"lqae186"},"PeriodicalIF":4.0,"publicationDate":"2025-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11704790/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142956065","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mariia Minaeva, Júlia Domingo, Philipp Rentzsch, Tuuli Lappalainen
{"title":"Specifying cellular context of transcription factor regulons for exploring context-specific gene regulation programs.","authors":"Mariia Minaeva, Júlia Domingo, Philipp Rentzsch, Tuuli Lappalainen","doi":"10.1093/nargab/lqae178","DOIUrl":"10.1093/nargab/lqae178","url":null,"abstract":"<p><p>Understanding the role of transcription and transcription factors (TFs) in cellular identity and disease, such as cancer, is essential. However, comprehensive data resources for cell line-specific TF-to-target gene annotations are currently limited. To address this, we employed a straightforward method to define regulons that capture the cell-specific aspects of TF binding and transcript expression levels. By integrating cellular transcriptome and TF binding data, we generated regulons for 40 common cell lines comprising both proximal and distal cell line-specific regulatory events. Through systematic benchmarking involving TF knockout experiments, we demonstrated performance on par with state-of-the-art methods, with our method being easily applicable to other cell types of interest. We present case studies using three cancer single-cell datasets to showcase the utility of these cell-type-specific regulons in exploring transcriptional dysregulation. In summary, this study provides a valuable pipeline and a resource for systematically exploring cell line-specific transcriptional regulations, emphasizing the utility of network analysis in deciphering disease mechanisms.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"7 1","pages":"lqae178"},"PeriodicalIF":4.0,"publicationDate":"2025-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11704787/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142956064","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"vClean: assessing virus sequence contamination in viral genomes.","authors":"Ryota Wagatsuma, Yohei Nishikawa, Masahito Hosokawa, Haruko Takeyama","doi":"10.1093/nargab/lqae185","DOIUrl":"10.1093/nargab/lqae185","url":null,"abstract":"<p><p>Recent advancements in viral metagenomics and single-virus genomics have improved our ability to obtain the draft genomes of environmental viruses. However, these methods can introduce virus sequence contaminations into viral genomes when short, fragmented partial sequences are present in the assembled contigs. These contaminations can lead to incorrect analyses; however, practical detection tools are lacking. In this study, we introduce vClean, a novel automated tool that detects contaminations in viral genomes. By applying machine learning to the nucleotide sequence features and gene patterns of the input viral genome, vClean could identify contaminations. Specifically, for tailed double-stranded DNA phages, we attempted accurate predictions by defining single-copy-like genes and counting their duplications. We evaluated the performance of vClean using simulated datasets derived from complete reference genomes, achieving a binary accuracy of 0.932. When vClean was applied to 4693 genomes of medium or higher quality derived from public ocean metagenomic data, 1604 genomes (34.2%) were identified as contaminated. We also demonstrated that vClean can detect contamination in single-virus genome data obtained from river water. vClean provides a new benchmark for quality control of environmental viral genomes and has the potential to become an essential tool for environmental viral genome analysis.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"7 1","pages":"lqae185"},"PeriodicalIF":4.0,"publicationDate":"2025-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11704788/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142956067","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ProPr54 web server: predicting σ<sup>54</sup> promoters and regulon with a hybrid convolutional and recurrent deep neural network.","authors":"Tristan Achterberg, Anne de Jong","doi":"10.1093/nargab/lqae188","DOIUrl":"10.1093/nargab/lqae188","url":null,"abstract":"<p><p>σ<sup>54</sup> serves as an unconventional sigma factor with a distinct mechanism of transcription initiation, which depends on the involvement of a transcription activator. This unique sigma factor σ<sup>54</sup> is indispensable for orchestrating the transcription of genes crucial to nitrogen regulation, flagella biosynthesis, motility, chemotaxis and various other essential cellular processes. Currently, no comprehensive tools are available to determine σ<sup>54</sup> promoters and regulon in bacterial genomes. Here, we report a σ<sup>54</sup> promoter prediction method ProPr54, based on a convolutional neural network trained on a set of 446 validated σ<sup>54</sup> binding sites derived from 33 bacterial species. Model performance was tested and compared with respect to bacterial intergenic regions, demonstrating robust applicability. ProPr54 exhibits high performance when tested on various bacterial species, highly surpassing other available σ<sup>54</sup> regulon identification methods. Furthermore, analysis on bacterial genomes, which have no experimentally validated σ<sup>54</sup> binding sites, demonstrates the generalization of the model. ProPr54 is the first reliable <i>in</i> <i>silico</i> method for predicting σ<sup>54</sup> binding sites, making it a valuable tool to support experimental studies on σ<sup>54</sup>. In conclusion, ProPr54 offers a reliable, broadly applicable tool for predicting σ<sup>54</sup> promoters and regulon genes in bacterial genome sequences. A web server is freely accessible at http://propr54.molgenrug.nl.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"7 1","pages":"lqae188"},"PeriodicalIF":4.0,"publicationDate":"2025-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11704786/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142956062","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sapir Margalit, Zuzana Tulpová, Tahir Detinis Zur, Yael Michaeli, Jasline Deek, Gil Nifker, Rita Haldar, Yehudit Gnatek, Dorit Omer, Benjamin Dekel, Hagit Baris Feldman, Assaf Grunwald, Yuval Ebenstein
{"title":"Long-read structural and epigenetic profiling of a kidney tumor-matched sample with nanopore sequencing and optical genome mapping.","authors":"Sapir Margalit, Zuzana Tulpová, Tahir Detinis Zur, Yael Michaeli, Jasline Deek, Gil Nifker, Rita Haldar, Yehudit Gnatek, Dorit Omer, Benjamin Dekel, Hagit Baris Feldman, Assaf Grunwald, Yuval Ebenstein","doi":"10.1093/nargab/lqae190","DOIUrl":"10.1093/nargab/lqae190","url":null,"abstract":"<p><p>Carcinogenesis often involves significant alterations in the cancer genome, marked by large structural variants (SVs) and copy number variations (CNVs) that are difficult to capture with short-read sequencing. Traditionally, cytogenetic techniques are applied to detect such aberrations, but they are limited in resolution and do not cover features smaller than several hundred kilobases. Optical genome mapping (OGM) and nanopore sequencing [Oxford Nanopore Technologies (ONT)] bridge this resolution gap and offer enhanced performance for cytogenetic applications. Additionally, both methods can capture epigenetic information as they profile native, individual DNA molecules. We compared the effectiveness of the two methods in characterizing the structural, copy number and epigenetic landscape of a clear cell renal cell carcinoma tumor. Both methods provided comparable results for basic karyotyping and CNVs, but differed in their ability to detect SVs of different sizes and types. ONT outperformed OGM in detecting small SVs, while OGM excelled in detecting larger SVs, including translocations. Differences were also observed among various ONT SV callers. Additionally, both methods provided insights into the tumor's methylome and hydroxymethylome. While ONT was superior in methylation calling, hydroxymethylation reports can be further optimized. Our findings underscore the importance of carefully selecting the most appropriate platform based on specific research questions.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"7 1","pages":"lqae190"},"PeriodicalIF":4.0,"publicationDate":"2025-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11704781/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142955985","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Phenotype prediction in plants is improved by integrating large-scale transcriptomic datasets.","authors":"Zefeng Wu, Yali Sun, Xiaoqiang Zhao, Zigang Liu, Wenqi Zhou, Yining Niu","doi":"10.1093/nargab/lqae184","DOIUrl":"10.1093/nargab/lqae184","url":null,"abstract":"<p><p>Research on the dynamic expression of genes in plants is important for understanding different biological processes. We used the large amounts of transcriptomic data from various plant sample sources that are publicly available to investigate whether the expression levels of a subset of highly variable genes (HVGs) can be used to accurately identify the phenotypes of plants. Using maize (<i>Zea mays</i> L.) as an example, we built machine learning (ML) models to predict phenotypes using a gene expression dataset of 21 612 bulk RNA sequencing samples. We showed that the ML models achieved excellent prediction accuracy using only the HVGs to identify different phenotypes, including tissue types, developmental stages, cultivars and stress conditions. By ML models, several important functional genes were found to be associated with different phenotypes. We performed a similar analysis in rice (<i>Orzya sativa</i> L.) and found that the ML models could be generalized across species. However, the models trained from maize did not perform well in rice, probably because of the expression divergence of the conserved HVGs between the two species. Overall, our results provide an ML framework for phenotype prediction using gene expression profiles, which may contribute to precision management of crops in agricultural practices.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 4","pages":"lqae184"},"PeriodicalIF":4.0,"publicationDate":"2024-12-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11672113/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142903716","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jalees A Nasir, Finlay Maguire, Kendrick M Smith, Emily M Panousis, Sheridan J C Baker, Patryk Aftanas, Amogelang R Raphenya, Brian P Alcock, Hassaan Maan, Natalie C Knox, Arinjay Banerjee, Karen Mossman, Bo Wang, Jared T Simpson, Robert A Kozak, Samira Mubareka, Andrew G McArthur
{"title":"SARS-CoV-2 Illumina GeNome Assembly Line (SIGNAL), a Snakemate workflow for rapid and bulk analysis of Illumina sequencing of SARS-CoV-2 genomes.","authors":"Jalees A Nasir, Finlay Maguire, Kendrick M Smith, Emily M Panousis, Sheridan J C Baker, Patryk Aftanas, Amogelang R Raphenya, Brian P Alcock, Hassaan Maan, Natalie C Knox, Arinjay Banerjee, Karen Mossman, Bo Wang, Jared T Simpson, Robert A Kozak, Samira Mubareka, Andrew G McArthur","doi":"10.1093/nargab/lqae176","DOIUrl":"10.1093/nargab/lqae176","url":null,"abstract":"<p><p>The incorporation of sequencing technologies in frontline and public health healthcare settings was vital in developing virus surveillance programs during the Coronavirus Disease 2019 (COVID-19) pandemic caused by transmission of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). However, increased data acquisition poses challenges for both rapid and accurate analyses. To overcome these hurdles, we developed the SARS-CoV-2 Illumina GeNome Assembly Line (SIGNAL) for quick bulk analyses of Illumina short-read sequencing data. SIGNAL is a Snakemake workflow that seamlessly manages parallel tasks to process large volumes of sequencing data. A series of outputs are generated, including consensus genomes, variant calls, lineage assessments and identified variants of concern (VOCs). Compared to other existing SARS-CoV-2 sequencing workflows, SIGNAL is one of the fastest-performing analysis tools while maintaining high accuracy. The source code is publicly available (github.com/jaleezyy/covid-19-signal) and is optimized to run on various systems, with software compatibility and resource management all handled within the workflow. Overall, SIGNAL illustrated its capacity for high-volume analyses through several contributions to publicly funded government public health surveillance programs and can be a valuable tool for continuing SARS-CoV-2 Illumina sequencing efforts and will inform the development of similar strategies for rapid viral sequence assessment.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 4","pages":"lqae176"},"PeriodicalIF":4.0,"publicationDate":"2024-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11655287/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142865723","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring transcription modalities from bimodal, single-cell RNA sequencing data.","authors":"Enikő Regényi, Mir-Farzin Mashreghi, Christof Schütte, Vikram Sunkara","doi":"10.1093/nargab/lqae179","DOIUrl":"10.1093/nargab/lqae179","url":null,"abstract":"<p><p>There is a growing interest in generating bimodal, single-cell RNA sequencing (RNA-seq) data for studying biological pathways. These data are predominantly utilized in understanding phenotypic trajectories using RNA velocities; however, the shape information encoded in the two-dimensional resolution of such data is not yet exploited. In this paper, we present an elliptical parametrization of two-dimensional RNA-seq data, from which we derived statistics that reveal four different modalities. These modalities can be interpreted as manifestations of the changes in the rates of splicing, transcription or degradation. We performed our analysis on a cell cycle and a colorectal cancer dataset. In both datasets, we found genes that are not picked up by differential gene expression analysis (DGEA), and are consequently unnoticed, yet visibly delineate phenotypes. This indicates that, in addition to DGEA, searching for genes that exhibit the discovered modalities could aid recovering genes that set phenotypes apart. For communities studying biomarkers and cellular phenotyping, the modalities present in bimodal RNA-seq data broaden the search space of genes, and furthermore, allow for incorporating cellular RNA processing into regulatory analyses.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 4","pages":"lqae179"},"PeriodicalIF":4.0,"publicationDate":"2024-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11655292/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142865652","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mining single-cell data for cell type-disease associations.","authors":"Kevin G Chen, Kathryn O Farley, Timo Lassmann","doi":"10.1093/nargab/lqae180","DOIUrl":"10.1093/nargab/lqae180","url":null,"abstract":"<p><p>A robust understanding of the cellular mechanisms underlying diseases sets the foundation for the effective design of drugs and other interventions. The wealth of existing single-cell atlases offers the opportunity to uncover high-resolution information on expression patterns across various cell types and time points. To better understand the associations between cell types and diseases, we leveraged previously developed tools to construct a standardized analysis pipeline and systematically explored associations across four single-cell datasets, spanning a range of tissue types, cell types and developmental time periods. We utilized a set of existing tools to identify co-expression modules and temporal patterns per cell type and then investigated these modules for known disease and phenotype enrichments. Our pipeline reveals known and novel putative cell type-disease associations across all investigated datasets. In addition, we found that automatically discovered gene co-expression modules and temporal clusters are enriched for drug targets, suggesting that our analysis could be used to identify novel therapeutic targets.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 4","pages":"lqae180"},"PeriodicalIF":4.0,"publicationDate":"2024-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11655289/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142865709","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shibin Long, Yan Xia, Lifeng Liang, Ying Yang, Hailiang Xie, Xiaokai Wang
{"title":"PyNetCor: a high-performance Python package for large-scale correlation analysis.","authors":"Shibin Long, Yan Xia, Lifeng Liang, Ying Yang, Hailiang Xie, Xiaokai Wang","doi":"10.1093/nargab/lqae177","DOIUrl":"10.1093/nargab/lqae177","url":null,"abstract":"<p><p>The development of multi-omics technologies has generated an abundance of biological datasets, providing valuable resources for investigating potential relationships within complex biological systems. However, most correlation analysis tools face computational challenges when dealing with these high-dimensional datasets containing millions of features. Here, we introduce pyNetCor, a fast and scalable tool for constructing correlation networks on large-scale and high-dimensional data. PyNetCor features optimized algorithms for both full correlation coefficient matrix computation and top-k correlation search, outperforming other tools in the field in terms of runtime and memory consumption. It utilizes a linear interpolation strategy to rapidly estimate <i>P-</i>values and achieve false discovery rate control, demonstrating a speedup of over 110 times compared to existing methods. Overall, pyNetCor supports large-scale correlation analysis, a crucial foundational step for various bioinformatics workflows, and can be easily integrated into downstream applications to accelerate the process of extracting biological insights from data.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 4","pages":"lqae177"},"PeriodicalIF":4.0,"publicationDate":"2024-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11655297/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142865719","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}