{"title":"Evaluation of machine learning models that predict lncRNA subcellular localization.","authors":"Jason R Miller, Weijun Yi, Donald A Adjeroh","doi":"10.1093/nargab/lqae125","DOIUrl":"https://doi.org/10.1093/nargab/lqae125","url":null,"abstract":"<p><p>The lncATLAS database quantifies the relative cytoplasmic versus nuclear abundance of long non-coding RNAs (lncRNAs) observed in 15 human cell lines. The literature describes several machine learning models trained and evaluated on these and similar datasets. These reports showed moderate performance, <i>e.g</i>. 72-74% accuracy, on test subsets of the data withheld from training. In all these reports, the datasets were filtered to include genes with extreme values while excluding genes with values in the middle range and the filters were applied prior to partitioning the data into training and testing subsets. Using several models and lncATLAS data, we show that this 'middle exclusion' protocol boosts performance metrics without boosting model performance on unfiltered test data. We show that various models achieve only about 60% accuracy when evaluated on unfiltered lncRNA data. We suggest that the problem of predicting lncRNA subcellular localization from nucleotide sequences is more challenging than currently perceived. We provide a basic model and evaluation procedure as a benchmark for future studies of this problem.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 3","pages":"lqae125"},"PeriodicalIF":4.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11409063/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142297109","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mateusz Garbulowski, Thomas Hillerton, Daniel Morgan, Deniz Seçilmiş, Lisbet Sonnhammer, Andreas Tjärnberg, Torbjörn E M Nordling, Erik L L Sonnhammer
{"title":"GeneSPIDER2: large scale GRN simulation and benchmarking with perturbed single-cell data.","authors":"Mateusz Garbulowski, Thomas Hillerton, Daniel Morgan, Deniz Seçilmiş, Lisbet Sonnhammer, Andreas Tjärnberg, Torbjörn E M Nordling, Erik L L Sonnhammer","doi":"10.1093/nargab/lqae121","DOIUrl":"https://doi.org/10.1093/nargab/lqae121","url":null,"abstract":"<p><p>Single-cell data is increasingly used for gene regulatory network (GRN) inference, and benchmarks for this have been developed based on simulated data. However, existing single-cell simulators cannot model the effects of gene perturbations. A further challenge lies in generating large-scale GRNs that often struggle with computational and stability issues. We present GeneSPIDER2, an update of the GeneSPIDER MATLAB toolbox for GRN benchmarking, inference, and analysis. Several software modules have improved capabilities and performance, and new functionalities have been added. A major improvement is the ability to generate large GRNs with biologically realistic topological properties in terms of scale-free degree distribution and modularity. Another major addition is a simulation of single-cell data, which is becoming increasingly popular as input for GRN inference. Specifically, we introduced the unique feature to generate single-cell data based on genetic perturbations. Finally, the simulated single-cell data was compared to real single-cell Perturb-seq data from two cell lines, showing that the synthetic and real data exhibit similar properties.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 3","pages":"lqae121"},"PeriodicalIF":4.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11409065/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142297111","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Transformer model generated bacteriophage genomes are compositionally distinct from natural sequences.","authors":"Jeremy Ratcliff","doi":"10.1093/nargab/lqae129","DOIUrl":"https://doi.org/10.1093/nargab/lqae129","url":null,"abstract":"<p><p>Novel applications of language models in genomics promise to have a large impact on the field. The megaDNA model is the first publicly available generative model for creating synthetic viral genomes. To evaluate megaDNA's ability to recapitulate the nonrandom genome composition of viruses and assess whether synthetic genomes can be algorithmically detected, compositional metrics for 4969 natural bacteriophage genomes and 1002 <i>de novo</i> synthetic bacteriophage genomes were compared. Transformer-generated sequences had varied but realistic genome lengths, and 58% were classified as viral by geNomad. However, the sequences demonstrated consistent differences in various compositional metrics when compared to natural bacteriophage genomes by rank-sum tests and principal component analyses. A simple neural network trained to detect transformer-generated sequences on global compositional metrics alone displayed a median sensitivity of 93.0% and specificity of 97.9% (<i>n</i> = 12 independent models). Overall, these results demonstrate that megaDNA does not yet generate bacteriophage genomes with realistic compositional biases and that genome composition is a reliable method for detecting sequences generated by this model. While the results are specific to the megaDNA model, the evaluated framework described here could be applied to any generative model for genomic sequences.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 3","pages":"lqae129"},"PeriodicalIF":4.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11409064/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142297112","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Correction to <b>'</b>long non-coding RNAs involved in <i>Drosophila</i> development and regeneration'.","authors":"","doi":"10.1093/nargab/lqae127","DOIUrl":"https://doi.org/10.1093/nargab/lqae127","url":null,"abstract":"<p><p>[This corrects the article DOI: 10.1093/nargab/lqae091.].</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 3","pages":"lqae127"},"PeriodicalIF":4.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11400925/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142297108","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Maria-Anna Trapotsi, Jasper van Lopik, Gregory J Hannon, Benjamin Czech Nicholson, Susanne Bornelöv
{"title":"FlaHMM: unistrand <i>flamenco</i>-like piRNA cluster prediction in <i>Drosophila</i> species using hidden Markov models.","authors":"Maria-Anna Trapotsi, Jasper van Lopik, Gregory J Hannon, Benjamin Czech Nicholson, Susanne Bornelöv","doi":"10.1093/nargab/lqae119","DOIUrl":"https://doi.org/10.1093/nargab/lqae119","url":null,"abstract":"<p><p>PIWI-interacting RNAs (piRNAs) are a class of small non-coding RNAs that are essential for transposon control in animal gonads. In <i>Drosophila</i> ovarian somatic cells, piRNAs are transcribed from large genomic regions called piRNA clusters, which are enriched for transposon fragments and act as a memory of past invasions. Despite being widely present across <i>Drosophila</i> species, somatic piRNA clusters are difficult to identify and study due to their lack of sequence conservation and limited synteny. Current identification methods rely on either extensive manual curation or availability of high-throughput small RNA sequencing data, limiting large-scale comparative studies. We now present FlaHMM, a hidden Markov model developed to automate genomic annotation of <i>flamenco</i>-like unistrand piRNA clusters in <i>Drosophila</i> species, requiring only a genome assembly and transposon annotations. FlaHMM uses transposable element content across 5- or 10-kb bins, which can be calculated from genome sequence alone, and is thus able to detect candidate piRNA clusters without the need to obtain flies and experimentally perform small RNA sequencing. We show that FlaHMM performs on par with piRNA-guided or manual methods, and thus provides a scalable and efficient approach to piRNA cluster annotation in new genome assemblies. FlaHMM is freely available at https://github.com/Hannon-lab/FlaHMM under an MIT licence.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 3","pages":"lqae119"},"PeriodicalIF":4.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11400887/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142297110","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Correction to 'Clusters of mammalian conserved RNA structures in UTRs associate with RBP binding sites'.","authors":"","doi":"10.1093/nargab/lqae120","DOIUrl":"https://doi.org/10.1093/nargab/lqae120","url":null,"abstract":"<p><p>[This corrects the article DOI: 10.1093/nar/lqae089.].</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 3","pages":"lqae120"},"PeriodicalIF":4.0,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11369695/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142126867","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Machine learning of metabolite-protein interactions from model-derived metabolic phenotypes.","authors":"Mahdis Habibpour, Zahra Razaghi-Moghadam, Zoran Nikoloski","doi":"10.1093/nargab/lqae114","DOIUrl":"10.1093/nargab/lqae114","url":null,"abstract":"<p><p>Unraveling metabolite-protein interactions is key to identifying the mechanisms by which metabolism affects the function of other cellular layers. Despite extensive experimental and computational efforts to identify the regulatory roles of metabolites in interaction with proteins, it remains challenging to achieve a genome-scale coverage of these interactions. Here, we leverage established gold standards for metabolite-protein interactions to train supervised classifiers using features derived from genome-scale metabolic models and matched data on protein abundance and reaction fluxes to distinguish interacting from non-interacting pairs. Through a comprehensive comparative study, we explore the impact of different features and assess the effect of gold standards for non-interacting pairs on the performance of the classifiers. Using data sets from <i>Escherichia coli</i> and <i>Saccharomyces cerevisiae</i>, we demonstrate that the features constructed by integrating fluxomic and proteomic data with metabolic phenotypes predicted from genome-scale metabolic models can be effectively used to train classifiers, accurately predicting metabolite-protein interactions in the context of metabolism. Our results reveal that the high performance of classifiers trained on these features is unaffected by the method used to generate gold standards for non-interacting pairs. Overall, our study introduces valuable features that improve the performance of identifying metabolite-protein interactions in the context of metabolism.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 3","pages":"lqae114"},"PeriodicalIF":4.0,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11369697/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142126868","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Petr Novák, Nina Hoštáková, Pavel Neumann, Jiří Macas
{"title":"DANTE and DANTE_LTR: lineage-centric annotation pipelines for long terminal repeat retrotransposons in plant genomes.","authors":"Petr Novák, Nina Hoštáková, Pavel Neumann, Jiří Macas","doi":"10.1093/nargab/lqae113","DOIUrl":"https://doi.org/10.1093/nargab/lqae113","url":null,"abstract":"<p><p>Long terminal repeat (LTR) retrotransposons constitute a predominant class of repetitive DNA elements in most plant genomes. With the increasing number of sequenced plant genomes, there is an ongoing demand for computational tools facilitating efficient annotation and classification of LTR retrotransposons in plant genome assemblies. Herein, we introduce DANTE, a computational pipeline for Domain-based ANnotation of Transposable Elements, designed for sensitive detection of these elements via their conserved protein domain sequences. The identified protein domains are subsequently inputted into the DANTE_LTR pipeline to annotate complete element sequences by detecting their structural features, such as LTRs, in adjacent genomic regions. Leveraging domain sequences allows for precise classification of elements into phylogenetic lineages, offering a more granular annotation compared with coarser conventional superfamily-based classification methods. The efficiency and accuracy of this approach were evidenced via annotation of LTR retrotransposons in 93 plant genomes. Results were benchmarked against several established pipelines, showing that DANTE_LTR is capable of identifying significantly more intact LTR retrotransposons. DANTE and DANTE_LTR are provided as user-friendly Galaxy tools accessible via a public server (https://repeatexplorer-elixir.cerit-sc.cz), installable on local Galaxy instances from the Galaxy tool shed or executable from the command line.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 3","pages":"lqae113"},"PeriodicalIF":4.0,"publicationDate":"2024-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11358816/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142112831","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Davide Bressan, Daniel Fernández-Pérez, Alessandro Romanel, Fulvio Chiacchiera
{"title":"SpikeFlow: automated and flexible analysis of ChIP-Seq data with spike-in control.","authors":"Davide Bressan, Daniel Fernández-Pérez, Alessandro Romanel, Fulvio Chiacchiera","doi":"10.1093/nargab/lqae118","DOIUrl":"https://doi.org/10.1093/nargab/lqae118","url":null,"abstract":"<p><p>ChIP with reference exogenous genome (ChIP-Rx) is widely used to study histone modification changes across different biological conditions. A key step in the bioinformatics analysis of this data is calculating the normalization factors, which vary from the standard ChIP-seq pipelines. Choosing and applying the appropriate normalization method is crucial for interpreting the biological results. However, a comprehensive pipeline for complete ChIP-Rx data analysis is lacking. To address these challenges, we introduce SpikeFlow, an integrated Snakemake workflow that combines features from various existing tools to streamline ChIP-Rx data processing and enhance usability. SpikeFlow automates spike-in data scaling and provides multiple normalization options. It also performs peak calling and differential analysis with distinct modalities, enabling the detection of enrichment regions for histone modifications and transcription factor binding. Our workflow runs in-depth quality control at all the processing steps and generates an analysis report with tables and graphs to facilitate results interpretation. We validated the pipeline by performing a comparative analysis with DiffBind and SpikChIP, demonstrating robust performances in various biological models. By combining diverse functionalities into a single platform, SpikeFlow aims to simplify ChIP-Rx data analysis for the research community.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 3","pages":"lqae118"},"PeriodicalIF":4.0,"publicationDate":"2024-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11358820/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142112834","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Context-adjusted proportion of singletons (CAPS): a novel metric for assessing negative selection in the human genome.","authors":"Mikhail Gudkov, Loïc Thibaut, Eleni Giannoulatou","doi":"10.1093/nargab/lqae111","DOIUrl":"https://doi.org/10.1093/nargab/lqae111","url":null,"abstract":"<p><p>Interpretation of genetic variants remains challenging, partly due to the lack of well-established ways of determining the potential pathogenicity of genetic variation, especially for understudied classes of variants. Addressing this, population genetics methods offer a practical solution by evaluating variant effects through human population distributions. Negative selection influences the ratio of singleton variants and can serve as a proxy for deleteriousness, as exemplified by the Mutability-Adjusted Proportion of Singletons (MAPS) metric. However, MAPS is sensitive to the calibration of the singletons-by-mutability linear model, which results in biased estimates for certain variant classes. Building up on the methodology used in MAPS, we introduce the Context-Adjusted Proportion of Singletons (CAPS) metric for assessing negative selection in the human genome. CAPS produces corrected estimates with more accurate confidence intervals by eliminating the mutability layer in the model. Retaining the advantageous features of MAPS, CAPS emerges as a robust and reliable tool. We believe that CAPS has the potential to enhance the identification of new disease-variant associations in clinical and research settings, offering improved accuracy in assessing negative selection for diverse SNV classes.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 3","pages":"lqae111"},"PeriodicalIF":4.0,"publicationDate":"2024-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11358819/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142112830","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}