{"title":"A generalized protein identification method for novel and diverse sequencing technologies.","authors":"Bikash Kumar Bhandari, Nick Goldman","doi":"10.1093/nargab/lqae126","DOIUrl":"https://doi.org/10.1093/nargab/lqae126","url":null,"abstract":"<p><p>Protein sequencing is a rapidly evolving field with much progress towards the realization of a new generation of protein sequencers. The early devices, however, may not be able to reliably discriminate all 20 amino acids, resulting in a partial, noisy and possibly error-prone signature of a protein. Rather than achieving <i>de novo</i> sequencing, these devices may aim to identify target proteins by comparing such signatures to databases of known proteins. However, there are no broadly applicable methods for this identification problem. Here, we devise a hidden Markov model method to study the generalized problem of protein identification from noisy signature data. Based on a hypothetical sequencing device that can simulate several novel technologies, we show that on the human protein database (<i>N</i> = 20 181) our method has a good performance under many different operating conditions such as various levels of signal resolvability, different numbers of discriminated amino acids, sequence fragments, and insertion and deletion error rates. Our results demonstrate the possibility of protein identification with high accuracy on many early experimental devices. 
We anticipate our method to be applicable for a wide range of protein sequencing devices in the future.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 3","pages":"lqae126"},"PeriodicalIF":4.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11409062/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142297107","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evaluation of machine learning models that predict lncRNA subcellular localization.","authors":"Jason R Miller, Weijun Yi, Donald A Adjeroh","doi":"10.1093/nargab/lqae125","DOIUrl":"https://doi.org/10.1093/nargab/lqae125","url":null,"abstract":"<p><p>The lncATLAS database quantifies the relative cytoplasmic versus nuclear abundance of long non-coding RNAs (lncRNAs) observed in 15 human cell lines. The literature describes several machine learning models trained and evaluated on these and similar datasets. These reports showed moderate performance, <i>e.g</i>. 72-74% accuracy, on test subsets of the data withheld from training. In all these reports, the datasets were filtered to include genes with extreme values while excluding genes with values in the middle range and the filters were applied prior to partitioning the data into training and testing subsets. Using several models and lncATLAS data, we show that this 'middle exclusion' protocol boosts performance metrics without boosting model performance on unfiltered test data. We show that various models achieve only about 60% accuracy when evaluated on unfiltered lncRNA data. We suggest that the problem of predicting lncRNA subcellular localization from nucleotide sequences is more challenging than currently perceived. 
We provide a basic model and evaluation procedure as a benchmark for future studies of this problem.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 3","pages":"lqae125"},"PeriodicalIF":4.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11409063/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142297109","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GeneSPIDER2: large scale GRN simulation and benchmarking with perturbed single-cell data.","authors":"Mateusz Garbulowski, Thomas Hillerton, Daniel Morgan, Deniz Seçilmiş, Lisbet Sonnhammer, Andreas Tjärnberg, Torbjörn E M Nordling, Erik L L Sonnhammer","doi":"10.1093/nargab/lqae121","DOIUrl":"https://doi.org/10.1093/nargab/lqae121","url":null,"abstract":"<p><p>Single-cell data is increasingly used for gene regulatory network (GRN) inference, and benchmarks for this have been developed based on simulated data. However, existing single-cell simulators cannot model the effects of gene perturbations. A further challenge lies in generating large-scale GRNs that often struggle with computational and stability issues. We present GeneSPIDER2, an update of the GeneSPIDER MATLAB toolbox for GRN benchmarking, inference, and analysis. Several software modules have improved capabilities and performance, and new functionalities have been added. A major improvement is the ability to generate large GRNs with biologically realistic topological properties in terms of scale-free degree distribution and modularity. Another major addition is a simulation of single-cell data, which is becoming increasingly popular as input for GRN inference. Specifically, we introduced the unique feature to generate single-cell data based on genetic perturbations. 
Finally, the simulated single-cell data was compared to real single-cell Perturb-seq data from two cell lines, showing that the synthetic and real data exhibit similar properties.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 3","pages":"lqae121"},"PeriodicalIF":4.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11409065/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142297111","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Transformer model generated bacteriophage genomes are compositionally distinct from natural sequences.","authors":"Jeremy Ratcliff","doi":"10.1093/nargab/lqae129","DOIUrl":"https://doi.org/10.1093/nargab/lqae129","url":null,"abstract":"<p><p>Novel applications of language models in genomics promise to have a large impact on the field. The megaDNA model is the first publicly available generative model for creating synthetic viral genomes. To evaluate megaDNA's ability to recapitulate the nonrandom genome composition of viruses and assess whether synthetic genomes can be algorithmically detected, compositional metrics for 4969 natural bacteriophage genomes and 1002 <i>de novo</i> synthetic bacteriophage genomes were compared. Transformer-generated sequences had varied but realistic genome lengths, and 58% were classified as viral by geNomad. However, the sequences demonstrated consistent differences in various compositional metrics when compared to natural bacteriophage genomes by rank-sum tests and principal component analyses. A simple neural network trained to detect transformer-generated sequences on global compositional metrics alone displayed a median sensitivity of 93.0% and specificity of 97.9% (<i>n</i> = 12 independent models). Overall, these results demonstrate that megaDNA does not yet generate bacteriophage genomes with realistic compositional biases and that genome composition is a reliable method for detecting sequences generated by this model. 
While the results are specific to the megaDNA model, the evaluated framework described here could be applied to any generative model for genomic sequences.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 3","pages":"lqae129"},"PeriodicalIF":4.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11409064/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142297112","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Correction to 'long non-coding RNAs involved in <i>Drosophila</i> development and regeneration'.","authors":"","doi":"10.1093/nargab/lqae127","DOIUrl":"https://doi.org/10.1093/nargab/lqae127","url":null,"abstract":"<p><p>[This corrects the article DOI: 10.1093/nargab/lqae091.].</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 3","pages":"lqae127"},"PeriodicalIF":4.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11400925/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142297108","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FlaHMM: unistrand <i>flamenco</i>-like piRNA cluster prediction in <i>Drosophila</i> species using hidden Markov models.","authors":"Maria-Anna Trapotsi, Jasper van Lopik, Gregory J Hannon, Benjamin Czech Nicholson, Susanne Bornelöv","doi":"10.1093/nargab/lqae119","DOIUrl":"10.1093/nargab/lqae119","url":null,"abstract":"<p><p>PIWI-interacting RNAs (piRNAs) are a class of small non-coding RNAs that are essential for transposon control in animal gonads. In <i>Drosophila</i> ovarian somatic cells, piRNAs are transcribed from large genomic regions called piRNA clusters, which are enriched for transposon fragments and act as a memory of past invasions. Despite being widely present across <i>Drosophila</i> species, somatic piRNA clusters are difficult to identify and study due to their lack of sequence conservation and limited synteny. Current identification methods rely on either extensive manual curation or availability of high-throughput small RNA sequencing data, limiting large-scale comparative studies. We now present FlaHMM, a hidden Markov model developed to automate genomic annotation of <i>flamenco</i>-like unistrand piRNA clusters in <i>Drosophila</i> species, requiring only a genome assembly and transposon annotations. FlaHMM uses transposable element content across 5- or 10-kb bins, which can be calculated from genome sequence alone, and is thus able to detect candidate piRNA clusters without the need to obtain flies and experimentally perform small RNA sequencing. We show that FlaHMM performs on par with piRNA-guided or manual methods, and thus provides a scalable and efficient approach to piRNA cluster annotation in new genome assemblies. 
FlaHMM is freely available at https://github.com/Hannon-lab/FlaHMM under an MIT licence.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 3","pages":"lqae119"},"PeriodicalIF":4.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11400887/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142297110","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Correction to 'Clusters of mammalian conserved RNA structures in UTRs associate with RBP binding sites'.","authors":"","doi":"10.1093/nargab/lqae120","DOIUrl":"https://doi.org/10.1093/nargab/lqae120","url":null,"abstract":"<p><p>[This corrects the article DOI: 10.1093/nar/lqae089.].</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 3","pages":"lqae120"},"PeriodicalIF":4.0,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11369695/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142126867","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Machine learning of metabolite-protein interactions from model-derived metabolic phenotypes.","authors":"Mahdis Habibpour, Zahra Razaghi-Moghadam, Zoran Nikoloski","doi":"10.1093/nargab/lqae114","DOIUrl":"10.1093/nargab/lqae114","url":null,"abstract":"<p><p>Unraveling metabolite-protein interactions is key to identifying the mechanisms by which metabolism affects the function of other cellular layers. Despite extensive experimental and computational efforts to identify the regulatory roles of metabolites in interaction with proteins, it remains challenging to achieve a genome-scale coverage of these interactions. Here, we leverage established gold standards for metabolite-protein interactions to train supervised classifiers using features derived from genome-scale metabolic models and matched data on protein abundance and reaction fluxes to distinguish interacting from non-interacting pairs. Through a comprehensive comparative study, we explore the impact of different features and assess the effect of gold standards for non-interacting pairs on the performance of the classifiers. Using data sets from <i>Escherichia coli</i> and <i>Saccharomyces cerevisiae</i>, we demonstrate that the features constructed by integrating fluxomic and proteomic data with metabolic phenotypes predicted from genome-scale metabolic models can be effectively used to train classifiers, accurately predicting metabolite-protein interactions in the context of metabolism. Our results reveal that the high performance of classifiers trained on these features is unaffected by the method used to generate gold standards for non-interacting pairs. 
Overall, our study introduces valuable features that improve the performance of identifying metabolite-protein interactions in the context of metabolism.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 3","pages":"lqae114"},"PeriodicalIF":4.0,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11369697/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142126868","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DANTE and DANTE_LTR: lineage-centric annotation pipelines for long terminal repeat retrotransposons in plant genomes.","authors":"Petr Novák, Nina Hoštáková, Pavel Neumann, Jiří Macas","doi":"10.1093/nargab/lqae113","DOIUrl":"https://doi.org/10.1093/nargab/lqae113","url":null,"abstract":"<p><p>Long terminal repeat (LTR) retrotransposons constitute a predominant class of repetitive DNA elements in most plant genomes. With the increasing number of sequenced plant genomes, there is an ongoing demand for computational tools facilitating efficient annotation and classification of LTR retrotransposons in plant genome assemblies. Herein, we introduce DANTE, a computational pipeline for Domain-based ANnotation of Transposable Elements, designed for sensitive detection of these elements via their conserved protein domain sequences. The identified protein domains are subsequently inputted into the DANTE_LTR pipeline to annotate complete element sequences by detecting their structural features, such as LTRs, in adjacent genomic regions. Leveraging domain sequences allows for precise classification of elements into phylogenetic lineages, offering a more granular annotation compared with coarser conventional superfamily-based classification methods. The efficiency and accuracy of this approach were evidenced via annotation of LTR retrotransposons in 93 plant genomes. Results were benchmarked against several established pipelines, showing that DANTE_LTR is capable of identifying significantly more intact LTR retrotransposons. 
DANTE and DANTE_LTR are provided as user-friendly Galaxy tools accessible via a public server (https://repeatexplorer-elixir.cerit-sc.cz), installable on local Galaxy instances from the Galaxy tool shed or executable from the command line.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 3","pages":"lqae113"},"PeriodicalIF":4.0,"publicationDate":"2024-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11358816/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142112831","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SpikeFlow: automated and flexible analysis of ChIP-Seq data with spike-in control.","authors":"Davide Bressan, Daniel Fernández-Pérez, Alessandro Romanel, Fulvio Chiacchiera","doi":"10.1093/nargab/lqae118","DOIUrl":"https://doi.org/10.1093/nargab/lqae118","url":null,"abstract":"<p><p>ChIP with reference exogenous genome (ChIP-Rx) is widely used to study histone modification changes across different biological conditions. A key step in the bioinformatics analysis of this data is calculating the normalization factors, which vary from the standard ChIP-seq pipelines. Choosing and applying the appropriate normalization method is crucial for interpreting the biological results. However, a comprehensive pipeline for complete ChIP-Rx data analysis is lacking. To address these challenges, we introduce SpikeFlow, an integrated Snakemake workflow that combines features from various existing tools to streamline ChIP-Rx data processing and enhance usability. SpikeFlow automates spike-in data scaling and provides multiple normalization options. It also performs peak calling and differential analysis with distinct modalities, enabling the detection of enrichment regions for histone modifications and transcription factor binding. Our workflow runs in-depth quality control at all the processing steps and generates an analysis report with tables and graphs to facilitate results interpretation. We validated the pipeline by performing a comparative analysis with DiffBind and SpikChIP, demonstrating robust performances in various biological models. 
By combining diverse functionalities into a single platform, SpikeFlow aims to simplify ChIP-Rx data analysis for the research community.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 3","pages":"lqae118"},"PeriodicalIF":4.0,"publicationDate":"2024-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11358820/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142112834","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}