Reynold Yu, Huijing Xue, Wanru Lin, Francis S Collins, Stephen M Mount, Kan Cao
{"title":"Progerin mRNA expression in non-HGPS patients is correlated with widespread shifts in transcript isoforms.","authors":"Reynold Yu, Huijing Xue, Wanru Lin, Francis S Collins, Stephen M Mount, Kan Cao","doi":"10.1093/nargab/lqae115","DOIUrl":"https://doi.org/10.1093/nargab/lqae115","url":null,"abstract":"<p><p>Hutchinson-Gilford Progeria Syndrome (HGPS) is a premature aging disease caused primarily by a C1824T mutation in <i>LMNA</i>. This mutation activates a cryptic splice donor site, producing a lamin variant called progerin. Interestingly, progerin has also been detected in cells and tissues of non-HGPS patients. Here, we investigated progerin expression using publicly available RNA-seq data from non-HGPS patients in the GTEx project. We found that progerin expression is present across all tissue types in non-HGPS patients and correlated with telomere shortening in the skin. Transcriptome-wide correlation analyses suggest that the level of progerin expression is correlated with switches in gene isoform expression patterns. Differential expression analyses show that progerin expression is correlated with significant changes in genes involved in splicing regulation and mitochondrial function. Interestingly, 5' splice sites whose use is correlated with progerin expression have significantly altered frequencies of consensus trinucleotides within the core 5' splice site. Furthermore, introns whose alternative splicing correlates with progerin have reduced GC content. Our study suggests that progerin expression in non-HGPS patients is part of a global shift in splicing patterns.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 3","pages":"lqae115"},"PeriodicalIF":4.0,"publicationDate":"2024-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11358823/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142112833","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep learning and direct sequencing of labeled RNA captures transcriptome dynamics.","authors":"Vlastimil Martinek, Jessica Martin, Cedric Belair, Matthew J Payea, Sulochan Malla, Panagiotis Alexiou, Manolis Maragkakis","doi":"10.1093/nargab/lqae116","DOIUrl":"10.1093/nargab/lqae116","url":null,"abstract":"<p><p>In eukaryotes, genes produce a variety of distinct RNA isoforms, each with potentially unique protein products, coding potential or regulatory signals such as poly(A) tail and nucleotide modifications. Assessing the kinetics of RNA isoform metabolism, such as transcription and decay rates, is essential for unraveling gene regulation. However, it is currently impeded by lack of methods that can differentiate between individual isoforms. Here, we introduce RNAkinet, a deep convolutional and recurrent neural network, to detect nascent RNA molecules following metabolic labeling with the nucleoside analog 5-ethynyl uridine and long-read, direct RNA sequencing with nanopores. RNAkinet processes electrical signals from nanopore sequencing directly and distinguishes nascent from pre-existing RNA molecules. Our results show that RNAkinet prediction performance generalizes in various cell types and organisms and can be used to quantify RNA isoform half-lives. RNAkinet is expected to enable the identification of the kinetic parameters of RNA isoforms and to facilitate studies of RNA metabolism and the regulatory elements that influence it.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 3","pages":"lqae116"},"PeriodicalIF":4.0,"publicationDate":"2024-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11358824/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142112832","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Carlos Camilleri-Robles, Raziel Amador, Marcel Tiebe, Aurelio A Teleman, Florenci Serras, Roderic Guigó, Montserrat Corominas
{"title":"Long non-coding RNAs involved in <i>Drosophila</i> development and regeneration.","authors":"Carlos Camilleri-Robles, Raziel Amador, Marcel Tiebe, Aurelio A Teleman, Florenci Serras, Roderic Guigó, Montserrat Corominas","doi":"10.1093/nargab/lqae091","DOIUrl":"10.1093/nargab/lqae091","url":null,"abstract":"<p><p>The discovery of functional long non-coding RNAs (lncRNAs) changed their initial concept as transcriptional noise. LncRNAs have been identified as regulators of multiple biological processes, including chromatin structure, gene expression, splicing, mRNA degradation, and translation. However, functional studies of lncRNAs are hindered by the usual lack of phenotypes upon deletion or inhibition. Here, we used <i>Drosophila</i> imaginal discs as a model system to identify lncRNAs involved in development and regeneration. We examined a subset of lncRNAs expressed in the wing, leg, and eye disc development. Additionally, we analyzed transcriptomic data from regenerating wing discs to profile the expression pattern of lncRNAs during tissue repair. We focused on the lncRNA <i>CR40469</i>, which is upregulated during regeneration. We generated <i>CR40469</i> mutant flies that developed normally but showed impaired wing regeneration upon cell death induction. The ability of these mutants to regenerate was restored by the ectopic expression of <i>CR40469</i>. Furthermore, we found that the lncRNA <i>CR34335</i> has a high degree of sequence similarity with <i>CR40469</i> and can partially compensate for its function during regeneration in the absence of <i>CR40469</i>. Our findings point to a potential role of the lncRNA <i>CR40469</i> in <i>trans</i> during the response to damage in the wing imaginal disc.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 3","pages":"lqae091"},"PeriodicalIF":4.0,"publicationDate":"2024-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11327875/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142000843","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Guangtao Zheng, Julia Rymuza, Erfaneh Gharavi, Nathan J LeRoy, Aidong Zhang, Nathan C Sheffield
{"title":"Methods for evaluating unsupervised vector representations of genomic regions.","authors":"Guangtao Zheng, Julia Rymuza, Erfaneh Gharavi, Nathan J LeRoy, Aidong Zhang, Nathan C Sheffield","doi":"10.1093/nargab/lqae086","DOIUrl":"10.1093/nargab/lqae086","url":null,"abstract":"<p><p>Representation learning models have become a mainstay of modern genomics. These models are trained to yield vector representations, or embeddings, of various biological entities, such as cells, genes, individuals, or genomic regions. Recent applications of unsupervised embedding approaches have been shown to learn relationships among genomic regions that define functional elements in a genome. Unsupervised representation learning of genomic regions is free of the supervision from curated metadata and can condense rich biological knowledge from publicly available data to region embeddings. However, there exists no method for evaluating the quality of these embeddings in the absence of metadata, making it difficult to assess the reliability of analyses based on the embeddings, and to tune model training to yield optimal results. To bridge this gap, we propose four evaluation metrics: the cluster tendency score (CTS), the reconstruction score (RCS), the genome distance scaling score (GDSS), and the neighborhood preserving score (NPS). The CTS and RCS statistically quantify how well region embeddings can be clustered and how well the embeddings preserve information in training data. The GDSS and NPS exploit the biological tendency of regions close in genomic space to have similar biological functions; they measure how much such information is captured by individual region embeddings in a set. We demonstrate the utility of these statistical and biological scores for evaluating unsupervised genomic region embeddings and provide guidelines for learning reliable embeddings.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 3","pages":"lqae086"},"PeriodicalIF":4.0,"publicationDate":"2024-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11316252/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141917561","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Alexander J Ritter, Andrew Wallace, Neda Ronaghi, Jeremy R Sanford
{"title":"junctionCounts: comprehensive alternative splicing analysis and prediction of isoform-level impacts to the coding sequence.","authors":"Alexander J Ritter, Andrew Wallace, Neda Ronaghi, Jeremy R Sanford","doi":"10.1093/nargab/lqae093","DOIUrl":"10.1093/nargab/lqae093","url":null,"abstract":"<p><p>Alternative splicing (AS) is emerging as an important regulatory process for complex biological processes. Transcriptomic studies therefore commonly involve the identification and quantification of alternative processing events, but the need for predicting the functional consequences of changes to the relative inclusion of alternative events remains largely unaddressed. Many tools exist for the former task, albeit each constrained to its own event type definitions. Few tools exist for the latter task; each with significant limitations. To address these issues we developed junctionCounts, which captures both simple and complex pairwise AS events and quantifies them with straightforward exon-exon and exon-intron junction reads in RNA-seq data, performing competitively among similar tools in terms of sensitivity, false discovery rate and quantification accuracy. Its partner utility, cdsInsertion, identifies transcript coding sequence (CDS) information via <i>in silico</i> translation from annotated start codons, including the presence of premature termination codons. Finally, findSwitchEvents connects AS events with CDS information to predict the impact of individual events to the isoform-level CDS. We used junctionCounts to characterize splicing dynamics and NMD regulation during neuronal differentiation across four primates, demonstrating junctionCounts' capacity to robustly characterize AS in a variety of organisms and to predict its effect on mRNA isoform fate.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 3","pages":"lqae093"},"PeriodicalIF":4.0,"publicationDate":"2024-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11310779/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141917559","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Veerendra P Gadekar, Alexander Welford Munk, Milad Miladi, Alexander Junge, Rolf Backofen, Stefan E Seemann, Jan Gorodkin
{"title":"Clusters of mammalian conserved RNA structures in UTRs associate with RBP binding sites.","authors":"Veerendra P Gadekar, Alexander Welford Munk, Milad Miladi, Alexander Junge, Rolf Backofen, Stefan E Seemann, Jan Gorodkin","doi":"10.1093/nargab/lqae089","DOIUrl":"10.1093/nargab/lqae089","url":null,"abstract":"<p><p>RNA secondary structures play essential roles in the formation of the tertiary structure and function of a transcript. Recent genome-wide studies highlight significant potential for RNA structures in the mammalian genome. However, a major challenge is assigning functional roles to these structured RNAs. In this study, we conduct a guilt-by-association analysis of clusters of computationally predicted conserved RNA structure (CRSs) in human untranslated regions (UTRs) to associate them with gene functions. We filtered a broad pool of ∼500 000 human CRSs for UTR overlap, resulting in 4734 and 24 754 CRSs from the 5' and 3' UTR of protein-coding genes, respectively. We separately clustered these CRSs for both sets using RNAscClust, obtaining 793 and 2403 clusters, each containing an average of five CRSs per cluster. We identified overrepresented binding sites for 60 and 43 RNA-binding proteins co-localizing with the clustered CRSs. Furthermore, 104 and 441 clusters from the 5' and 3' UTRs, respectively, showed enrichment for various Gene Ontologies, including biological processes such as 'signal transduction', 'nervous system development', molecular functions like 'transferase activity' and the cellular components such as 'synapse' among others. Our study shows that significant functional insights can be gained by clustering RNA structures based on their structural characteristics.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 3","pages":"lqae089"},"PeriodicalIF":4.0,"publicationDate":"2024-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11310781/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141917556","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Phylogenetic distribution of DNA topoisomerase VI and its distinction from SPO11.","authors":"Adam M B Allen, Anthony Maxwell","doi":"10.1093/nargab/lqae085","DOIUrl":"10.1093/nargab/lqae085","url":null,"abstract":"<p><p>DNA topoisomerases (topos) are major targets for antimicrobial and chemotherapeutic drugs due to their fundamental roles in regulating DNA topology. Type II topos are essential for chromosome segregation and relaxing positive DNA supercoils, and are exemplified by topo II in eukaryotes, topo IV and DNA gyrase in bacteria, and topo VI in archaea. Topo VI occurs ubiquitously in plants and sporadically in bacteria, algae, and other protists and is highly homologous to Spo11, which initiates eukaryotic homologous recombination. This homology makes the two complexes difficult to distinguish by sequence and leads to discrepancies such as the identity of the putative topo VI in malarial <i>Plasmodium</i> species. A lack of understanding of the role and distribution of topo VI outside of archaea hampers its pursuit as a potential drug target, and the present study addresses this with an up-to-date and extensive phylogenetic analysis. We show that the A and B subunits of topo VI and Spo11 can be distinguished using phylogenetics and structural modelling, and that topo VI is not present in <i>Plasmodium</i> nor other members of the phylum Apicomplexa. These findings provide insights into the evolutionary relationships between topo VI and Spo11, and their adoption alongside other type II topos.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 3","pages":"lqae085"},"PeriodicalIF":4.0,"publicationDate":"2024-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11302465/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141898521","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PANDA-3D: protein function prediction based on AlphaFold models.","authors":"Chenguang Zhao, Tong Liu, Zheng Wang","doi":"10.1093/nargab/lqae094","DOIUrl":"10.1093/nargab/lqae094","url":null,"abstract":"<p><p>Previous protein function predictors primarily make predictions from amino acid sequences instead of tertiary structures because of the limited number of experimentally determined structures and the unsatisfying qualities of predicted structures. AlphaFold recently achieved promising performances when predicting protein tertiary structures, and the AlphaFold protein structure database (AlphaFold DB) is fast-expanding. Therefore, we aimed to develop a deep-learning tool that is specifically trained with AlphaFold models and predict GO terms from AlphaFold models. We developed an advanced learning architecture by combining geometric vector perceptron graph neural networks and variant transformer decoder layers for multi-label classification. PANDA-3D predicts gene ontology (GO) terms from the predicted structures of AlphaFold and the embeddings of amino acid sequences based on a large language model. Our method significantly outperformed a state-of-the-art deep-learning method that was trained with experimentally determined tertiary structures, and either outperformed or was comparable with several other language-model-based state-of-the-art methods with amino acid sequences as input. PANDA-3D is tailored to AlphaFold models, and the AlphaFold DB currently contains over 200 million predicted protein structures (as of May 1st, 2023), making PANDA-3D a useful tool that can accurately annotate the functions of a large number of proteins. PANDA-3D can be freely accessed as a web server from http://dna.cs.miami.edu/PANDA-3D/ and as a repository from https://github.com/zwang-bioinformatics/PANDA-3D.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 3","pages":"lqae094"},"PeriodicalIF":4.0,"publicationDate":"2024-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11302463/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141898488","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimizing hybrid ensemble feature selection strategies for transcriptomic biomarker discovery in complex diseases.","authors":"Elsa Claude, Mickaël Leclercq, Patricia Thébault, Arnaud Droit, Raluca Uricaru","doi":"10.1093/nargab/lqae079","DOIUrl":"10.1093/nargab/lqae079","url":null,"abstract":"<p><p>Biomedical research takes advantage of omic data, such as transcriptomics, to unravel the complexity of diseases. A conventional strategy identifies transcriptomic biomarkers characterized by expression patterns associated with a phenotype by relying on feature selection approaches. Hybrid ensemble feature selection (HEFS) has become increasingly popular as it ensures robustness of the selected features by performing data and functional perturbations. However, it remains difficult to make the best suited choices at each step when designing such approaches. We conducted an extensive analysis of four possible HEFS scenarios for the identification of Stage IV colorectal, Stage I kidney and lung and Stage III endometrial cancer biomarkers from transcriptomic data. These scenarios investigate the use of two types of feature reduction by filters (differentially expressed genes and variance) conjointly with two types of resampling strategies (repeated holdout by distribution-balanced stratified and random stratified) for downstream feature selection through an aggregation of thousands of wrapped machine learning models. Based on our results, we emphasize the advantages of using HEFS approaches to identify complex disease biomarkers, given their ability to produce generalizable and stable results to both data and functional perturbations. Finally, we highlight critical issues that need to be considered in the design of such strategies.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 3","pages":"lqae079"},"PeriodicalIF":4.0,"publicationDate":"2024-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11237901/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141591574","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nathan J LeRoy, Jason P Smith, Guangtao Zheng, Julia Rymuza, Erfaneh Gharavi, Donald E Brown, Aidong Zhang, Nathan C Sheffield
{"title":"Fast clustering and cell-type annotation of scATAC data using pre-trained embeddings.","authors":"Nathan J LeRoy, Jason P Smith, Guangtao Zheng, Julia Rymuza, Erfaneh Gharavi, Donald E Brown, Aidong Zhang, Nathan C Sheffield","doi":"10.1093/nargab/lqae073","DOIUrl":"10.1093/nargab/lqae073","url":null,"abstract":"<p><p>Data from the single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq) are now widely available. One major computational challenge is dealing with high dimensionality and inherent sparsity, which is typically addressed by producing lower dimensional representations of single cells for downstream clustering tasks. Current approaches produce such individual cell embeddings directly through a one-step learning process. Here, we propose an alternative approach by building embedding models pre-trained on reference data. We argue that this provides a more flexible analysis workflow that also has computational performance advantages through transfer learning. We implemented our approach in scEmbed, an unsupervised machine-learning framework that learns low-dimensional embeddings of genomic regulatory regions to represent and analyze scATAC-seq data. scEmbed performs well in terms of clustering ability and has the key advantage of learning patterns of region co-occurrence that can be transferred to other, unseen datasets. Moreover, models pre-trained on reference data can be exploited to build fast and accurate cell-type annotation systems without the need for other data modalities. scEmbed is implemented in Python and it is available to download from GitHub. We also make our pre-trained models available on huggingface for public use. scEmbed is open source and available at https://github.com/databio/geniml. Pre-trained models from this work can be obtained on huggingface: https://huggingface.co/databio.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 3","pages":"lqae073"},"PeriodicalIF":4.0,"publicationDate":"2024-07-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11224678/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141555620","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}