{"title":"Context-adjusted proportion of singletons (CAPS): a novel metric for assessing negative selection in the human genome.","authors":"Mikhail Gudkov, Loïc Thibaut, Eleni Giannoulatou","doi":"10.1093/nargab/lqae111","DOIUrl":"https://doi.org/10.1093/nargab/lqae111","url":null,"abstract":"<p><p>Interpretation of genetic variants remains challenging, partly due to the lack of well-established ways of determining the potential pathogenicity of genetic variation, especially for understudied classes of variants. Addressing this, population genetics methods offer a practical solution by evaluating variant effects through human population distributions. Negative selection influences the ratio of singleton variants and can serve as a proxy for deleteriousness, as exemplified by the Mutability-Adjusted Proportion of Singletons (MAPS) metric. However, MAPS is sensitive to the calibration of the singletons-by-mutability linear model, which results in biased estimates for certain variant classes. Building on the methodology used in MAPS, we introduce the Context-Adjusted Proportion of Singletons (CAPS) metric for assessing negative selection in the human genome. CAPS produces corrected estimates with more accurate confidence intervals by eliminating the mutability layer in the model. Retaining the advantageous features of MAPS, CAPS emerges as a robust and reliable tool. 
We believe that CAPS has the potential to enhance the identification of new disease-variant associations in clinical and research settings, offering improved accuracy in assessing negative selection for diverse SNV classes.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 3","pages":"lqae111"},"PeriodicalIF":4.0,"publicationDate":"2024-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11358819/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142112830","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Progerin mRNA expression in non-HGPS patients is correlated with widespread shifts in transcript isoforms.","authors":"Reynold Yu, Huijing Xue, Wanru Lin, Francis S Collins, Stephen M Mount, Kan Cao","doi":"10.1093/nargab/lqae115","DOIUrl":"https://doi.org/10.1093/nargab/lqae115","url":null,"abstract":"<p><p>Hutchinson-Gilford Progeria Syndrome (HGPS) is a premature aging disease caused primarily by a C1824T mutation in <i>LMNA</i>. This mutation activates a cryptic splice donor site, producing a lamin variant called progerin. Interestingly, progerin has also been detected in cells and tissues of non-HGPS patients. Here, we investigated progerin expression using publicly available RNA-seq data from non-HGPS patients in the GTEx project. We found that progerin expression is present across all tissue types in non-HGPS patients and correlated with telomere shortening in the skin. Transcriptome-wide correlation analyses suggest that the level of progerin expression is correlated with switches in gene isoform expression patterns. Differential expression analyses show that progerin expression is correlated with significant changes in genes involved in splicing regulation and mitochondrial function. Interestingly, 5' splice sites whose use is correlated with progerin expression have significantly altered frequencies of consensus trinucleotides within the core 5' splice site. Furthermore, introns whose alternative splicing correlates with progerin have reduced GC content. 
Our study suggests that progerin expression in non-HGPS patients is part of a global shift in splicing patterns.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 3","pages":"lqae115"},"PeriodicalIF":4.0,"publicationDate":"2024-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11358823/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142112833","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep learning and direct sequencing of labeled RNA captures transcriptome dynamics.","authors":"Vlastimil Martinek, Jessica Martin, Cedric Belair, Matthew J Payea, Sulochan Malla, Panagiotis Alexiou, Manolis Maragkakis","doi":"10.1093/nargab/lqae116","DOIUrl":"10.1093/nargab/lqae116","url":null,"abstract":"<p><p>In eukaryotes, genes produce a variety of distinct RNA isoforms, each with potentially unique protein products, coding potential or regulatory signals such as poly(A) tail and nucleotide modifications. Assessing the kinetics of RNA isoform metabolism, such as transcription and decay rates, is essential for unraveling gene regulation. However, it is currently impeded by a lack of methods that can differentiate between individual isoforms. Here, we introduce RNAkinet, a deep convolutional and recurrent neural network, to detect nascent RNA molecules following metabolic labeling with the nucleoside analog 5-ethynyl uridine and long-read, direct RNA sequencing with nanopores. RNAkinet processes electrical signals from nanopore sequencing directly and distinguishes nascent from pre-existing RNA molecules. Our results show that RNAkinet prediction performance generalizes in various cell types and organisms and can be used to quantify RNA isoform half-lives. 
RNAkinet is expected to enable the identification of the kinetic parameters of RNA isoforms and to facilitate studies of RNA metabolism and the regulatory elements that influence it.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 3","pages":"lqae116"},"PeriodicalIF":4.0,"publicationDate":"2024-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11358824/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142112832","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Long non-coding RNAs involved in <i>Drosophila</i> development and regeneration.","authors":"Carlos Camilleri-Robles, Raziel Amador, Marcel Tiebe, Aurelio A Teleman, Florenci Serras, Roderic Guigó, Montserrat Corominas","doi":"10.1093/nargab/lqae091","DOIUrl":"10.1093/nargab/lqae091","url":null,"abstract":"<p><p>The discovery of functional long non-coding RNAs (lncRNAs) overturned their initial characterization as transcriptional noise. LncRNAs have been identified as regulators of multiple biological processes, including chromatin structure, gene expression, splicing, mRNA degradation, and translation. However, functional studies of lncRNAs are hindered by the usual lack of phenotypes upon deletion or inhibition. Here, we used <i>Drosophila</i> imaginal discs as a model system to identify lncRNAs involved in development and regeneration. We examined a subset of lncRNAs expressed during wing, leg, and eye disc development. Additionally, we analyzed transcriptomic data from regenerating wing discs to profile the expression pattern of lncRNAs during tissue repair. We focused on the lncRNA <i>CR40469</i>, which is upregulated during regeneration. We generated <i>CR40469</i> mutant flies that developed normally but showed impaired wing regeneration upon cell death induction. The ability of these mutants to regenerate was restored by the ectopic expression of <i>CR40469</i>. Furthermore, we found that the lncRNA <i>CR34335</i> has a high degree of sequence similarity with <i>CR40469</i> and can partially compensate for its function during regeneration in the absence of <i>CR40469</i>. 
Our findings point to a potential role of the lncRNA <i>CR40469</i> in <i>trans</i> during the response to damage in the wing imaginal disc.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 3","pages":"lqae091"},"PeriodicalIF":4.0,"publicationDate":"2024-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11327875/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142000843","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Methods for evaluating unsupervised vector representations of genomic regions.","authors":"Guangtao Zheng, Julia Rymuza, Erfaneh Gharavi, Nathan J LeRoy, Aidong Zhang, Nathan C Sheffield","doi":"10.1093/nargab/lqae086","DOIUrl":"10.1093/nargab/lqae086","url":null,"abstract":"<p><p>Representation learning models have become a mainstay of modern genomics. These models are trained to yield vector representations, or embeddings, of various biological entities, such as cells, genes, individuals, or genomic regions. Recent applications of unsupervised embedding approaches have been shown to learn relationships among genomic regions that define functional elements in a genome. Unsupervised representation learning of genomic regions is free of the supervision from curated metadata and can condense rich biological knowledge from publicly available data to region embeddings. However, there exists no method for evaluating the quality of these embeddings in the absence of metadata, making it difficult to assess the reliability of analyses based on the embeddings, and to tune model training to yield optimal results. To bridge this gap, we propose four evaluation metrics: the cluster tendency score (CTS), the reconstruction score (RCS), the genome distance scaling score (GDSS), and the neighborhood preserving score (NPS). The CTS and RCS statistically quantify how well region embeddings can be clustered and how well the embeddings preserve information in training data. The GDSS and NPS exploit the biological tendency of regions close in genomic space to have similar biological functions; they measure how much such information is captured by individual region embeddings in a set. 
We demonstrate the utility of these statistical and biological scores for evaluating unsupervised genomic region embeddings and provide guidelines for learning reliable embeddings.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 3","pages":"lqae086"},"PeriodicalIF":4.0,"publicationDate":"2024-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11316252/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141917561","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"junctionCounts: comprehensive alternative splicing analysis and prediction of isoform-level impacts to the coding sequence.","authors":"Alexander J Ritter, Andrew Wallace, Neda Ronaghi, Jeremy R Sanford","doi":"10.1093/nargab/lqae093","DOIUrl":"10.1093/nargab/lqae093","url":null,"abstract":"<p><p>Alternative splicing (AS) is emerging as an important regulatory process for complex biological processes. Transcriptomic studies therefore commonly involve the identification and quantification of alternative processing events, but the need for predicting the functional consequences of changes to the relative inclusion of alternative events remains largely unaddressed. Many tools exist for the former task, albeit each constrained to its own event type definitions. Few tools exist for the latter task, each with significant limitations. To address these issues, we developed junctionCounts, which captures both simple and complex pairwise AS events and quantifies them with straightforward exon-exon and exon-intron junction reads in RNA-seq data, performing competitively among similar tools in terms of sensitivity, false discovery rate and quantification accuracy. Its partner utility, cdsInsertion, identifies transcript coding sequence (CDS) information via <i>in silico</i> translation from annotated start codons, including the presence of premature termination codons. Finally, findSwitchEvents connects AS events with CDS information to predict the impact of individual events on the isoform-level CDS. 
We used junctionCounts to characterize splicing dynamics and NMD regulation during neuronal differentiation across four primates, demonstrating junctionCounts' capacity to robustly characterize AS in a variety of organisms and to predict its effect on mRNA isoform fate.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 3","pages":"lqae093"},"PeriodicalIF":4.0,"publicationDate":"2024-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11310779/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141917559","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Clusters of mammalian conserved RNA structures in UTRs associate with RBP binding sites.","authors":"Veerendra P Gadekar, Alexander Welford Munk, Milad Miladi, Alexander Junge, Rolf Backofen, Stefan E Seemann, Jan Gorodkin","doi":"10.1093/nargab/lqae089","DOIUrl":"10.1093/nargab/lqae089","url":null,"abstract":"<p><p>RNA secondary structures play essential roles in the formation of the tertiary structure and function of a transcript. Recent genome-wide studies highlight significant potential for RNA structures in the mammalian genome. However, a major challenge is assigning functional roles to these structured RNAs. In this study, we conduct a guilt-by-association analysis of clusters of computationally predicted conserved RNA structures (CRSs) in human untranslated regions (UTRs) to associate them with gene functions. We filtered a broad pool of ∼500 000 human CRSs for UTR overlap, resulting in 4734 and 24 754 CRSs from the 5' and 3' UTR of protein-coding genes, respectively. We separately clustered these CRSs for both sets using RNAscClust, obtaining 793 and 2403 clusters, with an average of five CRSs per cluster. We identified overrepresented binding sites for 60 and 43 RNA-binding proteins co-localizing with the clustered CRSs. Furthermore, 104 and 441 clusters from the 5' and 3' UTRs, respectively, showed enrichment for various Gene Ontologies, including biological processes such as 'signal transduction' and 'nervous system development', molecular functions like 'transferase activity' and cellular components such as 'synapse', among others. 
Our study shows that significant functional insights can be gained by clustering RNA structures based on their structural characteristics.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 3","pages":"lqae089"},"PeriodicalIF":4.0,"publicationDate":"2024-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11310781/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141917556","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Phylogenetic distribution of DNA topoisomerase VI and its distinction from SPO11.","authors":"Adam M B Allen, Anthony Maxwell","doi":"10.1093/nargab/lqae085","DOIUrl":"10.1093/nargab/lqae085","url":null,"abstract":"<p><p>DNA topoisomerases (topos) are major targets for antimicrobial and chemotherapeutic drugs due to their fundamental roles in regulating DNA topology. Type II topos are essential for chromosome segregation and relaxing positive DNA supercoils, and are exemplified by topo II in eukaryotes, topo IV and DNA gyrase in bacteria, and topo VI in archaea. Topo VI occurs ubiquitously in plants and sporadically in bacteria, algae, and other protists and is highly homologous to Spo11, which initiates eukaryotic homologous recombination. This homology makes the two complexes difficult to distinguish by sequence and leads to discrepancies such as the identity of the putative topo VI in malarial <i>Plasmodium</i> species. A lack of understanding of the role and distribution of topo VI outside of archaea hampers its pursuit as a potential drug target, and the present study addresses this with an up-to-date and extensive phylogenetic analysis. We show that the A and B subunits of topo VI and Spo11 can be distinguished using phylogenetics and structural modelling, and that topo VI is not present in <i>Plasmodium</i> or in other members of the phylum Apicomplexa. 
These findings provide insights into the evolutionary relationships between topo VI and Spo11, and their adoption alongside other type II topos.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 3","pages":"lqae085"},"PeriodicalIF":4.0,"publicationDate":"2024-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11302465/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141898521","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PANDA-3D: protein function prediction based on AlphaFold models.","authors":"Chenguang Zhao, Tong Liu, Zheng Wang","doi":"10.1093/nargab/lqae094","DOIUrl":"10.1093/nargab/lqae094","url":null,"abstract":"<p><p>Previous protein function predictors primarily make predictions from amino acid sequences instead of tertiary structures because of the limited number of experimentally determined structures and the unsatisfactory quality of predicted structures. AlphaFold recently achieved promising performance when predicting protein tertiary structures, and the AlphaFold protein structure database (AlphaFold DB) is fast-expanding. Therefore, we aimed to develop a deep-learning tool that is specifically trained with AlphaFold models and predicts GO terms from AlphaFold models. We developed an advanced learning architecture by combining geometric vector perceptron graph neural networks and variant transformer decoder layers for multi-label classification. PANDA-3D predicts gene ontology (GO) terms from the predicted structures of AlphaFold and the embeddings of amino acid sequences based on a large language model. Our method significantly outperformed a state-of-the-art deep-learning method that was trained with experimentally determined tertiary structures, and either outperformed or was comparable with several other language-model-based state-of-the-art methods with amino acid sequences as input. PANDA-3D is tailored to AlphaFold models, and the AlphaFold DB currently contains over 200 million predicted protein structures (as of May 1st, 2023), making PANDA-3D a useful tool that can accurately annotate the functions of a large number of proteins. 
PANDA-3D can be freely accessed as a web server from http://dna.cs.miami.edu/PANDA-3D/ and as a repository from https://github.com/zwang-bioinformatics/PANDA-3D.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 3","pages":"lqae094"},"PeriodicalIF":4.0,"publicationDate":"2024-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11302463/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141898488","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evaluating cell type deconvolution in FFPE breast tissue: application to benign breast disease.","authors":"Yuanhang Liu, Robert A Vierkant, Aditya Bhagwate, William A Jons, Melody L Stallings-Mann, Bryan M McCauley, Jodi M Carter, Melissa T Stephens, Michael E Pfrender, Laurie E Littlepage, Derek C Radisky, Julie M Cunningham, Amy C Degnim, Stacey J Winham, Chen Wang","doi":"10.1093/nargab/lqae098","DOIUrl":"https://doi.org/10.1093/nargab/lqae098","url":null,"abstract":"<p><p>Transcriptome profiling using RNA sequencing (RNA-seq) of bulk formalin-fixed paraffin-embedded (FFPE) tissue blocks is a standard method in biomedical research. However, when used on tissues with diverse cell type compositions, it yields averaged gene expression profiles, complicating biomarker identification due to variations in cell proportions. To address the need for optimized strategies for defining individual cell type compositions from bulk FFPE samples, we constructed single-cell RNA-seq reference data for breast tissue and tested cell type deconvolution methods. Initial simulation experiments showed similar performances across multiple commonly used deconvolution methods. However, the introduction of FFPE artifacts significantly impacted their performances, with a root mean squared error (RMSE) ranging between 0.04 and 0.17. Scaden, a deep learning-based method, consistently outperformed the others, demonstrating robustness against FFPE artifacts. Testing these methods on our 62-sample RNA-seq benign breast disease cohort in which cell type composition was estimated using digital pathology approaches, we found that pre-filtering of the reference data enhanced the accuracy of most methods, realizing up to a 32% reduction in RMSE. 
To support further research efforts in this domain, we introduce SCdeconR, an R package designed for streamlined cell type deconvolution assessments and downstream analyses.</p>","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 3","pages":"lqae098"},"PeriodicalIF":4.0,"publicationDate":"2024-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11952925/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143754738","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}