{"title":"ScRNAbox: empowering single-cell RNA sequencing on high performance computing systems.","authors":"Rhalena A Thomas, Michael R Fiorini, Saeid Amiri, Edward A Fon, Sali M K Farhan","doi":"10.1186/s12859-024-05935-y","DOIUrl":"10.1186/s12859-024-05935-y","url":null,"abstract":"<p><strong>Background: </strong>Single-cell RNA sequencing (scRNAseq) offers powerful insights, but the surge in sample sizes demands more computational power than local workstations can provide. Consequently, high-performance computing (HPC) systems have become imperative. Existing web apps designed to analyze scRNAseq data lack scalability and integration capabilities, while analysis packages demand coding expertise, hindering accessibility.</p><p><strong>Results: </strong>In response, we introduce scRNAbox, an innovative scRNAseq analysis pipeline meticulously crafted for HPC systems. This end-to-end solution, executed via the SLURM workload manager, efficiently processes raw data from standard and Hashtag samples. It incorporates quality control filtering, sample integration, clustering, cluster annotation tools, and facilitates cell type-specific differential gene expression analysis between two groups. We demonstrate the application of scRNAbox by analyzing two publicly available datasets.</p><p><strong>Conclusion: </strong>ScRNAbox is a comprehensive end-to-end pipeline designed to streamline the processing and analysis of scRNAseq data. 
By responding to the pressing demand for a user-friendly, HPC solution, scRNAbox bridges the gap between the growing computational demands of scRNAseq analysis and the coding expertise required to meet them.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"25 1","pages":"319"},"PeriodicalIF":2.9,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11443813/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142360992","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient and low-complexity variable-to-variable length coding for DNA storage.","authors":"Yunfei Gao, Albert No","doi":"10.1186/s12859-024-05943-y","DOIUrl":"10.1186/s12859-024-05943-y","url":null,"abstract":"<p><strong>Background: </strong>Efficient DNA-based storage systems offer substantial capacity and longevity at reduced costs, addressing anticipated data growth. However, encoding data into DNA sequences is limited by two key constraints: 1) a maximum of h consecutive identical bases (homopolymer constraint h), and 2) a GC ratio between <math><mrow><mo>[</mo> <mn>0.5</mn> <mo>-</mo> <msub><mi>c</mi> <mrow><mi>GC</mi></mrow> </msub> <mo>,</mo> <mn>0.5</mn> <mo>+</mo> <msub><mi>c</mi> <mrow><mi>GC</mi></mrow> </msub> <mo>]</mo></mrow> </math> (GC content constraint <math><msub><mi>c</mi> <mrow><mi>GC</mi></mrow> </msub> </math> ). Sequencing or synthesis errors tend to increase when these constraints are violated.</p><p><strong>Results: </strong>In this research, we address a pure source coding problem in the context of DNA storage, considering both homopolymer and GC content constraints. We introduce a novel coding technique that adheres to these constraints while maintaining linear complexity for increased block lengths and achieving near-optimal rates. We demonstrate the effectiveness of the proposed method through experiments on both randomly generated data and existing files. For example, when <math><mrow><mi>h</mi> <mo>=</mo> <mn>4</mn></mrow> </math> and <math> <mrow><msub><mi>c</mi> <mrow><mi>GC</mi></mrow> </msub> <mo>=</mo> <mn>0.05</mn></mrow> </math> , the rate reached 1.988, close to the theoretical limit of 1.990. 
The associated code can be accessed at GitHub.</p><p><strong>Conclusion: </strong>We propose a variable-to-variable-length encoding method that does not rely on concatenating short predefined sequences, which achieves near-optimal rates.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"25 1","pages":"320"},"PeriodicalIF":2.9,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11446080/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142360990","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
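The two constraints named in the abstract above can be made concrete with a small checker. This is a minimal sketch, not the authors' encoder: the function names `max_homopolymer_run`, `gc_ratio`, and `satisfies_constraints` are hypothetical, and the defaults h = 4 and c_GC = 0.05 are simply the values used in the abstract's example.

```python
def max_homopolymer_run(seq: str) -> int:
    """Return the length of the longest run of identical consecutive bases."""
    longest = 0
    current = 0
    previous = None
    for base in seq:
        # Extend the current run if the base repeats, otherwise start a new run.
        current = current + 1 if base == previous else 1
        longest = max(longest, current)
        previous = base
    return longest


def gc_ratio(seq: str) -> float:
    """Return the fraction of bases that are G or C (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def satisfies_constraints(seq: str, h: int = 4, c_gc: float = 0.05) -> bool:
    """Check both constraints: at most h identical consecutive bases,
    and a GC ratio inside [0.5 - c_gc, 0.5 + c_gc]."""
    tolerance = 1e-12  # guards against floating-point rounding at the interval edges
    run_ok = max_homopolymer_run(seq) <= h
    gc_ok = abs(gc_ratio(seq) - 0.5) <= c_gc + tolerance
    return run_ok and gc_ok
```

A checker like this only validates an encoded sequence; it says nothing about how the proposed variable-to-variable length code reaches its reported rate.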
{"title":"SimplySmart_v1, a new tool for the analysis of DNA damage optimized in primary neuronal cultures.","authors":"Sushma Koirala, Harman Sharma, Yee Lian Chew, Anna Konopka","doi":"10.1186/s12859-024-05947-8","DOIUrl":"10.1186/s12859-024-05947-8","url":null,"abstract":"<p><strong>Background: </strong>The increased interest in research on DNA damage in neurodegeneration has created a need for the development of tools dedicated to the analysis of DNA damage in neurons. Double-stranded breaks (DSBs) are among the most detrimental types of DNA damage and have become a subject of intensive research. DSBs result in DNA damage foci, which are detectable with the marker γH2AX. Manual counting of DNA damage foci is challenging and biased, and there is a lack of open-source programs optimized specifically for neurons. Thus, we developed a new, fully automated application, SimplySmart_v1, for DNA damage quantification and optimized its performance specifically in primary neurons cultured in vitro.</p><p><strong>Results: </strong>SimplySmart_v1 accurately identifies the induction of DNA damage by etoposide in primary neurons relative to control neurons. It also accurately quantifies DNA damage in the desired fraction of cells and processes a batch of images within a few seconds. SimplySmart_v1 was also capable of quantifying DNA damage effectively regardless of the cell type (neuron or NSC-34). The comparative analysis of SimplySmart_v1 with other open-source tools, such as Fiji, CellProfiler and Focinator, revealed that SimplySmart_v1 is the most 'user-friendly' and the quickest of these tools and provides highly accurate results free of variability between measurements. 
In the context of neurodegenerative research, SimplySmart_v1 revealed an increase in DNA damage in primary neurons expressing abnormal TAR DNA/RNA binding protein (TDP-43).</p><p><strong>Conclusions: </strong>These findings showed that SimplySmart_v1 is a new and effective tool for research on DNA damage and can successfully replace other available software.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"25 1","pages":"318"},"PeriodicalIF":2.9,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11443846/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142360993","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Predicting RNA sequence-structure likelihood via structure-aware deep learning.","authors":"You Zhou, Giulia Pedrielli, Fei Zhang, Teresa Wu","doi":"10.1186/s12859-024-05916-1","DOIUrl":"10.1186/s12859-024-05916-1","url":null,"abstract":"<p><strong>Background: </strong>The active functionalities of RNA are recognized to be heavily dependent on its structure and sequence. Therefore, a model that can accurately evaluate a design given an RNA sequence-structure pair would be a valuable tool for many researchers. Machine learning methods have been explored to develop such tools, showing promising results. However, two key issues remain. Firstly, the performance of machine learning models is affected by the features used to characterize RNA. Currently, there is no consensus on which features are the most effective for characterizing RNA sequence-structure pairs. Secondly, most existing machine learning methods extract features describing the entire RNA molecule. We argue that it is essential to define additional features that characterize nucleotides and specific sections of RNA structure to enhance the overall efficacy of the RNA design process.</p><p><strong>Results: </strong>We develop two deep learning models for evaluating RNA sequence-secondary structure pairs. The first model, NU-ResNet, uses a convolutional neural network architecture that solves the aforementioned problems by explicitly encoding RNA sequence-structure information into a 3D matrix. Building upon NU-ResNet, our second model, NUMO-ResNet, incorporates additional information derived from the characterizations of RNA, specifically the 2D folding motifs. In this work, we introduce an automated method to extract these motifs based on fundamental secondary structure descriptions. We evaluate the performance of both models on an independent testing dataset. Our proposed models outperform models from the literature on this independent testing dataset. 
To assess the robustness of our models, we conduct 10-fold cross validation. To evaluate the generalization ability of NU-ResNet and NUMO-ResNet across different RNA families, we train and test our proposed models on different RNA families. Our proposed models show superior performance compared to models from the literature when tested across different independent RNA families.</p><p><strong>Conclusions: </strong>In this study, we propose two deep learning models, NU-ResNet and NUMO-ResNet, to evaluate RNA sequence-secondary structure pairs. These two models expand the field of data-driven approaches for learning RNA. Furthermore, these two models provide a new method for encoding RNA sequence-secondary structure pairs.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"25 1","pages":"316"},"PeriodicalIF":2.9,"publicationDate":"2024-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11443715/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142341011","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FindCSV: a long-read based method for detecting complex structural variations.","authors":"Yan Zheng, Xuequn Shang","doi":"10.1186/s12859-024-05937-w","DOIUrl":"https://doi.org/10.1186/s12859-024-05937-w","url":null,"abstract":"<p><strong>Background: </strong>Structural variations play a significant role in genetic diseases and evolutionary mechanisms. Extensive research has been conducted over the past decade to detect simple structural variations, leading to the development of well-established detection methods. However, recent studies have highlighted the potentially greater impact of complex structural variations on individuals compared to simple structural variations. Despite this, the field still lacks precise detection methods specifically designed for complex structural variations. Therefore, the development of a highly efficient and accurate detection method is of utmost importance.</p><p><strong>Result: </strong>In response to this need, we propose a novel method called FindCSV, which leverages deep learning techniques and consensus sequences to enhance the detection of SVs using long-read sequencing data. Compared to current methods, FindCSV performs better in detecting complex and simple structural variations.</p><p><strong>Conclusions: </strong>FindCSV is a new method to detect complex and simple structural variations with reasonable accuracy in real and simulated data. 
The source code for the program is available at https://github.com/nwpuzhengyan/FindCSV .</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"25 1","pages":"315"},"PeriodicalIF":2.9,"publicationDate":"2024-09-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11439270/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142341006","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mugen-UMAP: UMAP visualization and clustering of mutated genes in single-cell DNA sequencing data.","authors":"Teng Li, Yiran Zou, Xianghan Li, Thomas K F Wong, Allen G Rodrigo","doi":"10.1186/s12859-024-05928-x","DOIUrl":"https://doi.org/10.1186/s12859-024-05928-x","url":null,"abstract":"<p><strong>Background: </strong>The application of Uniform Manifold Approximation and Projection (UMAP) for dimensionality reduction and visualization has revolutionized the analysis of single-cell RNA expression and population genetics. However, its potential in single-cell DNA sequencing data analysis, particularly for visualizing gene mutation information, has not been fully explored.</p><p><strong>Results: </strong>We introduce Mugen-UMAP, a novel Python-based program that extends UMAP's utility to single-cell DNA sequencing data. This innovative tool provides a comprehensive pipeline for processing gene annotation files of single-cell somatic single-nucleotide variants and metadata to the visualization of UMAP projections for identifying clusters, along with various statistical analyses. Employing Mugen-UMAP, we analyzed whole-exome sequencing data from 365 single-cell samples across 12 non-small cell lung cancer (NSCLC) patients, revealing distinct clusters associated with histological subtypes of NSCLC. Moreover, to demonstrate the general utility of Mugen-UMAP, we applied the program to 9 additional single-cell WES datasets from various cancer types, uncovering interesting patterns of cell clusters that warrant further investigation. In summary, Mugen-UMAP provides a quick and effective visualization method to uncover cell cluster patterns based on the gene mutation information from single-cell DNA sequencing data.</p><p><strong>Conclusions: </strong>The application of Mugen-UMAP demonstrates its capacity to provide valuable insights into the visualization and interpretation of single-cell DNA sequencing data. 
Mugen-UMAP can be found at https://github.com/tengchn/Mugen-UMAP.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"25 1","pages":"308"},"PeriodicalIF":2.9,"publicationDate":"2024-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11437917/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142341010","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using flux theory in dynamic omics data sets to identify differentially changing signals using DPoP.","authors":"Harley Edwards, Joseph Zavorskas, Walker Huso, Alexander G Doan, Caton Silbiger, Steven Harris, Ranjan Srivastava, Mark R Marten","doi":"10.1186/s12859-024-05938-9","DOIUrl":"https://doi.org/10.1186/s12859-024-05938-9","url":null,"abstract":"<p><strong>Background: </strong>Derivative profiling is a novel approach to identify differential signals from dynamic omics data sets. This approach applies variable step-size differentiation to time dynamic omics data. This work assumes that there is a general omics derivative that is a useful and descriptive feature of dynamic omics experiments. We assert that this omics derivative, or omics flux, is a valuable descriptor that can be used instead of, or with, fold change calculations.</p><p><strong>Results: </strong>The results of derivative profiling are compared to established methods such as Multivariate Adaptive Regression Splines, significance versus fold change analysis (Volcano), and an adjusted ratio over intensity (M/A) analysis to find that there is a statistically significant similarity between the results. This comparison is repeated for transcriptomic and phosphoproteomic expression profiles previously characterized in Aspergillus nidulans. This method has been packaged in an open-source, GUI-based MATLAB app, the Derivative Profiling omics Package (DPoP). Gene Ontology (GO) term enrichment has been included in the app so that a user can automatically/programmatically describe the over/under-represented GO terms in the derivative profiling results using domain specific knowledge found in their organism's specific GO database file. 
The advantage of the DPoP analysis is that it is computationally inexpensive, it does not require fold change calculations, it describes both instantaneous as well as overall behavior, and it achieves statistical confidence with signal trajectories of a single bio-replicate over four or more points.</p><p><strong>Conclusions: </strong>While we apply this method to time dynamic transcriptomic and phosphoproteomic datasets, it is a numerically generalizable technique that can be applied to any organism and any field interested in time series data analysis. The app described in this work enables omics researchers with no computer science background to apply derivative profiling to their data sets, while also allowing multidisciplined users to build on the nascent idea of profiling derivatives in omics.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"25 1","pages":"312"},"PeriodicalIF":2.9,"publicationDate":"2024-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11437665/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142341013","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
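The "variable step-size differentiation" mentioned in the abstract above can be illustrated with plain finite differences over unevenly spaced time points. This is a rough sketch only: DPoP itself is a MATLAB app and its exact numerical scheme is not given here, so the function name `finite_difference_slopes` and the choice of simple forward differences are assumptions made for illustration.

```python
def finite_difference_slopes(times, values):
    """Return the slope between each pair of consecutive time points.

    The step size (times[i + 1] - times[i]) may differ between intervals,
    so each slope is divided by its own interval length.
    """
    if len(times) != len(values):
        raise ValueError("times and values must have the same length")
    if len(times) < 2:
        raise ValueError("at least two time points are required")

    slopes = []
    for i in range(len(times) - 1):
        dt = times[i + 1] - times[i]
        if dt <= 0:
            raise ValueError("times must be strictly increasing")
        slopes.append((values[i + 1] - values[i]) / dt)
    return slopes


# Hypothetical signal sampled at uneven intervals (arbitrary units).
example_times = [0, 1, 3, 7]
example_values = [0.0, 2.0, 6.0, 6.0]
example_slopes = finite_difference_slopes(example_times, example_values)
```

With four time points this yields three interval slopes, which describe how fast the signal changes in each interval rather than only its overall fold change.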
{"title":"LOCC: a novel visualization and scoring of cutoffs for continuous variables with hepatocellular carcinoma prognosis as an example.","authors":"George Luo, Toby Chen, John J Letterio","doi":"10.1186/s12859-024-05932-1","DOIUrl":"10.1186/s12859-024-05932-1","url":null,"abstract":"<p><strong>Background: </strong>The interpretation of large datasets, such as The Cancer Genome Atlas (TCGA), for scientific and research purposes, remains challenging despite their public availability. In this study, we focused on identifying gene expression profiles most relevant to patient prognosis and aimed to develop a method and database to address this issue. To achieve this, we introduced Luo's Optimization Categorization Curve (LOCC), an innovative tool for visualizing and scoring continuous variables against dichotomous outcomes. To demonstrate the efficacy of LOCC using real-world data, we analyzed gene expression profiles and patient data from TCGA hepatocellular carcinoma samples.</p><p><strong>Results: </strong>To showcase LOCC, we demonstrate an optimal cutoff for E2F1 expression in hepatocellular carcinoma, which was subsequently validated in an independent cohort. Compared to ROC curves and their AUC, LOCC offered a superior description of the predictive value of E2F1 expression across various cancer types. The LOCC score, comprised of factors representing significance, range, and impact of the biomarker, facilitated the ranking of all gene expression profiles in hepatocellular carcinoma, aiding in the evaluation and understanding of previously published prognostic gene signatures. We also demonstrate that LOCC does not have the same assumptions required of Cox proportional hazards modeling for accurate analysis. Repeated sampling demonstrated that LOCC scores outperformed ROC's AUC in discriminating predictors from non-predictors. 
Additionally, gene set enrichment analysis revealed significant associations between certain genes and prognosis, such as E2F target genes and G2M checkpoint with poor prognosis, and bile acid metabolism and oxidative phosphorylation with good prognosis.</p><p><strong>Conclusion: </strong>In summary, we present LOCC as a novel visualization tool for the analysis of gene expression in cancer, particularly for understanding and selecting cutoffs. Our findings suggest that LOCC scores, which effectively rank genes based on their prognostic potential, represent a more suitable approach than ROC curves and Cox proportional hazard for prognostic modeling and understanding in cancer gene expression analysis. LOCC holds promise as an invaluable tool for advancing precision medicine and furthering biomarker research. Further research regarding multivariable integration and validation will help LOCC reach its full potential and establish its utility across diverse cancer types and clinical settings.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"25 1","pages":"314"},"PeriodicalIF":2.9,"publicationDate":"2024-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11438210/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142341008","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modelling cell type-specific lncRNA regulatory network in autism with Cycle.","authors":"Chenchen Xiong, Mingfang Zhang, Haolin Yang, Xuemei Wei, Chunwen Zhao, Junpeng Zhang","doi":"10.1186/s12859-024-05933-0","DOIUrl":"https://doi.org/10.1186/s12859-024-05933-0","url":null,"abstract":"<p><strong>Background: </strong>Autism spectrum disorder (ASD) is a class of complex neurodevelopmental disorders with high genetic heterogeneity. Long non-coding RNAs (lncRNAs) are vital regulators that perform specific functions within diverse cell types and play pivotal roles in neurological diseases including ASD. Therefore, exploring lncRNA regulation would contribute to deciphering ASD molecular mechanisms. Existing computational methods utilize bulk transcriptomics data to identify lncRNA regulation across all samples, which could reveal the commonalities of lncRNA regulation in ASD, but ignore the specificity of lncRNA regulation across various cell types.</p><p><strong>Results: </strong>Here, we present Cycle (Cell type-specific lncRNA regulatory network) to construct the landscape of cell type-specific lncRNA regulation in ASD. We have found that each ASD cell type is unique in lncRNA regulation, and more than one-third and all cell type-specific lncRNA regulatory networks are characterized as scale-free and small-world, respectively. Across 17 ASD cell types, we have discovered 19 rewired and 11 stable modules, along with eight rewired and three stable hubs within the constructed cell type-specific lncRNA regulatory networks. Enrichment analysis reveals that the discovered rewired and stable modules and hubs are closely related to ASD. Furthermore, more similar ASD cell types tend to be connected with higher strength in the constructed cell similarity network. 
Finally, the comparison results demonstrate that Cycle is a potential method for uncovering cell type-specific lncRNA regulation.</p><p><strong>Conclusion: </strong>Overall, these results illustrate that Cycle is a promising method to model the landscape of cell type-specific lncRNA regulation, and provides insights into understanding the heterogeneity of lncRNA regulation between various ASD cell types.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"25 1","pages":"307"},"PeriodicalIF":2.9,"publicationDate":"2024-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11430139/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142341009","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GenRCA: a user-friendly rare codon analysis tool for comprehensive evaluation of codon usage preferences based on coding sequences in genomes.","authors":"Kunjie Fan, Yuanyuan Li, Zhiwei Chen, Long Fan","doi":"10.1186/s12859-024-05934-z","DOIUrl":"https://doi.org/10.1186/s12859-024-05934-z","url":null,"abstract":"<p><strong>Background: </strong>The study of codon usage bias is important for understanding gene expression, evolution and gene design, providing critical insights into the molecular processes that govern the function and regulation of genes. Codon Usage Bias (CUB) indices are valuable metrics for understanding codon usage patterns across different organisms without extensive experiments. Considering that there is no one-size-fits-all index for all species, a comprehensive platform supporting the calculation and analysis of multiple CUB indices for codon optimization is greatly needed.</p><p><strong>Results: </strong>Here, we release GenRCA, an updated version of our previous Rare Codon Analysis Tool, as a free and user-friendly website for all-inclusive evaluation of codon usage preferences of coding sequences. 
In this study, we manually reviewed and implemented up to 31 codon preference indices, with 65 expression host organisms covered and batch processing of multiple gene sequences supported, aiming to improve the user experience and provide more comprehensive and efficient analysis.</p><p><strong>Conclusions: </strong>Our website fills a gap in the availability of comprehensive tools for species-specific CUB calculations, enabling researchers to thoroughly assess the protein expression level based on a comprehensive list of 31 indices and further guide the codon optimization.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"25 1","pages":"309"},"PeriodicalIF":2.9,"publicationDate":"2024-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11438159/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142341007","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}