Franco Liberati, Taiel Maximiliano Pose Marino, Paolo Bottoni, Daniele Canestrelli, Tiziana Castrignanò
HPC-T-Assembly: a pipeline for de novo transcriptome assembly of large multi-specie datasets.
BMC Bioinformatics, vol. 26, no. 1, article 113. Published 2025-04-28.
DOI: https://doi.org/10.1186/s12859-025-06121-4
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12039220/pdf/
Abstract
Background: Recent years have seen a substantial increase in RNA-seq data production, and the technique has become the primary approach for gene expression studies across a wide range of non-model organisms. Most of these organisms lack a well-annotated reference genome to serve as a basis for studying differentially expressed genes (DEGs). As a cost-effective alternative to a reference genome, raw RNA-seq reads are assembled into a 'de novo transcriptome', which serves as the reference for subsequent DEG analysis. In conventional DEG analysis pipelines for non-model organisms, this assembly step is computationally expensive. Furthermore, the complexity of de novo transcriptome assembly workflows makes it difficult for researchers to apply best-practice techniques and the most recent software versions, particularly when several organisms of interest are involved.
Results: To address the computational challenges of transcriptomic analyses of non-model organisms, we present HPC-T-Assembly, a tool for de novo transcriptome assembly from RNA-seq data on high-performance computing (HPC) infrastructures. It is designed for straightforward setup via a Web-oriented interface that allows the analysis to be configured for several species. Once the configuration data are provided, the entire parallel computing software for assembly is generated automatically and can be launched on a supercomputer with a simple command line. The pipeline's intermediate and final outputs include the results of additional post-processing steps, such as assembly quality control, ORF prediction, and transcript count matrix construction.
Conclusion: Through a user-friendly Web-oriented interface, HPC-T-Assembly lets users configure a run that assembles RNA-seq data from multiple species simultaneously. The parallel pipeline, launched on HPC infrastructures, significantly reduces computational load and execution times, enabling large-scale transcriptomic and meta-transcriptomic analysis projects.
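The abstract says that a per-species configuration is turned into parallel software that can be launched with one command. The sketch below illustrates that general idea only and is not HPC-T-Assembly's actual code. It assumes a SLURM scheduler (the abstract does not name one) and renders a job-array script with one array task per species. The `SpeciesConfig` fields, the `render_job_array` function, and the `./assemble_one_species.sh` wrapper are all hypothetical names introduced for illustration.

```python
import shlex
from dataclasses import dataclass
from typing import List


@dataclass
class SpeciesConfig:
    """Hypothetical per-species entry, as a configuration form might collect it."""
    name: str         # species label
    left_reads: str   # path to forward reads (FASTQ)
    right_reads: str  # path to reverse reads (FASTQ)


def _bash_array(var: str, values: List[str]) -> str:
    """Render a bash array assignment with shell-quoted elements."""
    return f"{var}=(" + " ".join(shlex.quote(v) for v in values) + ")"


def render_job_array(species: List[SpeciesConfig], cpus: int = 16,
                     assemble_cmd: str = "./assemble_one_species.sh") -> str:
    """Render a SLURM job-array script with one array task per species.

    `assemble_cmd` is a placeholder for whatever wrapper runs the
    assembler; it is an assumption, not a real tool.
    """
    if not species:
        raise ValueError("at least one species is required")
    lines = [
        "#!/bin/bash",
        "#SBATCH --job-name=denovo_assembly",
        f"#SBATCH --array=0-{len(species) - 1}",
        f"#SBATCH --cpus-per-task={cpus}",
        "",
        "# One entry per species, indexed by the array task ID.",
        _bash_array("NAMES", [s.name for s in species]),
        _bash_array("LEFT", [s.left_reads for s in species]),
        _bash_array("RIGHT", [s.right_reads for s in species]),
        "",
        "i=$SLURM_ARRAY_TASK_ID",
        f'{assemble_cmd} "${{NAMES[$i]}}" "${{LEFT[$i]}}" '
        f'"${{RIGHT[$i]}}" "$SLURM_CPUS_PER_TASK"',
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    config = [
        SpeciesConfig("species_a", "reads/a_1.fq", "reads/a_2.fq"),
        SpeciesConfig("species_b", "reads/b_1.fq", "reads/b_2.fq"),
    ]
    print(render_job_array(config, cpus=8))
```

Under these assumptions, the generated text would be saved to a file and submitted with a single `sbatch` call, and the scheduler would then run the species in parallel as independent array tasks.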
About the journal:
BMC Bioinformatics is an open access, peer-reviewed journal that considers articles on all aspects of the development, testing and novel application of computational and statistical methods for the modeling and analysis of all kinds of biological data, as well as other areas of computational biology.
BMC Bioinformatics is part of the BMC series which publishes subject-specific journals focused on the needs of individual research communities across all areas of biology and medicine. We offer an efficient, fair and friendly peer review service, and are committed to publishing all sound science, provided that there is some advance in knowledge presented by the work.