Reproducible processing of TCGA regulatory networks.
Viola Fanfani, Katherine H Shutta, Panagiotis Mandros, Jonas Fischer, Enakshi Saha, Soel Micheletti, Chen Chen, Marouen Ben Guebila, Camila M Lopes-Ramos, John Quackenbush
GigaScience, published 2025-10-20. DOI: 10.1093/gigascience/giaf126 (https://doi.org/10.1093/gigascience/giaf126)
Abstract
Background: Technological advances in sequencing and computation have allowed deep exploration of the molecular basis of diseases. Biological networks have proven to be a valuable framework for analyzing omics data and modeling regulatory interactions between genes and proteins. Large collaborative projects, such as The Cancer Genome Atlas (TCGA), have provided a rich resource for building and validating new computational methods, resulting in a plethora of open-source software for downloading, pre-processing, and analyzing those data. However, for an end-to-end analysis of regulatory networks, a coherent and reusable workflow is essential to integrate all relevant packages into a robust pipeline.
Findings: We developed tcga-data-nf, a Nextflow workflow that allows users to reproducibly infer regulatory networks from the thousands of samples in TCGA using a single command. The workflow can be divided into three main steps: multi-omic data, such as RNA-seq and methylation, are (i) downloaded, (ii) pre-processed, and (iii) analyzed to infer regulatory network models with the Network Zoo. The workflow is powered by the NetworkDataCompanion R package, a standalone collection of functions for managing, mapping, and filtering TCGA data. Here, we demonstrate how the pipeline can be used to investigate the differences between colon cancer subtypes attributed to epigenetic mechanisms. Lastly, we provide a database of pre-generated networks for the 10 most common cancer types that can be readily accessed by the public.
Conclusions: tcga-data-nf is a complete, yet flexible and extensible, framework that enables the reproducible inference and analysis of cancer regulatory networks, bridging a gap in the current universe of software tools for analyzing TCGA data.
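The three steps named in the Findings (download, pre-process, infer a network) can be illustrated with a toy sketch. This is not the tcga-data-nf workflow, the NetworkDataCompanion API, or a Network Zoo method: the function names, gene labels, expression values, and thresholds are all hypothetical, nothing is downloaded, and a simple Pearson-correlation network stands in for the regulatory network inference the abstract describes.

```python
import math


def download():
    """Step (i), stubbed: return a hypothetical gene -> counts table.

    A real run would fetch TCGA data; the values below are made up.
    """
    return {
        "GENE_A": [10.0, 20.0, 30.0, 40.0],
        "GENE_B": [12.0, 22.0, 29.0, 41.0],
        "GENE_C": [40.0, 30.0, 20.0, 10.0],
        "GENE_D": [0.0, 1.0, 0.0, 0.0],
    }


def preprocess(counts, min_mean=1.0):
    """Step (ii): drop genes with mean count below min_mean, then log2(x + 1)."""
    kept = {}
    for gene, values in counts.items():
        if sum(values) / len(values) >= min_mean:
            kept[gene] = [math.log2(v + 1.0) for v in values]
    return kept


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def infer_network(expr, threshold=0.9):
    """Step (iii), stand-in: keep gene pairs whose |correlation| >= threshold."""
    genes = sorted(expr)
    edges = {}
    for i, gene in enumerate(genes):
        for other in genes[i + 1:]:
            r = pearson(expr[gene], expr[other])
            if abs(r) >= threshold:
                edges[(gene, other)] = r
    return edges


def run_pipeline():
    """Chain the three steps, mirroring the single-command idea."""
    return infer_network(preprocess(download()))


if __name__ == "__main__":
    for (gene, other), r in run_pipeline().items():
        print(f"{gene} -- {other}: r = {r:.3f}")
```

Keeping each step as a separate function with a plain dictionary passed between them mirrors, in miniature, how a workflow manager chains independent stages; any one stage could be swapped out without touching the others.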
About the journal:
GigaScience seeks to transform data dissemination and utilization in the life and biomedical sciences. As an online open-access open-data journal, it specializes in publishing "big-data" studies encompassing various fields. Its scope includes not only "omic" type data and the fields of high-throughput biology currently serviced by large public repositories, but also the growing range of more difficult-to-access data, such as imaging, neuroscience, ecology, cohort data, systems biology and other new types of large-scale shareable data.