Scaling Synthetic Brain Data Generation
Authors: Mike Doan; Sergey Plis
Journal: IEEE Journal of Biomedical and Health Informatics, vol. 29, no. 2, pp. 840-847
Publication date: 2024-12-19
DOI: 10.1109/JBHI.2024.3520156
URL: https://ieeexplore.ieee.org/document/10807405/
Citations: 0
Abstract
The limited availability of diverse, high-quality datasets is a significant challenge in applying deep learning to neuroimaging research. Synthetic data generation can potentially address this issue, but on-the-fly generation is computationally demanding, while training on pre-generated data is inflexible and may incur high storage costs. We introduce Wirehead, a scalable in-memory data pipeline that significantly improves the performance of on-the-fly synthetic data generation for deep learning in neuroimaging. Wirehead's architecture decouples data generation from training by running multiple generators in independent parallel processes, so throughput scales near-linearly with the number of generators. It efficiently handles terabytes of data using MongoDB, greatly reducing storage costs that would otherwise be prohibitive. The robust, modular design enables flexible pipeline configurations and fault-tolerant operation. We evaluated Wirehead with SynthSeg, a synthetic brain segmentation data generation tool that requires 7 days to train a model. When deployed in parallel, Wirehead achieved a near-linear 15.7x increase in throughput with 16 generators. With 20 generators, we can train a model in 9 hours instead of 7 days. This demonstrates Wirehead's ability to greatly accelerate experimentation cycles. While Wirehead represents a substantial step forward, it also reveals opportunities for future research in optimizing generation-training balance and resource allocation. Its ability to facilitate distributed deep learning has significant implications for enabling more ambitious neuroimaging research.
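The "near-linear" claim can be checked against the figures quoted in the abstract by computing parallel efficiency, that is, speedup divided by the number of generators. The sketch below is illustrative only: the function name `parallel_efficiency` is hypothetical, and it assumes the 7-day baseline and the 9-hour run refer to the same training workload, which the abstract implies but does not spell out.

```python
def parallel_efficiency(speedup: float, workers: int) -> float:
    """Fraction of ideal linear scaling achieved (1.0 means perfectly linear)."""
    if workers <= 0:
        raise ValueError("workers must be positive")
    return speedup / workers


# Figures taken from the abstract.
throughput_speedup_16 = 15.7   # reported throughput gain with 16 generators
baseline_hours = 7 * 24        # 7 days of training, expressed in hours
wirehead_hours = 9             # reported training time with 20 generators

# Assumption: both timings describe the same model and workload,
# so their ratio can be read as an end-to-end training speedup.
training_speedup_20 = baseline_hours / wirehead_hours

eff_16 = parallel_efficiency(throughput_speedup_16, 16)
eff_20 = parallel_efficiency(training_speedup_20, 20)

print(f"16 generators: {throughput_speedup_16:.1f}x throughput, efficiency {eff_16:.0%}")
print(f"20 generators: {training_speedup_20:.1f}x faster training, efficiency {eff_20:.0%}")
```

Under these assumptions, the 16-generator throughput result corresponds to about 98% of ideal linear scaling, and the 20-generator training time to about 93%, which is consistent with the abstract's description of near-linear gains.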
About the journal:
IEEE Journal of Biomedical and Health Informatics publishes original papers presenting recent advances where information and communication technologies intersect with health, healthcare, life sciences, and biomedicine. Topics include acquisition, transmission, storage, retrieval, management, and analysis of biomedical and health information. The journal covers applications of information technologies in healthcare, patient monitoring, preventive care, early disease diagnosis, therapy discovery, and personalized treatment protocols. It explores electronic medical and health records, clinical information systems, decision support systems, medical and biological imaging informatics, wearable systems, body area/sensor networks, and more. Integration-related topics like interoperability, evidence-based medicine, and secure patient data are also addressed.