为生物医学工作负载优化高性能计算系统。

IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum : [proceedings]. IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum Pub Date : 2020-05-01 Epub Date: 2020-07-28 DOI:10.1109/ipdpsw50202.2020.00040

Patricia Kovatch, Lili Gai, Hyung Min Cho, Eugene Fluder, Dansha Jiang

{"title":"为生物医学工作负载优化高性能计算系统。","authors":"Patricia Kovatch, Lili Gai, Hyung Min Cho, Eugene Fluder, Dansha Jiang","doi":"10.1109/ipdpsw50202.2020.00040","DOIUrl":null,"url":null,"abstract":"The productivity of computational biologists is limited by the speed of their workflows and subsequent overall job throughput. Because most biomedical researchers are focused on better understanding scientific phenomena rather than developing and optimizing code, a computing and data system implemented in an adventitious and/or non-optimized manner can impede the progress of scientific discovery. In our experience, most computational, life-science applications do not generally leverage the full capabilities of high-performance computing, so tuning a system for these applications is especially critical. To optimize a system effectively, systems staff must understand the effects of the applications on the system. Effective stewardship of the system includes an analysis of the impact of the applications on the compute cores, file system, resource manager and queuing policies. The resulting improved system design, and enactment of a sustainability plan, help to enable a long-term resource for productive computational and data science. We present a case study of a typical biomedical computational workload at a leading academic medical center supporting over $100 million per year in computational biology research. Over the past eight years, our high-performance computing system has enabled over 900 biomedical publications in four major areas: genetics and population analysis, gene expression, machine learning, and structural and chemical biology. We have upgraded the system several times in response to trends, actual usage, and user feedback. Major components crucial to this evolution include scheduling structure and policies, memory size, compute type and speed, parallel file system capabilities, and deployment of cloud technologies. We evolved a 70 teraflop machine to a 1.4 petaflop machine in seven years and grew our user base nearly 10-fold. For long-term stability and sustainability, we established a chargeback fee structure. Our overarching guiding principle for each progression has been to increase scientific throughput and enable enhanced scientific fidelity with minimal impact to existing user workflows or code. This highly-constrained system optimization has presented unique challenges, leading us to adopt new approaches to provide constructive pathways forward. We share our practical strategies resulting from our ongoing growth and assessments.","PeriodicalId":90848,"journal":{"name":"IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum : [proceedings]. IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum","volume":"2020 ","pages":"183-192"},"PeriodicalIF":0.0000,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7575271/pdf/nihms-1635815.pdf","citationCount":"0","resultStr":"{\"title\":\"Optimizing High-Performance Computing Systems for Biomedical Workloads.\",\"authors\":\"Patricia Kovatch, Lili Gai, Hyung Min Cho, Eugene Fluder, Dansha Jiang\",\"doi\":\"10.1109/ipdpsw50202.2020.00040\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The productivity of computational biologists is limited by the speed of their workflows and subsequent overall job throughput. Because most biomedical researchers are focused on better understanding scientific phenomena rather than developing and optimizing code, a computing and data system implemented in an adventitious and/or non-optimized manner can impede the progress of scientific discovery. In our experience, most computational, life-science applications do not generally leverage the full capabilities of high-performance computing, so tuning a system for these applications is especially critical. To optimize a system effectively, systems staff must understand the effects of the applications on the system. Effective stewardship of the system includes an analysis of the impact of the applications on the compute cores, file system, resource manager and queuing policies. The resulting improved system design, and enactment of a sustainability plan, help to enable a long-term resource for productive computational and data science. We present a case study of a typical biomedical computational workload at a leading academic medical center supporting over $100 million per year in computational biology research. Over the past eight years, our high-performance computing system has enabled over 900 biomedical publications in four major areas: genetics and population analysis, gene expression, machine learning, and structural and chemical biology. We have upgraded the system several times in response to trends, actual usage, and user feedback. Major components crucial to this evolution include scheduling structure and policies, memory size, compute type and speed, parallel file system capabilities, and deployment of cloud technologies. We evolved a 70 teraflop machine to a 1.4 petaflop machine in seven years and grew our user base nearly 10-fold. For long-term stability and sustainability, we established a chargeback fee structure. Our overarching guiding principle for each progression has been to increase scientific throughput and enable enhanced scientific fidelity with minimal impact to existing user workflows or code. This highly-constrained system optimization has presented unique challenges, leading us to adopt new approaches to provide constructive pathways forward. We share our practical strategies resulting from our ongoing growth and assessments.\",\"PeriodicalId\":90848,\"journal\":{\"name\":\"IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum : [proceedings]. IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum\",\"volume\":\"2020 \",\"pages\":\"183-192\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-05-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7575271/pdf/nihms-1635815.pdf\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum : [proceedings]. IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ipdpsw50202.2020.00040\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2020/7/28 0:00:00\",\"PubModel\":\"Epub\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum : [proceedings]. IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ipdpsw50202.2020.00040","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2020/7/28 0:00:00","PubModel":"Epub","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

计算生物学家的工作效率受限于他们的工作流程速度和随后的整体工作吞吐量。由于大多数生物医学研究人员都专注于更好地理解科学现象，而不是开发和优化代码，因此，计算和数据系统如果以不恰当和/或未优化的方式实施，就会阻碍科学发现的进展。根据我们的经验，大多数计算型生命科学应用通常不会充分利用高性能计算的全部功能，因此针对这些应用调整系统尤为重要。要有效优化系统，系统工作人员必须了解应用程序对系统的影响。有效的系统管理包括分析应用程序对计算核心、文件系统、资源管理器和队列策略的影响。由此改进的系统设计和制定的可持续发展计划有助于为富有成效的计算和数据科学提供长期资源。我们介绍了一个典型的生物医学计算工作量案例研究，该案例发生在一个领先的学术医学中心，每年支持超过 1 亿美元的计算生物学研究。在过去八年中，我们的高性能计算系统在遗传学和群体分析、基因表达、机器学习以及结构和化学生物学四大领域发表了 900 多篇生物医学论文。根据发展趋势、实际使用情况和用户反馈，我们对系统进行了多次升级。对这一演变至关重要的主要组成部分包括调度结构和策略、内存大小、计算类型和速度、并行文件系统功能以及云技术的部署。我们在七年内将 70 teraflop 的机器进化为 1.4 petaflop 的机器，用户数量增长了近 10 倍。为了实现长期稳定和可持续发展，我们建立了收费结构。我们每次进步的总体指导原则都是在对现有用户工作流程或代码影响最小的情况下，提高科学吞吐量，增强科学保真度。这种高度受限的系统优化带来了独特的挑战，促使我们采用新的方法来提供建设性的前进道路。我们将与大家分享在不断发展和评估过程中形成的实用策略。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Optimizing High-Performance Computing Systems for Biomedical Workloads.

The productivity of computational biologists is limited by the speed of their workflows and subsequent overall job throughput. Because most biomedical researchers are focused on better understanding scientific phenomena rather than developing and optimizing code, a computing and data system implemented in an adventitious and/or non-optimized manner can impede the progress of scientific discovery. In our experience, most computational, life-science applications do not generally leverage the full capabilities of high-performance computing, so tuning a system for these applications is especially critical. To optimize a system effectively, systems staff must understand the effects of the applications on the system. Effective stewardship of the system includes an analysis of the impact of the applications on the compute cores, file system, resource manager and queuing policies. The resulting improved system design, and enactment of a sustainability plan, help to enable a long-term resource for productive computational and data science. We present a case study of a typical biomedical computational workload at a leading academic medical center supporting over $100 million per year in computational biology research. Over the past eight years, our high-performance computing system has enabled over 900 biomedical publications in four major areas: genetics and population analysis, gene expression, machine learning, and structural and chemical biology. We have upgraded the system several times in response to trends, actual usage, and user feedback. Major components crucial to this evolution include scheduling structure and policies, memory size, compute type and speed, parallel file system capabilities, and deployment of cloud technologies. We evolved a 70 teraflop machine to a 1.4 petaflop machine in seven years and grew our user base nearly 10-fold. For long-term stability and sustainability, we established a chargeback fee structure. Our overarching guiding principle for each progression has been to increase scientific throughput and enable enhanced scientific fidelity with minimal impact to existing user workflows or code. This highly-constrained system optimization has presented unique challenges, leading us to adopt new approaches to provide constructive pathways forward. We share our practical strategies resulting from our ongoing growth and assessments.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum : [proceedings]. IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum

自引率

0.00%

发文量