Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis: Latest Publications

A Quantitative Approach for Adopting Disaggregated Memory in HPC Systems
Jacob Wahlgren, Gabin Schieffer, M. Gokhale, I. Peng
DOI: 10.48550/arXiv.2308.14780 · Published: 2023-08-28 · Citations: 0

Abstract: Memory disaggregation has recently been adopted in data centers to improve resource utilization, motivated by cost and sustainability. Recent studies on large-scale HPC facilities have also highlighted memory underutilization. A promising and non-disruptive option for memory disaggregation is rack-scale memory pooling, where node-local memory is supplemented by shared memory pools. This work outlines the prospects and requirements for adoption and clarifies several misconceptions. We propose a quantitative method for dissecting application requirements on the memory system from the top down in three levels, moving from general, to multi-tier memory systems, and then to memory pooling. We provide a multi-level profiling tool and LBench to facilitate the quantitative approach. We evaluate a set of representative HPC workloads on an emulated platform. Our results show that prefetching activities can significantly influence memory traffic profiles. Interference in memory pooling has varied impacts on applications, depending on their access ratios to memory tiers and arithmetic intensities. Finally, in two case studies, we show the benefits of our findings at the application and system levels, achieving a 50% reduction in remote access and a 13% speedup in BFS, and reducing performance variation of co-located workloads in interference-aware job scheduling.
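The abstract's observation that interference sensitivity depends on an application's access ratio to memory tiers can be illustrated with a toy two-tier latency model. This is an illustrative sketch only, not code or numbers from the paper; all latencies and the interference factor are hypothetical.

```python
# Toy two-tier latency model: why the local-vs-pooled access ratio governs
# how much shared-pool contention hurts a workload. Numbers are hypothetical.

def effective_latency(local_ratio, local_ns=100.0, pool_ns=400.0, interference=1.0):
    """Average access latency given the fraction of accesses served locally.

    `interference` > 1 models contention inflating the shared pool's latency;
    node-local accesses are unaffected by pool contention.
    """
    remote_ratio = 1.0 - local_ratio
    return local_ratio * local_ns + remote_ratio * pool_ns * interference

# A workload serving 90% of accesses locally tolerates 2x pool contention far
# better than one splitting accesses 50/50:
slowdown_90 = effective_latency(0.9, interference=2.0) / effective_latency(0.9)
slowdown_50 = effective_latency(0.5, interference=2.0) / effective_latency(0.5)
print(round(slowdown_90, 2), round(slowdown_50, 2))  # 1.31 1.8
```

Even this crude model reproduces the qualitative result: the same pool contention yields very different slowdowns depending on the tier access ratio, which is why the paper profiles that ratio per application.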
Breaking Boundaries: Distributed Domain Decomposition with Scalable Physics-Informed Neural PDE Solvers
Arthur Feeney, Zitong Li, R. Bostanabad, Aparna Chandramowlishwaran
DOI: 10.1145/3581784.3613217 · Published: 2023-08-28 · Citations: 0

Abstract: Mosaic Flow is a novel domain decomposition method designed to scale physics-informed neural PDE solvers to large domains. Its unique approach leverages pre-trained networks on small domains to solve partial differential equations on large domains purely through inference, resulting in high reusability. This paper presents an end-to-end parallelization of Mosaic Flow, combining data parallel training and domain parallelism for inference on large-scale problems. By optimizing the network architecture and data parallel training, we significantly reduce the training time for learning the Laplacian operator to minutes on 32 GPUs. Moreover, our distributed domain decomposition algorithm enables scalable inference for solving the Laplace equation on domains 4096× larger than the training domain, demonstrating strong scaling while maintaining accuracy on 32 GPUs. The reusability of Mosaic Flow, combined with the improved performance achieved through the distributed-memory algorithms, makes it a promising tool for modeling complex physical phenomena and accelerating scientific discovery.
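The core idea in the abstract, solving a large domain purely by repeatedly applying a solver trained on a small domain over overlapping subdomains, can be sketched in one dimension. This is an illustrative Schwarz-style sketch, not Mosaic Flow itself: a closed-form 1-D Laplace solve stands in for the pre-trained network, and all sizes are arbitrary.

```python
import numpy as np

def local_solve(u, lo, hi):
    # Stand-in for a pre-trained small-domain solver: exactly solves the 1-D
    # Laplace equation on u[lo:hi], taking u[lo] and u[hi-1] as boundary data
    # (the harmonic solution is the straight line between them).
    u[lo:hi] = np.linspace(u[lo], u[hi - 1], hi - lo)

def decompose_and_solve(n=65, sub=17, overlap=8, sweeps=500):
    """Solve Laplace's equation on [0, 1] with u(0)=0, u(1)=1 by sweeping the
    small-domain solver over overlapping subdomains until it converges."""
    u = np.zeros(n)
    u[-1] = 1.0  # Dirichlet boundary values: u[0]=0, u[-1]=1
    for _ in range(sweeps):
        lo = 0
        while lo + sub <= n:
            local_solve(u, lo, lo + sub)
            lo += sub - overlap
        local_solve(u, n - sub, n)  # ensure the right edge is covered
    return u

u = decompose_and_solve()
exact = np.linspace(0.0, 1.0, 65)
print(np.abs(u - exact).max())  # approaches 0: converges to the exact solution
```

The overlap is what lets boundary information propagate between subdomains across sweeps; the paper's contribution is distributing exactly this kind of iteration, with a neural solver, across many GPUs.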
Toward Exascale Computation for Turbomachinery Flows
Yuhang Fu, Weiqi Shen, J. Cui, Yao Zheng, Guangwen Yang, Zhao Liu, Jifa Zhang, Tingwei Ji, Fangfang Xie, Xiaojing Lv, Hanyue Liu, Xu Liu, Xiyang Liu, Xiaoyu Song, Guocheng Tao, Yan Yan, P. Tucker, Steven A. E. Miller, Shirui Luo, S. Koric, Weimin Zheng
DOI: 10.1145/3581784.3627040 · Published: 2023-08-12 · Citations: 0

Abstract: A state-of-the-art large eddy simulation code has been developed to solve compressible flows in turbomachinery. The code has been engineered with a high degree of scalability, enabling it to effectively leverage the many-core architecture of the new Sunway system. A consistent performance of 115.8 DP-PFLOPs has been achieved on a high-pressure turbine cascade consisting of over 1.69 billion mesh elements and 865 billion degrees of freedom (DOFs). By leveraging a high-order unstructured solver and its portability to large heterogeneous parallel systems, we have progressed towards solving the grand challenge problem outlined by NASA [1], which involves a time-dependent simulation of a complete engine, incorporating all the aerodynamic and heat transfer components.
TANGO: re-thinking quantization for graph neural network training on GPUs
Shiyang Chen, Da Zheng, Caiwen Ding, Chengying Huan, Yuede Ji, Hang Liu
DOI: 10.48550/arXiv.2308.00890 · Published: 2023-08-02 · Citations: 1

Abstract: Graph learning is becoming increasingly popular due to its superior performance in tackling many grand challenges. While quantization is widely used to accelerate Graph Neural Network (GNN) computation, quantized training faces remarkable roadblocks. Current quantized GNN training systems often experience longer training time than their full-precision counterparts for two reasons: (i) addressing the quantization accuracy challenge leads to excessive overhead, and (ii) the optimization potential exposed by quantization is not adequately leveraged. This paper introduces Tango, which re-thinks quantization challenges and opportunities for graph neural network training on GPUs with three contributions. First, we introduce efficient rules to maintain accuracy during quantized GNN training. Second, we design and implement quantization-aware primitives and inter-primitive optimizations to speed up GNN training. Finally, we integrate Tango with the popular Deep Graph Library (DGL) system and demonstrate its superior performance over state-of-the-art approaches on various GNN models and datasets.
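The trade-off Tango navigates, compressing tensors to cut memory traffic while bounding the accuracy loss, is easiest to see in the basic quantize/dequantize step. This is an illustrative sketch of generic symmetric per-tensor int8 quantization, not Tango's actual rules or GPU kernels.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor int8 quantization: returns (q, scale) with
    q = round(x / scale) clipped to the int8 range."""
    scale = max(float(np.abs(x).max()) / 127.0, 1e-12)  # guard against all-zero input
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Quantizing node features shrinks them 4x (int8 vs float32), at the cost of
# a bounded rounding error of at most half a quantization step per element.
rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 8)).astype(np.float32)
q, s = quantize_int8(feats)
err = float(np.abs(dequantize(q, s) - feats).max())
print(q.nbytes, feats.nbytes)  # 32 128
```

The 4x traffic reduction is the "optimization potential" the abstract refers to; the overhead of the quantize/dequantize steps themselves is why naive quantized training can end up slower than full precision.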
DistTGL: Distributed Memory-Based Temporal Graph Neural Network Training
Hongkuan Zhou, Da Zheng, Xiang Song, G. Karypis, V. Prasanna
DOI: 10.48550/arXiv.2307.07649 · Published: 2023-07-14 · Citations: 0

Abstract: Memory-based Temporal Graph Neural Networks are powerful tools in dynamic graph representation learning and have demonstrated superior performance in many real-world applications. However, their node memory favors smaller batch sizes to capture more dependencies in graph events and needs to be maintained synchronously across all trainers. As a result, existing frameworks suffer from accuracy loss when scaling to multiple GPUs. Even worse, the tremendous overhead of synchronizing the node memory makes it impractical to deploy the solution in GPU clusters. In this work, we propose DistTGL, an efficient and scalable solution to train memory-based TGNNs on distributed GPU clusters. DistTGL has three improvements over existing solutions: an enhanced TGNN model, a novel training algorithm, and an optimized system. In experiments, DistTGL achieves near-linear convergence speedup, outperforming the state-of-the-art single-machine method by 14.5% in accuracy and 10.17× in training throughput.
NNQS-Transformer: an Efficient and Scalable Neural Network Quantum States Approach for Ab initio Quantum Chemistry
Yangjun Wu, Chu Guo, Yi Fan, P. Zhou, Honghui Shang
DOI: 10.48550/arXiv.2306.16705 · Published: 2023-06-29 · Citations: 0

Abstract: Neural network quantum states (NNQS) have emerged as a promising candidate for quantum many-body problems, but their practical applications are often hindered by the high cost of sampling and local energy calculation. We develop a high-performance NNQS method for ab initio electronic structure calculations. The major innovations include: (1) a transformer-based architecture as the quantum wave function ansatz; (2) a data-centric parallelization scheme for the variational Monte Carlo (VMC) algorithm which preserves data locality and adapts well to different computing architectures; (3) a parallel batch sampling strategy which reduces the sampling cost and achieves good load balance; and (4) a parallel local energy evaluation scheme which is both memory- and computationally efficient. Studies of real chemical systems demonstrate both the superior accuracy of our method compared to the state of the art and its strong and weak scalability for large molecular systems with up to 120 spin orbitals.
FuzzyFlow: Leveraging Dataflow To Find and Squash Program Optimization Bugs
Philipp Schaad, Timo Schneider, Tal Ben-Nun, A. Calotoiu, A. Ziogas, T. Hoefler
DOI: 10.48550/arXiv.2306.16178 · Published: 2023-06-28 · Citations: 0

Abstract: The current hardware landscape and application scale is driving performance engineers towards writing bespoke optimizations. Verifying such optimizations, and generating minimal failing cases, is important for robustness in the face of changing program conditions, such as inputs and sizes. However, isolating minimal test cases from existing applications and generating new configurations are often difficult due to side effects on the system state, mostly related to dataflow. This paper introduces FuzzyFlow: a fault localization and test case extraction framework designed to test program optimizations. We leverage dataflow program representations to capture a fully reproducible system state and area-of-effect for optimizations, enabling fast checking for semantic equivalence. To reduce testing time, we design an algorithm for minimizing test inputs, trading off memory for recomputation. We demonstrate FuzzyFlow on example use cases in real-world applications where the approach provides up to 528 times faster optimization testing and debugging compared to traditional approaches.
Mirage: Towards Low-interruption Services on Batch GPU Clusters with Reinforcement Learning
Qi-Dong Ding, Pengfei Zheng, Shreyas Kudari, S. Venkataraman, Zhao-jie Zhang
DOI: 10.48550/arXiv.2306.14086 · Published: 2023-06-25 · Citations: 0

Abstract: Accommodating long-running deep learning (DL) training and inference jobs is challenging on GPU clusters that use traditional batch schedulers, such as Slurm. Given fixed wall clock time limits, DL researchers usually need to run a sequence of batch jobs and experience long interruptions on overloaded machines. Such interruptions significantly lower research productivity and QoS for services that are deployed in production. To mitigate these interruption issues, we propose the design of a proactive provisioner and investigate a set of statistical learning and reinforcement learning (RL) techniques, including random forest, XGBoost, Deep Q-Network, and policy gradient. Using production job traces from three GPU clusters, we train each model on a subset of the trace and then evaluate its generality on the remaining validation subset. We introduce Mirage, a Slurm-compatible resource provisioner that integrates the candidate ML methods. Our experiments show that Mirage can reduce interruption by 17--100% and safeguard 23--76% of jobs with zero interruption across varying load levels on the three clusters.
Fine-grained Policy-driven I/O Sharing for Burst Buffers
E. Karrels, Lei Huang, Yuhong Kan, Ishank Arora, Yinzhi Wang, D. Katz, W. Gropp, Zhao Zhang
DOI: 10.48550/arXiv.2306.11615 · Published: 2023-06-20 · Citations: 0

Abstract: A burst buffer is a common method to bridge the performance gap between the I/O needs of modern supercomputing applications and the performance of the shared file system on large-scale supercomputers. However, existing I/O sharing methods require resource isolation, offline profiling, or repeated execution, which significantly limits the utilization and applicability of these systems. Here we present ThemisIO, a policy-driven I/O sharing framework for a remote-shared burst buffer: a dedicated group of I/O nodes, each with a local storage device. ThemisIO preserves high utilization by implementing opportunity fairness, reallocating unused I/O resources to other applications. ThemisIO accurately and efficiently allocates I/O cycles among applications, purely based on real-time I/O behavior, without requiring user-supplied information or offline-profiled application characteristics. ThemisIO supports a variety of fair sharing policies, such as user-fair and size-fair, as well as composite policies, e.g., group-then-user-fair. All these features are enabled by its statistical token design. ThemisIO can alter the execution order of incoming I/O requests based on assigned tokens to precisely balance I/O cycles between applications via time slicing, thereby enforcing processing isolation. Experiments using I/O benchmarks show that ThemisIO sustains 13.5--13.7% higher I/O throughput and 19.5--40.4% lower performance variation than existing algorithms. For real applications, ThemisIO reduces the slowdown caused by I/O interference by 59.1--99.8%.
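The "statistical token" idea in the abstract, allocating I/O cycles probabilistically in proportion to policy shares while staying work-conserving, can be sketched as a weighted lottery over jobs with pending requests. This is an illustrative sketch only, not ThemisIO's implementation; the job names and shares are made up.

```python
import random
from collections import deque

def schedule(queues, shares, cycles, seed=0):
    """Each I/O cycle, pick which job's pending request to service by weighted
    lottery over jobs that still have queued requests. Because the lottery only
    considers active jobs, shares of idle jobs are effectively reallocated
    (opportunity fairness) instead of leaving the burst buffer idle."""
    rng = random.Random(seed)
    served = {job: 0 for job in queues}
    for _ in range(cycles):
        active = [j for j in queues if queues[j]]
        if not active:
            break
        job = rng.choices(active, weights=[shares[j] for j in active])[0]
        queues[job].popleft()  # service one request from the winning job
        served[job] += 1
    return served

# Two jobs with equal user-fair shares; B issues far fewer requests, so once B
# drains, its unused cycles flow to A rather than going to waste.
queues = {"A": deque(range(1000)), "B": deque(range(100))}
served = schedule(queues, {"A": 1, "B": 1}, cycles=600)
print(served["A"] + served["B"])  # 600
```

Under contention the lottery enforces the share ratio in expectation; under low contention it is work-conserving, which is the high-utilization property the abstract emphasizes.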
Co-design Hardware and Algorithm for Vector Search
Wenqi Jiang, Shigang Li, Yu Zhu, J. D. F. Licht, Zhenhao He, Runbin Shi, Cédric Renggli, Shuai Zhang, Theodoros Rekatsinas, T. Hoefler, G. Alonso
DOI: 10.48550/arXiv.2306.11182 · Published: 2023-06-19 · Citations: 2

Abstract: Vector search has emerged as the foundation for large-scale information retrieval and machine learning systems, with search engines like Google and Bing processing tens of thousands of queries per second on petabyte-scale document datasets by evaluating vector similarities between encoded query texts and web documents. As performance demands for vector search systems surge, accelerated hardware offers a promising solution in the post-Moore's Law era. We introduce FANNS, an end-to-end and scalable vector search framework on FPGAs. Given a user-provided recall requirement on a dataset and a hardware resource budget, FANNS automatically co-designs hardware and algorithm, subsequently generating the corresponding accelerator. The framework also supports scale-out by incorporating a hardware TCP/IP stack in the accelerator. FANNS attains up to 23.0× and 37.2× speedup compared to FPGA and CPU baselines, respectively, and demonstrates superior scalability to GPUs, achieving 5.5× and 7.6× speedup in median and 95th percentile (P95) latency within an eight-accelerator configuration. The remarkable performance of FANNS lays a robust groundwork for future FPGA integration in data centers and AI supercomputers.