SC20: International Conference for High Performance Computing, Networking, Storage and Analysis (Latest Publications)

TAGO: Rethinking Routing Design in High Performance Reconfigurable Networks
Min Yee Teh, Y. Hung, George Michelogiannakis, Shijia Yan, M. Glick, J. Shalf, K. Bergman
DOI: 10.1109/SC41405.2020.00029
Published: 2020-11-01
Abstract: Many reconfigurable network topologies have been proposed in the past. However, efficient routing on top of these flexible interconnects still presents a challenge. In this work, we reevaluate key principles that have guided the designs of many routing protocols on static networks, and see how well those principles apply on reconfigurable network topologies. Based on a theoretical analysis of key properties that routing in a reconfigurable network should satisfy to maximize performance, we propose a topology-aware, globally-direct oblivious (TAGO) routing protocol for reconfigurable topologies. Our proposed routing protocol is simple in design and yet, when deployed in conjunction with a reconfigurable network topology, improves throughput by up to 2.2× compared to established routing protocols and even comes within 10% of the throughput of impractical adaptive routing that has instant global congestion information.
Citations: 6
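The abstract contrasts oblivious routing (no congestion information) with adaptive routing (instant global congestion information). The following toy sketch illustrates the oblivious idea only; the path table, topology, and function names are invented for this illustration and are not the TAGO protocol itself.

```python
import random

def oblivious_route(src, dst, path_table):
    """Pick a precomputed path for (src, dst) without consulting any
    congestion state -- the defining property of oblivious routing."""
    return random.choice(path_table[(src, dst)])

# Invented 3-node reconfigurable topology: a direct hop plus one
# indirect path between nodes 0 and 2.
table = {(0, 2): [[0, 2], [0, 1, 2]]}
path = oblivious_route(0, 2, table)
```

Because the choice ignores congestion, the routing decision is cheap and stateless; the paper's contribution is choosing which paths to put in such a table so that throughput stays high on a reconfigurable topology.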
VERITAS: Accurately Estimating the Correct Output on Noisy Intermediate-Scale Quantum Computers
Tirthak Patel, Devesh Tiwari
DOI: 10.1109/SC41405.2020.00019
Published: 2020-11-01
Abstract: Noisy Intermediate-Scale Quantum (NISQ) machines are being increasingly used to develop quantum algorithms and establish use cases for quantum computing. However, these devices are highly error-prone and can produce output far from the correct output of the quantum algorithm. In this paper, we propose VERITAS, an end-to-end approach to designing quantum experiments, executing them, and correcting the outputs produced by quantum circuits after execution, such that the correct output of the quantum algorithm can be accurately estimated.
Citations: 23
Scaling the Hartree-Fock Matrix Build on Summit
Giuseppe M. J. Barca, David L. Poole, J. Vallejo, Melisa Alkan, C. Bertoni, Alistair P. Rendell, M. Gordon
DOI: 10.1109/SC41405.2020.00085
Published: 2020-11-01
Abstract: Usage of Graphics Processing Units (GPU) has become strategic for simulating the chemistry of large molecular systems, with the majority of top supercomputers utilizing GPUs as their main source of computational horsepower. In this paper, a new fragmentation-based Hartree-Fock matrix build algorithm designed for scaling on many-GPU architectures is presented. The new algorithm uses a novel dynamic load balancing scheme based on a binned shell-pair container to distribute batches of significant shell quartets with the same code path to different GPUs. This maximizes computational throughput and load balancing, and eliminates GPU thread divergence due to integral screening. Additionally, the code uses a novel Fock digestion algorithm to contract electron repulsion integrals into the Fock matrix, which exploits all forms of permutational symmetry and eliminates thread synchronization requirements. The implementation demonstrates excellent scalability on the Summit supercomputer, achieving good strong scaling performance up to 4096 nodes, and linear weak scaling up to 612 nodes.
Citations: 9
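The binned shell-pair scheme described in the abstract (batches that share a code path dispatched to different GPUs) can be caricatured in a few lines. This is a hypothetical sketch: the class keys, batch size, and data layout are invented and bear no relation to the paper's actual GPU implementation.

```python
from collections import defaultdict
from itertools import cycle

def bin_shell_pairs(shell_pairs, key):
    """Group shell pairs by code-path class so every batch follows a
    single kernel path (the abstract's fix for thread divergence)."""
    bins = defaultdict(list)
    for sp in shell_pairs:
        bins[key(sp)].append(sp)
    return bins

def dispatch(bins, n_gpus, batch=2):
    """Hand out same-class batches round-robin across GPUs for load
    balance; each GPU receives (class, batch-of-pairs) work items."""
    work = defaultdict(list)
    gpus = cycle(range(n_gpus))
    for klass, pairs in bins.items():
        for i in range(0, len(pairs), batch):
            work[next(gpus)].append((klass, pairs[i:i + batch]))
    return work

# Invented shell-pair records: (angular-momentum class, index).
pairs = [("ss", 0), ("sp", 1), ("ss", 2), ("pp", 3), ("sp", 4)]
bins = bin_shell_pairs(pairs, key=lambda sp: sp[0])
work = dispatch(bins, n_gpus=2)
```

The point of binning first is that a GPU kernel launched on one batch executes a single code path, so no warp diverges on integral class.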
DRCCTPROF: A Fine-Grained Call Path Profiler for ARM-Based Clusters
Qidong Zhao, Xu Liu, Milind Chabbi
DOI: 10.1109/SC41405.2020.00034
Published: 2020-11-01
Abstract: ARM is an attractive CPU architecture for exascale systems because of its energy efficiency. As a recent entry into the HPC paradigm, ARM lags in its software stack, especially in the performance tooling aspect. Notably, there is a lack of fine-grained measurement tools to analyze fully optimized HPC binary executables on ARM processors. In this paper, we introduce DRCCTPROF, a fine-grained call path profiling framework for binaries running on ARM architectures. The unique ability of DRCCTPROF is to obtain the full calling context at any and every machine instruction that executes, which provides more detailed diagnostic feedback for performance optimization and correctness tools. Furthermore, DRCCTPROF not only associates any instruction with source code along the call path, but also associates memory access instructions back to the constituent data object. Finally, DRCCTPROF incurs moderate overhead and provides a compact view to visualize the profiles collected from parallel executions.
Citations: 2
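A call-path profiler attributes each measurement to its full calling context rather than to a flat function name, typically via a calling context tree (CCT). A minimal CCT sketch follows; the frame names are invented, and the real tool of course instruments machine instructions in binaries, not Python lists.

```python
class CCTNode:
    """One calling-context-tree node: a frame name, a metric count,
    and children keyed by callee name."""
    def __init__(self, name):
        self.name = name
        self.count = 0
        self.children = {}

    def child(self, name):
        # Reuse the existing child for this callee, or create it.
        return self.children.setdefault(name, CCTNode(name))

def record(root, call_path):
    """Attribute one sample to the full call path, so the same leaf
    function reached from different contexts stays distinct."""
    node = root
    for frame in call_path:
        node = node.child(frame)
    node.count += 1

root = CCTNode("<root>")
record(root, ["main", "solve", "dgemm"])
record(root, ["main", "io", "dgemm"])  # same leaf, different context
```

Keeping the two `dgemm` contexts separate is exactly what makes the feedback "fine-grained": a flat profile would merge them into one counter.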
Co-Design for A64FX Manycore Processor and "Fugaku"
M. Sato, Y. Ishikawa, H. Tomita, Yuetsu Kodama, Tetsuya Odajima, Miwako Tsuji, H. Yashiro, Masaki Aoki, Naoyuki Shida, Ikuo Miyoshi, Kouichi Hirai, Atsushi Furuya, A. Asato, K. Morita, T. Shimizu
DOI: 10.1109/SC41405.2020.00051
Published: 2020-11-01
Abstract: We have been carrying out the FLAGSHIP 2020 Project to develop the Japanese next-generation flagship supercomputer, the Post-K, recently named "Fugaku". We have designed an original manycore processor based on the Armv8 instruction set with the Scalable Vector Extension (SVE), the A64FX processor, as well as a system including the interconnect and a storage subsystem, with the industry partner, Fujitsu. The "co-design" of the system and applications is a key to making it power efficient and high performance. We determined many architectural parameters by reflecting an analysis of a set of target applications provided by applications teams. In this paper, we present the pragmatic practice of our co-design effort for "Fugaku". As a result, the system has been proven to be a very power-efficient system, and it is confirmed that the performance of some target applications using the whole system is more than 100 times the performance of the K computer.
Citations: 4
Herring: Rethinking the Parameter Server at Scale for the Cloud
Indu Thangakrishnan, D. Çavdar, C. Karakuş, Piyush Ghai, Yauheni Selivonchyk, Cory Pruce
DOI: 10.1109/SC41405.2020.00048
Published: 2020-11-01
Abstract: Training large deep neural networks is time-consuming and may take days or even weeks to complete. Although parameter-server-based approaches were initially popular in distributed training, scalability issues led the field to move towards all-reduce-based approaches. Recent developments in cloud networking technologies, however, such as the Elastic Fabric Adapter (EFA) and Scalable Reliable Datagram (SRD), motivate a rethinking of the parameter-server approach to address its fundamental inefficiencies. To this end, we introduce a novel communication library, Herring, which is designed to alleviate the performance bottlenecks in parameter-server-based training. We show that gradient reduction with Herring is twice as fast as all-reduce-based methods. We further demonstrate that training deep learning models like BERT-large using Herring outperforms all-reduce-based training, achieving 85% scaling efficiency on large clusters with up to 2048 NVIDIA V100 GPUs without accuracy drop.
Citations: 11
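The parameter-server pattern the abstract revisits works roughly like this: each server owns one shard of the gradient vector, sums the contributions from every worker for its shard, and workers then gather the reduced shards. The sketch below is a toy, pure-Python illustration of that data flow only; the sharding scheme is invented, and Herring's actual EFA/SRD-based protocol is not shown.

```python
def shard(grad, n_servers):
    """Partition a flat gradient list so each server owns one slice."""
    k, m = divmod(len(grad), n_servers)
    out, i = [], 0
    for s in range(n_servers):
        j = i + k + (1 if s < m else 0)  # spread the remainder
        out.append(grad[i:j])
        i = j
    return out

def ps_reduce(worker_grads, n_servers):
    """Each server sums its slice across all workers; concatenating
    the reduced slices reconstructs the fully reduced gradient."""
    per_worker = [shard(g, n_servers) for g in worker_grads]
    reduced = []
    for s in range(n_servers):
        slices = [pw[s] for pw in per_worker]
        reduced.extend(sum(vals) for vals in zip(*slices))
    return reduced

# Two workers, 8 parameters, 4 servers: elementwise sum is 1 + 2 = 3.
grads = [[1.0] * 8, [2.0] * 8]
total = ps_reduce(grads, n_servers=4)
```

Unlike ring all-reduce, each worker talks to servers rather than to peers, which is the topology the paper argues modern cloud NICs make attractive again.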
GPU-Trident: Efficient Modeling of Error Propagation in GPU Programs
Abdul Rehman Anwer, Guanpeng Li, K. Pattabiraman, Michael B. Sullivan, Timothy Tsai, S. Hari
DOI: 10.1109/SC41405.2020.00092
Published: 2020-11-01
Abstract: Fault injection (FI) techniques are typically used to determine the reliability profiles of programs under soft errors. However, these techniques are highly resource- and time-intensive. Prior research developed a model, TRIDENT, to analytically predict Silent Data Corruption (SDC, i.e., incorrect output without any indication) probabilities of single-threaded CPU applications without requiring FIs. Unfortunately, TRIDENT is incompatible with GPU programs, due to their high degree of parallelism and different memory architectures than CPU programs. The main challenge is that modeling error propagation across thousands of threads in a GPU kernel requires enormous amounts of data to be profiled and analyzed, posing a major scalability bottleneck for HPC applications. In this paper, we propose GPU-TRIDENT, an accurate and scalable technique for modeling error propagation in GPU programs. We find that GPU-TRIDENT is two orders of magnitude faster than FI-based approaches, and nearly as accurate in determining the SDC rate of GPU programs.
Citations: 10
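Fault injection, the baseline GPU-Trident replaces, estimates the SDC rate by flipping random bits and comparing each run against a fault-free "golden" output. A toy Monte-Carlo sketch of that baseline follows; the workload is invented (real FI campaigns instrument machine instructions, not Python integers), and masking the sum to its low 16 bits is a contrivance so that some injected flips are benign.

```python
import random

def fi_estimate_sdc(program, inputs, trials=1000, seed=0):
    """Monte-Carlo fault injection: flip one random bit at one random
    site, rerun, and compare against the fault-free golden output."""
    rng = random.Random(seed)
    golden = program(inputs, fault=None)
    sdc = 0
    for _ in range(trials):
        fault = (rng.randrange(len(inputs)), rng.randrange(32))
        if program(inputs, fault=fault) != golden:
            sdc += 1
    return sdc / trials

def toy_program(xs, fault):
    """Invented stand-in workload; keeping only the low 16 bits of the
    sum means flips in the high 16 bits are masked (no SDC)."""
    vals = list(xs)
    if fault is not None:
        i, b = fault
        vals[i] ^= 1 << b
    return sum(vals) & 0xFFFF

rate = fi_estimate_sdc(toy_program, [1, 2, 3, 4])
```

The cost of this approach is the `trials` full re-executions per program, which is exactly the resource intensity the abstract says an analytical model avoids.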
Processing Full-Scale Square Kilometre Array Data on the Summit Supercomputer
Ruonan Wang, R. Tobar, M. Dolensky, Tao An, A. Wicenec, Chen Wu, F. Dulwich, N. Podhorszki, V. Anantharaj, E. Suchyta, B. Lao, S. Klasky
DOI: 10.1109/SC41405.2020.00006
Published: 2020-11-01
Abstract: This work presents a workflow for simulating and processing the full-scale low-frequency telescope data of the Square Kilometre Array (SKA) Phase 1. The SKA project will enter the construction phase soon, and once completed, it will be the world's largest radio telescope and one of the world's largest data generators. The authors used Summit to mimic an end-to-end SKA workflow, simulating a dataset of a typical 6 hour observation and then processing that dataset with an imaging pipeline. This workflow was deployed and run on 4,560 compute nodes, and used 27,360 GPUs to generate 2.6 PB of data. This was the first time that radio astronomical data were processed at this scale. Results show that the workflow has the capability to process one of the key SKA science cases, an Epoch of Reionization observation. This analysis also helps reveal critical design factors for the next-generation radio telescopes and the required dedicated processing facilities.
Citations: 8
Architecture and Performance Studies of 3D-Hyper-FleX-LION for Reconfigurable All-to-All HPC Networks
Gengchen Liu, R. Proietti, Marjan Fariborz, P. Fotouhi, Xian Xiao, S. Yoo
DOI: 10.1109/SC41405.2020.00030
Published: 2020-11-01
Abstract: While the Fat-Tree network topology represents the dominant state-of-art solution for large-scale HPC networks, its scalability in terms of power, latency, complexity, and cost is significantly challenged by the ever-increasing communication bandwidth among tens of thousands of heterogeneous computing nodes. We propose 3D-Hyper-FleX-LION, a flat hybrid electronic-photonic interconnect network that leverages the multichannel nature of modern multi-terabit switch ASICs (with 100 Gb/s granularity) and a reconfigurable all-to-all photonic fabric called Flex-LIONS. Compared to a Fat-Tree network interconnecting the same number of nodes and with the same oversubscription ratio, the proposed 3D-Hyper-FleX-LION offers a 20% smaller diameter, 3× lower power consumption, 10× fewer cable connections, and a 4× reduction in the number of transceivers. When the bandwidth reconfiguration capabilities of Flex-LIONS are exploited for non-uniform traffic workloads, simulation results indicate that 3D-Hyper-FleX-LION can achieve up to a 4× improvement in energy efficiency for synthetic traffic workloads with high locality compared to Fat-Tree.
Citations: 13
Evaluation of a Minimally Synchronous Algorithm for 2:1 Octree Balance
Hansol Suh, T. Isaac
DOI: 10.1109/SC41405.2020.00027
Published: 2020-11-01
Abstract: The p4est library implements octree-based adaptive mesh refinement (AMR) and has demonstrated parallel scalability beyond 100,000 MPI processes in previous weak scaling studies. This work focuses on the strong scalability of mesh adaptivity in p4est, where the communication pattern of the existing 2:1-balance is a latency bottleneck. The sorting-based algorithm of Malhotra and Biros has balanced communication, but synchronizes all processes. We propose an algorithm that combines sorting and neighbor-to-neighbor exchange to minimize the number of processes each process synchronizes with. We measure the performance of these algorithms on several test problems on Stampede2 at TACC. Both the parallel-sorting and minimally-synchronous algorithms significantly outperform the existing algorithm and have nearly identical performance out to 1,024 Xeon Phi KNL nodes, meaning the asymptotic advantage of the minimally-synchronous algorithm does not translate to improved performance at this scale. We conclude by showing that global metadata communication will limit future strong scaling.
Citations: 0
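The 2:1 balance invariant itself is simple to state: any two adjacent leaves of the octree may differ by at most one refinement level. The check below is a toy 1D analogue with an invented adjacency list, not p4est's linear-octree encoding or the communication algorithms the paper evaluates.

```python
def is_2to1_balanced(leaf_levels, neighbors):
    """True when every pair of adjacent leaves differs by at most one
    refinement level -- the 2:1 balance invariant for AMR meshes."""
    return all(abs(leaf_levels[a] - leaf_levels[b]) <= 1
               for a, b in neighbors)

# Invented 1D analogue: leaf id -> refinement level, plus adjacency.
levels = {0: 2, 1: 3, 2: 3, 3: 1}
adj = [(0, 1), (1, 2), (2, 3)]
ok = is_2to1_balanced(levels, adj)  # leaves 2 and 3 differ by 2 levels
```

Enforcing this invariant in parallel is the hard part: refining one leaf to restore balance can violate it at a neighbor owned by another process, which is why the communication pattern dominates the cost.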