SC20: International Conference for High Performance Computing, Networking, Storage and Analysis (Latest Publications)

TAGO: Rethinking Routing Design in High Performance Reconfigurable Networks
Min Yee Teh, Y. Hung, George Michelogiannakis, Shijia Yan, M. Glick, J. Shalf, K. Bergman
DOI: 10.1109/SC41405.2020.00029
Published: 2020-11-01
Abstract: Many reconfigurable network topologies have been proposed in the past. However, efficient routing on top of these flexible interconnects still presents a challenge. In this work, we reevaluate key principles that have guided the designs of many routing protocols on static networks, and see how well those principles apply on reconfigurable network topologies. Based on a theoretical analysis of key properties that routing in a reconfigurable network should satisfy to maximize performance, we propose a topology-aware, globally-direct oblivious (TAGO) routing protocol for reconfigurable topologies. Our proposed routing protocol is simple in design and yet, when deployed in conjunction with a reconfigurable network topology, improves throughput by up to 2.2× compared to established routing protocols and even comes within 10% of the throughput of impractical adaptive routing that has instant global congestion information.
Citations: 6
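The abstract contrasts oblivious routing (no congestion information) with adaptive routing (instant global congestion information). The following toy sketch illustrates the oblivious idea only; the path table, topology, and function names are invented for this illustration and are not the TAGO protocol itself.

```python
import random

def oblivious_route(src, dst, path_table):
    """Pick a precomputed path for (src, dst) without consulting any
    congestion state -- the defining property of oblivious routing."""
    return random.choice(path_table[(src, dst)])

# Invented 3-node reconfigurable topology: a direct hop plus one
# indirect path between nodes 0 and 2.
table = {(0, 2): [[0, 2], [0, 1, 2]]}
path = oblivious_route(0, 2, table)
```

Because the choice ignores congestion, the routing decision is cheap and stateless; the paper's contribution is choosing which paths to put in such a table so that throughput stays high on a reconfigurable topology.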
VERITAS: Accurately Estimating the Correct Output on Noisy Intermediate-Scale Quantum Computers
Tirthak Patel, Devesh Tiwari
DOI: 10.1109/SC41405.2020.00019
Published: 2020-11-01
Abstract: Noisy Intermediate-Scale Quantum (NISQ) machines are being increasingly used to develop quantum algorithms and establish use cases for quantum computing. However, these devices are highly error-prone and can produce output far from the correct output of the quantum algorithm. In this paper, we propose VERITAS, an end-to-end approach to designing quantum experiments, executing them, and correcting the outputs produced by quantum circuits after execution, such that the correct output of the quantum algorithm can be accurately estimated.
Citations: 23
Scaling the Hartree-Fock Matrix Build on Summit
Giuseppe M. J. Barca, David L. Poole, J. Vallejo, Melisa Alkan, C. Bertoni, Alistair P. Rendell, M. Gordon
DOI: 10.1109/SC41405.2020.00085
Published: 2020-11-01
Abstract: Usage of Graphics Processing Units (GPU) has become strategic for simulating the chemistry of large molecular systems, with the majority of top supercomputers utilizing GPUs as their main source of computational horsepower. In this paper, a new fragmentation-based Hartree-Fock matrix build algorithm designed for scaling on many-GPU architectures is presented. The new algorithm uses a novel dynamic load balancing scheme based on a binned shell-pair container to distribute batches of significant shell quartets with the same code path to different GPUs. This maximizes computational throughput and load balancing, and eliminates GPU thread divergence due to integral screening. Additionally, the code uses a novel Fock digestion algorithm to contract electron repulsion integrals into the Fock matrix, which exploits all forms of permutational symmetry and eliminates thread synchronization requirements. The implementation demonstrates excellent scalability on the Summit supercomputer, achieving good strong scaling performance up to 4096 nodes, and linear weak scaling up to 612 nodes.
Citations: 9
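The binned shell-pair scheme described in the abstract (batches that share a code path dispatched to different GPUs) can be caricatured in a few lines. This is a hypothetical sketch: the class keys, batch size, and data layout are invented and bear no relation to the paper's actual GPU implementation.

```python
from collections import defaultdict
from itertools import cycle

def bin_shell_pairs(shell_pairs, key):
    """Group shell pairs by code-path class so every batch follows a
    single kernel path (the abstract's fix for thread divergence)."""
    bins = defaultdict(list)
    for sp in shell_pairs:
        bins[key(sp)].append(sp)
    return bins

def dispatch(bins, n_gpus, batch=2):
    """Hand out same-class batches round-robin across GPUs for load
    balance; each GPU receives (class, batch-of-pairs) work items."""
    work = defaultdict(list)
    gpus = cycle(range(n_gpus))
    for klass, pairs in bins.items():
        for i in range(0, len(pairs), batch):
            work[next(gpus)].append((klass, pairs[i:i + batch]))
    return work

# Invented shell-pair records: (angular-momentum class, index).
pairs = [("ss", 0), ("sp", 1), ("ss", 2), ("pp", 3), ("sp", 4)]
bins = bin_shell_pairs(pairs, key=lambda sp: sp[0])
work = dispatch(bins, n_gpus=2)
```

The point of binning first is that a GPU kernel launched on one batch executes a single code path, so no warp diverges on integral class.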
DRCCTPROF: A Fine-Grained Call Path Profiler for ARM-Based Clusters
Qidong Zhao, Xu Liu, Milind Chabbi
DOI: 10.1109/SC41405.2020.00034
Published: 2020-11-01
Abstract: ARM is an attractive CPU architecture for exascale systems because of its energy efficiency. As a recent entry into the HPC paradigm, ARM lags in its software stack, especially in the performance tooling aspect. Notably, there is a lack of fine-grained measurement tools to analyze fully optimized HPC binary executables on ARM processors. In this paper, we introduce DRCCTPROF, a fine-grained call path profiling framework for binaries running on ARM architectures. The unique ability of DRCCTPROF is to obtain the full calling context at any and every machine instruction that executes, which provides more detailed diagnostic feedback for performance optimization and correctness tools. Furthermore, DRCCTPROF not only associates any instruction with source code along the call path, but also associates memory access instructions back to the constituent data object. Finally, DRCCTPROF incurs moderate overhead and provides a compact view to visualize the profiles collected from parallel executions.
Citations: 2
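A call-path profiler attributes each measurement to its full calling context rather than to a flat function name, typically via a calling context tree (CCT). A minimal CCT sketch follows; the frame names are invented, and the real tool of course instruments machine instructions in binaries, not Python lists.

```python
class CCTNode:
    """One calling-context-tree node: a frame name, a metric count,
    and children keyed by callee name."""
    def __init__(self, name):
        self.name = name
        self.count = 0
        self.children = {}

    def child(self, name):
        # Reuse the existing child for this callee, or create it.
        return self.children.setdefault(name, CCTNode(name))

def record(root, call_path):
    """Attribute one sample to the full call path, so the same leaf
    function reached from different contexts stays distinct."""
    node = root
    for frame in call_path:
        node = node.child(frame)
    node.count += 1

root = CCTNode("<root>")
record(root, ["main", "solve", "dgemm"])
record(root, ["main", "io", "dgemm"])  # same leaf, different context
```

Keeping the two `dgemm` contexts separate is exactly what makes the feedback "fine-grained": a flat profile would merge them into one counter.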
Co-Design for A64FX Manycore Processor and "Fugaku"
M. Sato, Y. Ishikawa, H. Tomita, Yuetsu Kodama, Tetsuya Odajima, Miwako Tsuji, H. Yashiro, Masaki Aoki, Naoyuki Shida, Ikuo Miyoshi, Kouichi Hirai, Atsushi Furuya, A. Asato, K. Morita, T. Shimizu
DOI: 10.1109/SC41405.2020.00051
Published: 2020-11-01
Abstract: We have been carrying out the FLAGSHIP 2020 Project to develop the Japanese next-generation flagship supercomputer, the Post-K, recently named "Fugaku". We have designed an original manycore processor based on the Armv8 instruction set with the Scalable Vector Extension (SVE), the A64FX processor, as well as a system including the interconnect and a storage subsystem, with the industry partner, Fujitsu. The "co-design" of the system and applications is a key to making it power efficient and high performance. We determined many architectural parameters by reflecting an analysis of a set of target applications provided by applications teams. In this paper, we present the pragmatic practice of our co-design effort for "Fugaku". As a result, the system has been proven to be a very power-efficient system, and it is confirmed that the performance of some target applications using the whole system is more than 100 times the performance of the K computer.
Citations: 4
Herring: Rethinking the Parameter Server at Scale for the Cloud
Indu Thangakrishnan, D. Çavdar, C. Karakuş, Piyush Ghai, Yauheni Selivonchyk, Cory Pruce
DOI: 10.1109/SC41405.2020.00048
Published: 2020-11-01
Abstract: Training large deep neural networks is time-consuming and may take days or even weeks to complete. Although parameter-server-based approaches were initially popular in distributed training, scalability issues led the field to move towards all-reduce-based approaches. Recent developments in cloud networking technologies, however, such as the Elastic Fabric Adapter (EFA) and Scalable Reliable Datagram (SRD), motivate a rethinking of the parameter-server approach to address its fundamental inefficiencies. To this end, we introduce a novel communication library, Herring, which is designed to alleviate the performance bottlenecks in parameter-server-based training. We show that gradient reduction with Herring is twice as fast as all-reduce-based methods. We further demonstrate that training deep learning models like BERT-large using Herring outperforms all-reduce-based training, achieving 85% scaling efficiency on large clusters with up to 2048 NVIDIA V100 GPUs without accuracy drop.
Citations: 11
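The parameter-server pattern the abstract revisits works roughly like this: each server owns one shard of the gradient vector, sums the contributions from every worker for its shard, and workers then gather the reduced shards. The sketch below is a toy, pure-Python illustration of that data flow only; the sharding scheme is invented, and Herring's actual EFA/SRD-based protocol is not shown.

```python
def shard(grad, n_servers):
    """Partition a flat gradient list so each server owns one slice."""
    k, m = divmod(len(grad), n_servers)
    out, i = [], 0
    for s in range(n_servers):
        j = i + k + (1 if s < m else 0)  # spread the remainder
        out.append(grad[i:j])
        i = j
    return out

def ps_reduce(worker_grads, n_servers):
    """Each server sums its slice across all workers; concatenating
    the reduced slices reconstructs the fully reduced gradient."""
    per_worker = [shard(g, n_servers) for g in worker_grads]
    reduced = []
    for s in range(n_servers):
        slices = [pw[s] for pw in per_worker]
        reduced.extend(sum(vals) for vals in zip(*slices))
    return reduced

# Two workers, 8 parameters, 4 servers: elementwise sum is 1 + 2 = 3.
grads = [[1.0] * 8, [2.0] * 8]
total = ps_reduce(grads, n_servers=4)
```

Unlike ring all-reduce, each worker talks to servers rather than to peers, which is the topology the paper argues modern cloud NICs make attractive again.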
GPU-Trident: Efficient Modeling of Error Propagation in GPU Programs
Abdul Rehman Anwer, Guanpeng Li, K. Pattabiraman, Michael B. Sullivan, Timothy Tsai, S. Hari
DOI: 10.1109/SC41405.2020.00092
Published: 2020-11-01
Abstract: Fault injection (FI) techniques are typically used to determine the reliability profiles of programs under soft errors. However, these techniques are highly resource- and time-intensive. Prior research developed a model, TRIDENT, to analytically predict Silent Data Corruption (SDC, i.e., incorrect output without any indication) probabilities of single-threaded CPU applications without requiring FIs. Unfortunately, TRIDENT is incompatible with GPU programs, due to their high degree of parallelism and different memory architectures than CPU programs. The main challenge is that modeling error propagation across thousands of threads in a GPU kernel requires enormous amounts of data to be profiled and analyzed, posing a major scalability bottleneck for HPC applications. In this paper, we propose GPU-TRIDENT, an accurate and scalable technique for modeling error propagation in GPU programs. We find that GPU-TRIDENT is two orders of magnitude faster than FI-based approaches, and nearly as accurate in determining the SDC rate of GPU programs.
Citations: 10
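Fault injection, the baseline GPU-Trident replaces, estimates the SDC rate by flipping random bits and comparing each run against a fault-free "golden" output. A toy Monte-Carlo sketch of that baseline follows; the workload is invented (real FI campaigns instrument machine instructions, not Python integers), and masking the sum to its low 16 bits is a contrivance so that some injected flips are benign.

```python
import random

def fi_estimate_sdc(program, inputs, trials=1000, seed=0):
    """Monte-Carlo fault injection: flip one random bit at one random
    site, rerun, and compare against the fault-free golden output."""
    rng = random.Random(seed)
    golden = program(inputs, fault=None)
    sdc = 0
    for _ in range(trials):
        fault = (rng.randrange(len(inputs)), rng.randrange(32))
        if program(inputs, fault=fault) != golden:
            sdc += 1
    return sdc / trials

def toy_program(xs, fault):
    """Invented stand-in workload; keeping only the low 16 bits of the
    sum means flips in the high 16 bits are masked (no SDC)."""
    vals = list(xs)
    if fault is not None:
        i, b = fault
        vals[i] ^= 1 << b
    return sum(vals) & 0xFFFF

rate = fi_estimate_sdc(toy_program, [1, 2, 3, 4])
```

The cost of this approach is the `trials` full re-executions per program, which is exactly the resource intensity the abstract says an analytical model avoids.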
Processing Full-Scale Square Kilometre Array Data on the Summit Supercomputer
Ruonan Wang, R. Tobar, M. Dolensky, Tao An, A. Wicenec, Chen Wu, F. Dulwich, N. Podhorszki, V. Anantharaj, E. Suchyta, B. Lao, S. Klasky
DOI: 10.1109/SC41405.2020.00006
Published: 2020-11-01
Abstract: This work presents a workflow for simulating and processing the full-scale low-frequency telescope data of the Square Kilometre Array (SKA) Phase 1. The SKA project will enter the construction phase soon, and once completed, it will be the world's largest radio telescope and one of the world's largest data generators. The authors used Summit to mimic an end-to-end SKA workflow, simulating a dataset of a typical 6 hour observation and then processing that dataset with an imaging pipeline. This workflow was deployed and run on 4,560 compute nodes, and used 27,360 GPUs to generate 2.6 PB of data. This was the first time that radio astronomical data were processed at this scale. Results show that the workflow has the capability to process one of the key SKA science cases, an Epoch of Reionization observation. This analysis also helps reveal critical design factors for the next-generation radio telescopes and the required dedicated processing facilities.
Citations: 8
Architecture and Performance Studies of 3D-Hyper-FleX-LION for Reconfigurable All-to-All HPC Networks
Gengchen Liu, R. Proietti, Marjan Fariborz, P. Fotouhi, Xian Xiao, S. Yoo
DOI: 10.1109/SC41405.2020.00030
Published: 2020-11-01
Abstract: While the Fat-Tree network topology represents the dominant state-of-art solution for large-scale HPC networks, its scalability in terms of power, latency, complexity, and cost is significantly challenged by the ever-increasing communication bandwidth among tens of thousands of heterogeneous computing nodes. We propose 3D-Hyper-FleX-LION, a flat hybrid electronic-photonic interconnect network that leverages the multichannel nature of modern multi-terabit switch ASICs (with 100 Gb/s granularity) and a reconfigurable all-to-all photonic fabric called Flex-LIONS. Compared to a Fat-Tree network interconnecting the same number of nodes and with the same oversubscription ratio, the proposed 3D-Hyper-FleX-LION offers a 20% smaller diameter, 3× lower power consumption, 10× fewer cable connections, and a 4× reduction in the number of transceivers. When the bandwidth reconfiguration capabilities of Flex-LIONS are exploited for non-uniform traffic workloads, simulation results indicate that 3D-Hyper-FleX-LION can achieve up to a 4× improvement in energy efficiency for synthetic traffic workloads with high locality compared to Fat-Tree.
Citations: 13
Evaluation of a Minimally Synchronous Algorithm for 2:1 Octree Balance
Hansol Suh, T. Isaac
DOI: 10.1109/SC41405.2020.00027
Published: 2020-11-01
Abstract: The p4est library implements octree-based adaptive mesh refinement (AMR) and has demonstrated parallel scalability beyond 100,000 MPI processes in previous weak scaling studies. This work focuses on the strong scalability of mesh adaptivity in p4est, where the communication pattern of the existing 2:1-balance is a latency bottleneck. The sorting-based algorithm of Malhotra and Biros has balanced communication, but synchronizes all processes. We propose an algorithm that combines sorting and neighbor-to-neighbor exchange to minimize the number of processes each process synchronizes with. We measure the performance of these algorithms on several test problems on Stampede2 at TACC. Both the parallel-sorting and minimally-synchronous algorithms significantly outperform the existing algorithm and have nearly identical performance out to 1,024 Xeon Phi KNL nodes, meaning the asymptotic advantage of the minimally-synchronous algorithm does not translate to improved performance at this scale. We conclude by showing that global metadata communication will limit future strong scaling.
Citations: 0
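The 2:1 balance invariant itself is simple to state: any two adjacent leaves of the octree may differ by at most one refinement level. The check below is a toy 1D analogue with an invented adjacency list, not p4est's linear-octree encoding or the communication algorithms the paper evaluates.

```python
def is_2to1_balanced(leaf_levels, neighbors):
    """True when every pair of adjacent leaves differs by at most one
    refinement level -- the 2:1 balance invariant for AMR meshes."""
    return all(abs(leaf_levels[a] - leaf_levels[b]) <= 1
               for a, b in neighbors)

# Invented 1D analogue: leaf id -> refinement level, plus adjacency.
levels = {0: 2, 1: 3, 2: 3, 3: 1}
adj = [(0, 1), (1, 2), (2, 3)]
ok = is_2to1_balanced(levels, adj)  # leaves 2 and 3 differ by 2 levels
```

Enforcing this invariant in parallel is the hard part: refining one leaf to restore balance can violate it at a neighbor owned by another process, which is why the communication pattern dominates the cost.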