Title: Power and energy efficient routing for Mach-Zehnder interferometer based photonic switches
Authors: Markos Kynigos, J. A. Pascual, J. Navaridas, J. Goodacre, M. Luján
DOI: https://doi.org/10.1145/3447818.3460363
Abstract: Silicon photonic top-of-rack (ToR) switches are highly desirable in the datacenter (DC) and high-performance computing (HPC) domains for their potential for high bandwidth and energy efficiency. Recently, photonic Beneš switching fabrics based on Mach-Zehnder Interferometers (MZIs) have been proposed as a promising candidate for the internals of high-performance switches. However, state-of-the-art routing algorithms that control these switching fabrics are either computationally complex or unable to provide non-blocking, energy-efficient routing permutations. To address this, we propose for the first time a combination of energy-efficient routing algorithms and time-division multiplexing (TDM). We evaluate this approach with a simulation-based performance evaluation of a 16x16 Beneš fabric, deployed as a ToR switch, handling a set of 8 representative workloads from the DC and HPC domains. Our results show that state-of-the-art approaches (circuit-switched, energy-efficient routing algorithms) introduce up to 23% contention in the switching fabric for some workloads, thereby increasing communication time. We show that augmenting the algorithms with TDM can ameliorate switch-fabric contention by segmenting communication data and gracefully interleaving the segments, reducing communication time by up to 20% in the best case. We also discuss the impact of the TDM segment size, finding that although a 10KB segment size is the most beneficial in reducing communication time, a 100KB segment size offers similar performance while requiring a less stringent path-computation time window. Finally, we assess the impact of TDM on path-dependent insertion loss and switching energy consumption, finding it to be minimal in all cases.

{"title":"Optimizing large-scale plasma simulations on persistent memory-based heterogeneous memory with effective data placement across memory hierarchy","authors":"Jie Ren, Jiaolin Luo, I. Peng, Kai Wu, Dong Li","doi":"10.1145/3447818.3460356","DOIUrl":"https://doi.org/10.1145/3447818.3460356","url":null,"abstract":"Particle simulations of plasma are important for understanding plasma dynamics in space weather and fusion devices. However, production simulations that use billions and even trillions of computational particles require high memory capacity. In this work, we explore the latest persistent memory (PM) hardware to enable large-scale plasma simulations at unprecedented scales on a single machine. We use WarpX, an advanced plasma simulation code which is mission-critical and targets future exascale systems. We analyze the performance of WarpX on PM-based heterogeneous memory systems and propose to make the best use of memory hierarchy to avoid the impact of inferior performance of PM. We introduce a combination of static and dynamic data placement, and processor-cache prefetch mechanism for performance optimization. We develop a performance model to enable efficient data migration between PM and DRAM in the background, without reducing available bandwidth and parallelism to the application threads. We also build an analytical model to decide when to prefetch for the best use of caches. Our design achieves 66.4% performance improvement over the PM-only baseline and outperforms DRAM-cached, NUMA first-touch, and a state-of-the-art software solution by 38.8%, 45.1% and 83.3%, respectively.","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88079377","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: A systematic approach to improving data locality across Fourier transforms and linear algebra operations
Authors: Doru-Thom Popovici, A. Canning, Zhengji Zhao, Lin-wang Wang, J. Shalf
DOI: https://doi.org/10.1145/3447818.3460354
Abstract: The performance of most scientific applications depends on efficient mathematical libraries. For example, scientific applications like the plane-wave-based Density Functional Theory approach for electronic structure calculations use highly optimized libraries for Fourier transforms, dense linear algebra (orthogonalization), and sparse linear algebra (non-local projectors in real space). Although vendor-tuned libraries offer efficient implementations of each standalone mathematical kernel, partitioning those calls into sequentially invoked kernels inhibits cross-kernel optimizations that could improve data locality across memory-bound operations. In this work we show that, by expressing these kernels as operations on high-dimensional tensors, cross-kernel dataflow optimizations that span FFT, dense, and sparse linear algebra can be readily exposed and exploited. We outline a systematic way of merging the Fourier transforms with the linear algebra computations, improving data locality and reducing data movement to main memory. We show that, compared to a baseline code that uses vendor-optimized libraries, this streaming/dataflow approach offers a 2x speedup on GPUs and 8x/12x speedups on CPUs. Although we use Density Functional Theory to demonstrate the value of our approach, our methodology is broadly applicable to other applications that use Fourier transforms and linear algebra operations as building blocks.

Title: An optimized tensor completion library for multiple GPUs
Authors: Ming Dun, Yunchun Li, Hailong Yang, Qingxiao Sun, Zhongzhi Luan, D. Qian
DOI: https://doi.org/10.1145/3447818.3460692
Abstract: Tensor computations are gaining wide adoption in big data analysis and artificial intelligence. Among them, tensor completion is used to predict missing or unobserved values in tensors. Decomposition-based tensor completion algorithms have attracted significant research attention since they exhibit better parallelization and scalability. However, existing optimization techniques for tensor completion cannot sustain the increasing demand for applying tensor completion to ever larger tensor data. To address these limitations, we develop cuTC, the first tensor completion library for multiple Graphics Processing Units (GPUs), with three widely used optimization algorithms: alternating least squares (ALS), stochastic gradient descent (SGD), and coordinate descent (CCD+). We propose a novel TB-COO format that leverages warp shuffle and shared memory on the GPU to enable efficient reduction. In addition, we adopt an auto-tuning method to determine the optimal parameters for better convergence and performance. We compare cuTC with state-of-the-art tensor completion libraries on real-world datasets, and the results show that cuTC achieves significant speedups with similar or even better accuracy.

{"title":"Inter-loop optimization in RAJA using loop chains","authors":"Brandon Neth, T. Scogland, B. Supinski, M. Strout","doi":"10.1145/3447818.3461665","DOIUrl":"https://doi.org/10.1145/3447818.3461665","url":null,"abstract":"Typical parallelization approaches such as OpenMP and CUDA provide constructs for parallelizing and blocking for data locality for individual loops. By focusing on each loop separately, these approaches fail to leverage sources of data locality possible due to inter-loop data reuse. The loop chain abstraction provides a framework for reasoning about and applying inter-loop optimizations. In this work, we incorporate the loop chain abstraction into RAJA, a performance portability layer for high-performance computing applications. Using the loop-chain-extended RAJA, or RAJALC, developers can have the RAJA library apply loop transformations like loop fusion and overlapped tiling while maintaining the original structure of their programs. By introducing targeted symbolic execution capabilities, we can collect and cache data access information required to verify loop transformations. We evaluate the performance improvement and refactoring costs of our extension. Overall, our results demonstrate 85-98% of the performance improvements of hand-optimized kernels with dramatically fewer code changes.","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"22 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80901068","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Distributed merge forest: a new fast and scalable approach for topological analysis at scale
Authors: Xuan Huang, Pavol Klacansky, Steve Petruzza, A. Gyulassy, P. Bremer, Valerio Pascucci
DOI: https://doi.org/10.1145/3447818.3460358
Abstract: Topological analysis is used in several domains to identify and characterize important features in scientific data, and is now one of the established classes of techniques of proven practical use in scientific computing. The growth in parallelism and problem size tackled by modern simulations poses a particular challenge for these approaches. Fundamentally, the global encoding of topological features necessitates interprocess communication that limits their scaling. In this paper, we extend a new topological paradigm to the case of distributed computing, where the construction of a global merge tree is replaced by a distributed data structure, the merge forest, trading slower individual queries on the structure for faster end-to-end performance and scaling. Empirically, the queries that are most negatively affected also tend to have limited practical use. Our experimental results demonstrate the scalability of both the merge forest construction and the parallel queries needed in scientific workflows, and contrast this scalability with the two established alternatives that construct variations of a global tree.

{"title":"Proceedings of the ACM International Conference on Supercomputing","authors":"","doi":"10.1145/3447818","DOIUrl":"https://doi.org/10.1145/3447818","url":null,"abstract":"","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"17 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91040679","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: NPBench
Authors: A. Ziogas, Tal Ben-Nun, Timo Schneider, T. Hoefler
DOI: https://doi.org/10.1145/3447818.3460360
Abstract: Python, already one of the most popular languages for scientific computing, has made significant inroads in High Performance Computing (HPC). At the center of Python's ecosystem is NumPy, an efficient implementation of the multi-dimensional array (tensor) structure, together with basic arithmetic and linear algebra. Compared to traditional HPC languages, the relatively low performance of Python and NumPy has spawned significant research in compilers and frameworks that decouple Python's compact representation from the underlying implementation. However, it is challenging to compare language compatibility and performance among different frameworks and architectures without a standard set of benchmarks and metrics. To that end, we introduce NPBench, a set of NumPy code samples representing a large variety of HPC applications. We use NPBench to test popular NumPy-accelerating compilers and frameworks on a variety of metrics. NPBench will guide both end-users and framework developers focusing on performance and will drive further use of Python in the high-performance scientific domains.

Title: Accelerating DNNs inference with predictive layer fusion
Authors: MohammadHossein Olyaiy, Christopher Ng, Mieszko Lis
DOI: https://doi.org/10.1145/3447818.3460378
Abstract: Many modern convolutional neural networks (CNNs) rely on bottleneck block structures where the activation tensor is mapped between higher dimensions using an intermediate low dimension, and convolved with depthwise feature filters rather than multi-channel filters. Because most of the computation lies in computing the large dimensional tensors, however, such networks cannot be scaled without significant computation costs. In this paper, we show how fusing the layers inside these blocks can dramatically reduce the multiplication count (by 6-20x) at the cost of extra additions. ReLU nonlinearities are predicted dynamically, and only the activations that survive ReLU contribute directly to computing the output of the block. We also propose FusioNet, a CNN architecture optimized for fusion, as well as ARCHON, a novel accelerator design with a dataflow optimized for fused networks. When FusioNet is executed on the proposed accelerator, it yields up to 5.8x faster inference compared to compact networks executed on a dense DNN accelerator, and 2.1x faster inference compared to the same networks when pruned and executed on a sparse DNN accelerator.

Title: Distributed-memory parallel algorithms for sparse times tall-skinny-dense matrix multiplication
Authors: Oguz Selvitopi, Benjamin Brock, Israt Nisa, Alok Tripathy, K. Yelick, A. Buluç
DOI: https://doi.org/10.1145/3447818.3461472
Abstract: Sparse times dense matrix multiplication (SpMM) finds applications in well-established fields such as computational linear algebra as well as emerging fields such as graph neural networks. In this study, we evaluate the performance of various techniques for performing SpMM as a distributed computation across many nodes, focusing on GPU accelerators. We examine how the actual local computational performance of state-of-the-art SpMM implementations affects computational efficiency as dimensions change when we scale to large numbers of nodes, which proves to be an unexpectedly important bottleneck. We also consider various distribution strategies, including A-Stationary, B-Stationary, and C-Stationary algorithms, 1.5D and 2D algorithms, and RDMA-based and bulk-synchronous methods of data transfer. Our results show that the best choice of algorithm and implementation technique depends not only on the cost of communication for particular matrix sizes and dimensions, but also on the performance of local SpMM operations. Our evaluations reveal that with the involvement of GPU accelerators, the best design choices for SpMM differ from the conventional algorithms that are known to perform well for dense matrix-matrix or sparse matrix-sparse matrix multiplies.
