ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing最新文献

筛选
英文 中文
ALTO
A. Helal, Jan Laukemann, Fabio Checconi, Jesmin Jahan Tithi, Teresa M. Ranadive, F. Petrini, Jeewhan Choi
{"title":"ALTO","authors":"A. Helal, Jan Laukemann, Fabio Checconi, Jesmin Jahan Tithi, Teresa M. Ranadive, F. Petrini, Jeewhan Choi","doi":"10.1145/3447818.3461703","DOIUrl":"https://doi.org/10.1145/3447818.3461703","url":null,"abstract":"The analysis of high-dimensional sparse data is becoming increasingly popular in many important domains. However, real-world sparse tensors are challenging to process due to their irregular shapes and data distributions. We propose the Adaptive Linearized Tensor Order (ALTO) format, a novel mode-agnostic (general) representation that keeps neighboring nonzero elements in the multi-dimensional space close to each other in memory. To generate the indexing metadata, ALTO uses an adaptive bit encoding scheme that trades off index computations for lower memory usage and more effective use of memory bandwidth. Moreover, by decoupling its sparse representation from the irregular spatial distribution of nonzero elements, ALTO eliminates the workload imbalance and greatly reduces the synchronization overhead of tensor computations. As a result, the parallel performance of ALTO-based tensor operations becomes a function of their inherent data reuse. On a gamut of tensor datasets, ALTO outperforms an oracle that selects the best state-of-the-art format for each dataset, when used in key tensor decomposition operations. Specifically, ALTO achieves a geometric mean speedup of 8x over the best mode-agnostic (coordinate and hierarchical coordinate) formats, while delivering a geometric mean compression ratio of 4.x relative to the best mode-specific (compressed sparse fiber) formats.","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":" 3","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91412390","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 14
μSteal
Amirhossein Mirhosseini, T. Wenisch
{"title":"μSteal","authors":"Amirhossein Mirhosseini, T. Wenisch","doi":"10.1145/3447818.3463529","DOIUrl":"https://doi.org/10.1145/3447818.3463529","url":null,"abstract":"Modern internet services are moving towards distributed microservice architectures, wherein a complex application is decomposed into numerous discrete microservices to improve programmability, reliability, manageability, and scalability. A key property of microservice-based architectures is that common microservices may be shared by multiple end-to-end cloud services. As an example, a speech-recognition microservice might serve as an early node in the microservice graphs of several end-to-end services. However, given the dissimilarities across microservice graphs and varying end-to-end latency constraints across services, shared microservices may need to operate under differing latency constraints for each service. As a result, in existing systems, most providers either deploy multiple instance pools for each latency constraint, or require all requests to needlessly meet the most stringent constraint. In this paper, we argue that sharing microservice instances across multiple services can reduce significantly the number of instances, especially under highly asymmetric latency constraints. We propose a request scheduling mechanism, called μSteal, which leverages preemptive work and resource stealing to schedule the arriving requests to cores within a ``mixed-criticality'' microservice instance. μSteal provisions ``core reservations'' for each request class based on their latency requirements, but allows a class to steal cores from other classes if they would otherwise remain idle. But, when a class requires its full reservation, μSteal preempts stolen cores, returning them to their reserved class. μSteal employs a runtime feedback controller augmented by a queuing theory-based analytical model to tune core reservations across classes, seeking to maximize the request throughput within each instance while meeting all classes' latency constraints. We show that μSteal reduces required instances for several shared microservice deployments by 1.29x as compared to deploying multiple, segregated instance pools.","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"136 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77940233","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 4
SumMerge: an efficient algorithm and implementation for weight repetition-aware DNN inference SumMerge:权重重复感知深度神经网络推理的有效算法和实现
Rohan Baskar Prabhakar, Sachit Kuhar, R. Agrawal, C. Hughes, Christopher W. Fletcher
{"title":"SumMerge: an efficient algorithm and implementation for weight repetition-aware DNN inference","authors":"Rohan Baskar Prabhakar, Sachit Kuhar, R. Agrawal, C. Hughes, Christopher W. Fletcher","doi":"10.1145/3447818.3460375","DOIUrl":"https://doi.org/10.1145/3447818.3460375","url":null,"abstract":"Deep Neural Network (DNN) inference efficiency is a key concern across the myriad of domains now relying on Deep Learning. A recent promising direction to speed-up inference is to exploit emph{weight repetition}. The key observation is that due to DNN quantization schemes---which attempt to reduce DNN storage requirements by reducing the number of bits needed to represent each weight---the same weight is bound to repeat many times within and across filters. This enables a weight-repetition aware inference kernel to factorize and memoize out common sub-computations, reducing arithmetic per inference while still maintaining the compression benefits of quantization. Yet, significant challenges remain. For instance, weight repetition introduces significant irregularity in the inference operation and hence (up to this point) has required custom hardware accelerators to derive net benefit. This paper proposes SumMerge: a new algorithm and set of implementation techniques to make weight repetition practical on general-purpose devices such as CPUs. The key idea is to formulate inference as traversing a sequence of data-flow graphs emph{with weight-dependent structure}. We develop an offline heuristic to select a data-flow graph structure that minimizes arithmetic operations per inference (given trained weight values) and use an efficient online procedure to traverse each data-flow graph and compute the inference result given DNN inputs. We implement the above as an optimized C++ routine that runs on a commercial multicore processor with vector extensions and evaluate performance relative to Intel's optimized library oneDNN and the prior-art weight repetition algorithm (AGR). When applied on top of six different quantization schemes, SumMerge achieves a speedup of between 1.09x-2.05x and 1.04x-1.51x relative to oneDNN and AGR, respectively, while simultaneously compressing the DNN model by 8.7x to 15.4x.","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"47 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91064216","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 5
Does it matter?: OMPSanitizer: an impact analyzer of reported data races in OpenMP programs 这有关系吗?OMPSanitizer: OpenMP程序中报告的数据竞争的影响分析器
Wenwen Wang, Pei-Hung Lin
{"title":"Does it matter?: OMPSanitizer: an impact analyzer of reported data races in OpenMP programs","authors":"Wenwen Wang, Pei-Hung Lin","doi":"10.1145/3447818.3460379","DOIUrl":"https://doi.org/10.1145/3447818.3460379","url":null,"abstract":"Data races are a primary source of concurrency bugs in parallel programs. Yet, debugging data races is not easy, even with a large amount of data race detection tools. In particular, there still exists a manually-intensive and time-consuming investigation process after data races are reported by existing race detection tools. To address this issue, we present OMPSanitizer in this paper. OMPSanitizer employs a novel and semantic-aware impact analysis mechanism to assess the potential impact of detected data races so that developers can focus on data races with a high probability to produce a harmful impact. This way, OMPSanitizer can remove the heavy debugging burden of data races from developers and simultaneously enhance the debugging efficiency. We have implemented OMPSanitizer based on the widely-used dynamic binary instrumentation infrastructure, Intel Pin. Our evaluation results on a broad range of OpenMP programs from the DataRaceBench benchmark suite and an ECP Proxy application demonstrate that OMPSanitizer can precisely report the impact of data races detected by existing race detectors, e.g., Helgrind and ThreadSanitizer. We believe OMPSanitizer will provide a new perspective on automating the debugging support for data races in OpenMP programs.","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"150 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74736179","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 2
A practical tile size selection model for affine loop nests 一个实用的仿射环巢瓷砖尺寸选择模型
Kumudha Narasimhan, Aravind Acharya, Abhinav Baid, Uday Bondhugula
{"title":"A practical tile size selection model for affine loop nests","authors":"Kumudha Narasimhan, Aravind Acharya, Abhinav Baid, Uday Bondhugula","doi":"10.1145/3447818.3462213","DOIUrl":"https://doi.org/10.1145/3447818.3462213","url":null,"abstract":"Loop tiling for locality is an important transformation for general-purpose and domain-specific compilation as it allows programs to exploit the benefits of deep memory hierarchies. Most code generation tools with the infrastructure to perform automatic tiling of loop nests rely on auto-tuning to find good tile sizes. Tile size selection models proposed in the literature either fall back to modeling complex non-linear optimization problems or tackle a narrow class of inputs. Hence, a fast and generic tile size selection model is desirable for it to be adopted into compiler infrastructures like those of GCC, LLVM, or MLIR. In this paper, we propose a new, fast and lightweight tile size selection model that considers temporal and spatial reuse along dimensions of a loop nest. For an n-dimensional loop nest, we determine the tile sizes by calculating the zeros of a polynomial in a single variable of degree at most n. Our tile size calculation model also accounts for vectorizability of the innermost dimension. We demonstrate the generality of our approach by selecting benchmarks from various domains: linear algebra kernels, digital signal processing (DSP) and image processing. We implement our tile size selection model in PolyMage (a domain-specific language and compiler for image processing pipelines) and Pluto (state-of-the-art polyhedral auto-parallelizer). Implementing the model in PolyMage allows us to extend it to DSP and linear algebra domains and also incorporate idiom recognition phases so that optimized vendor-specific library implementations could be utilized whenever profitable. Our experiments demonstrate a significant geomean performance gain of 2.2x over Matlab on benchmarks from the DSP domain. For PolyBench, we obtain a geomean speedup of 1.04x (maximum speedup of 1.3x) over Pluto.","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"35 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86228203","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 6
A performance portability framework for Python Python的性能可移植性框架
Nader Al Awar, Steven Zhu, G. Biros, Miloš Gligorić
{"title":"A performance portability framework for Python","authors":"Nader Al Awar, Steven Zhu, G. Biros, Miloš Gligorić","doi":"10.1145/3447818.3460376","DOIUrl":"https://doi.org/10.1145/3447818.3460376","url":null,"abstract":"Kokkos is a programming model for writing performance portable applications for all major high performance computing platforms. It provides abstractions for data management and common parallel operations, allowing developers to write portable high performance code with minimal knowledge of architecture-specific details. Kokkos is implemented as a heavily-templated C++ library. However, C++ is not ideal for rapid prototyping and quick algorithmic exploration. An increasing number of developers use Python for scientific computing, machine learning, and data analytics. In this paper, we present a new Python framework, dubbed PyKokkos, for writing performance portable applications entirely in Python. PyKokkos provides Kokkos-like abstractions that are easier to use and more concise than the C++ interface. We implemented PyKokkos by building a translator from a subset of Python to C++ Kokkos and bridging necessary function calls via automatically generated Python bindings. PyKokkos is also compatible with NumPy, a widely-used high performance Python library. By porting several existing Kokkos applications to PyKokkos, including ExaMiniMD (∼3k lines of code in C++), we show that the latter can achieve efficient execution with low performance overhead.","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"18 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83282241","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 14
Topology-aware optimizations for multi-GPU ptychographic image reconstruction 多gpu平面图像重建的拓扑感知优化
Xiaodong Yu, Tekin Bicer, R. Kettimuthu, Ian T Foster
{"title":"Topology-aware optimizations for multi-GPU ptychographic image reconstruction","authors":"Xiaodong Yu, Tekin Bicer, R. Kettimuthu, Ian T Foster","doi":"10.1145/3447818.3460380","DOIUrl":"https://doi.org/10.1145/3447818.3460380","url":null,"abstract":"Ptychography is an advanced high-resolution X-ray imaging technique that can generate extremely large datasets. Ptychographic reconstruction transforms reciprocal space experimental data to high-resolution 2D real-space images. GPUs have been used extensively to meet the computational requirements of the reconstruction. Generic multi-GPU reconstruction solutions use common communication topologies, such as P2P graph and ring, that are provided by MPI and NCCL libraries, to establish inter-GPU communications. However, these common topologies assume homogeneous physical links between GPUs, resulting in sub-optimal performance on heterogeneous configurations that are composed of both high- (e.g., NVLink) and low-speed (e.g., PCIe) interconnects. This mismatch between application-level communication topology and physical interconnection can cause data transfer congestion, inefficient memory access, and under-utilization of network resources. Here we present topology-aware designs and optimizations to address the aforementioned mismatch and boost end-to-end application performance. We introduce topology-aware data splitting, propose a novel communication topology, and incorporate asynchronous data movement and computation. We evaluate our design and optimizations using real and artificial datasets and compare its performance with that of the direct P2P and NCCL-based approaches. The results show that our optimizations always outperform the counterparts and achieve up to 5.13× and 1.63× communication and end-to-end application speedups, respectively.","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"514 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80097226","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 8
Athena: high-performance sparse tensor contraction sequence on heterogeneous memory Athena:异构存储器上的高性能稀疏张量收缩序列
Jiawen Liu, Dong Li, R. Gioiosa, Jiajia Li
{"title":"Athena: high-performance sparse tensor contraction sequence on heterogeneous memory","authors":"Jiawen Liu, Dong Li, R. Gioiosa, Jiajia Li","doi":"10.1145/3447818.3460355","DOIUrl":"https://doi.org/10.1145/3447818.3460355","url":null,"abstract":"Sparse tensor contraction sequence has been widely employed in many fields, such as chemistry and physics. However, how to efficiently implement the sequence faces multiple challenges, such as redundant computations and memory operations, massive memory consumption, and inefficient utilization of hardware. To address the above challenges, we introduce Athena, a high-performance framework for SpTC sequences. Athena introduces new data structures, leverages emerging Optane-based heterogeneous memory (HM) architecture, and adopts stage parallelism. In particular, Athena introduces shared hash table-represented sparse accumulator to eliminate unnecessary input processing and data migration; Athena uses a novel data-semantic guided dynamic migration solution to make the best use of the Optane-based HM for high performance; Athena also co-runs execution phases with different characteristics to enable high hardware utilization. Evaluating with 12 datasets, we show that Athena brings 327-7362× speedup over the state-of-the-art SpTC algorithm. With the dynamic data placement guided by data semantics, Athena brings performance improvement on Optane-based HM over a state-of-the-art software-based data management solution, a hardware-based data management solution, and PMM-only by 1.58×, 1.82×, and 2.34× respectively. Athena also showcases its effectiveness in quantum chemistry and physics scenarios.","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"21 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86711076","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 9
PLANAR 平面
Adrián Barredo, Adrià Armejach, J. Beard, Miquel Moretó
{"title":"PLANAR","authors":"Adrián Barredo, Adrià Armejach, J. Beard, Miquel Moretó","doi":"10.1007/978-3-642-41714-6_162253","DOIUrl":"https://doi.org/10.1007/978-3-642-41714-6_162253","url":null,"abstract":"","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"70 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83272279","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 59
Omegaflow Omegaflow
Yaoyang Zhou, Zihao Yu, Chuanqi Zhang, Yinan Xu, Huizhe Wang, Sa Wang, Ninghui Sun, Yungang Bao
{"title":"Omegaflow","authors":"Yaoyang Zhou, Zihao Yu, Chuanqi Zhang, Yinan Xu, Huizhe Wang, Sa Wang, Ninghui Sun, Yungang Bao","doi":"10.1145/3447818.3460367","DOIUrl":"https://doi.org/10.1145/3447818.3460367","url":null,"abstract":"This paper investigates how to better track and deliver dependency in dependency-based cores to exploit instruction-level parallelism (ILP) as much as possible. To this end, we first propose an analytical performance model for the state-of-art dependency-based core, Forwardflow, and figure out two vital factors affecting its upper bound of performance. Then we propose Omegaflow,a dependency-based architecture adopting three new techniques, which respond to the discovered factors. Experimental results show that Omegaflow improves IPC by 24.6% compared to the state-of-the-art design, approaching the performance of the OoO architecture with an ideal scheduler (94.4%) without increasing the clock cycle and consumes only 8.82% more energy than Forwardflow.","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"117 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81779140","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
相关产品
×
本文献相关产品
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信