Proceedings of the 34th ACM International Conference on Supercomputing: Latest Publications

Leveraging intra-page update diversity for mitigating write amplification in SSDs
Proceedings of the 34th ACM International Conference on Supercomputing. Pub Date: 2020-06-29. DOI: 10.1145/3392717.3392767
Imran Fareed, Mincheol Kang, Wonyoung Lee, Soontae Kim
Abstract: A solid state drive (SSD) receives requests in multiples of sectors from the host system, which are then mapped to logical pages, the basic I/O units of the flash memory. Because the SSD receives requests in sector units, the sectors in a logical page tend to exhibit diverse update frequencies. Frequent updates to some sectors of a page therefore cause the other sectors of the same page to be unnecessarily read and written to other free pages, increasing write amplification and harming flash memory lifetime. To eliminate unnecessary sector movement and reduce write amplification, we propose a sector-level classification (SLC) technique. SLC considers the diversity in the update frequencies of sectors and merges sectors with similar update frequencies to generate full, homogeneous pages. Thus, multiple update operations can be consolidated into a single flash page, reducing write amplification and increasing flash memory lifetime. SLC handles the merged sectors using the proposed shared-page mapping table (SMT), whereas pages whose sectors remain unmerged are handled by a conventional page mapping table. Despite the SMT overhead, SLC does not require excessive resources to accommodate the SMT. The capability of SLC is evaluated through a series of experiments, which yield highly encouraging results: SLC reduces flash writes, flash reads, block erasures, and flash write execution time by 42%, 23%, 45%, and 37%, respectively.
Cited by: 5
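
The core idea, that frequently updated ("hot") sectors from different logical pages can be merged into full flash pages tracked by a shared mapping table, can be illustrated with a small sketch. The threshold, page geometry, and table layout below are assumptions for illustration, not the paper's FTL design, and all names are hypothetical.

```python
# Illustrative sketch only: groups "hot" sectors from different logical pages
# into shared flash pages, so repeated updates touch one page instead of many.
# Thresholds, page geometry, and table layout are assumptions, not the paper's design.
from collections import defaultdict

SECTORS_PER_PAGE = 8
HOT_THRESHOLD = 4          # updates seen before a sector is considered hot (assumed)

update_count = defaultdict(int)   # (lpn, sector) -> number of updates observed
shared_page_table = {}            # (lpn, sector) -> (shared_page_id, slot), like the "SMT"
page_table = {}                   # lpn -> physical page, the conventional mapping
_open_shared_page = {"id": 0, "used": 0}

def on_sector_update(lpn, sector):
    """Record an update and migrate frequently updated sectors to a shared page."""
    key = (lpn, sector)
    update_count[key] += 1
    if key in shared_page_table:
        return shared_page_table[key]          # already merged: update in place
    if update_count[key] >= HOT_THRESHOLD:
        pg = _open_shared_page                 # merge this hot sector into the open shared page
        shared_page_table[key] = (pg["id"], pg["used"])
        pg["used"] += 1
        if pg["used"] == SECTORS_PER_PAGE:     # page full: open a new shared page
            _open_shared_page.update(id=pg["id"] + 1, used=0)
        return shared_page_table[key]
    return None                                # cold sector: normal page-mapping path

# Demo: the 4th update to the same sector migrates it into shared page 0, slot 0.
for _ in range(4):
    loc = on_sector_update(lpn=7, sector=3)
print(loc)
```
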
Wavefront parallelization of recurrent neural networks on multi-core architectures
Proceedings of the 34th ACM International Conference on Supercomputing. Pub Date: 2020-06-29. DOI: 10.1145/3392717.3392762
Robin Kumar Sharma, Marc Casas
Abstract: Recurrent neural networks (RNNs) are widely used for natural language processing, time-series prediction, and text analysis tasks. The data and control dependencies across the fundamental numerical kernels of RNN inference and training complicate the exploitation of model parallelism, which is why only data parallelism has traditionally been applied to accelerate RNNs. This paper presents W-Par (Wavefront-Parallelization), a comprehensive approach for RNN inference and training on CPUs that applies model parallelism to RNN models. We use fine-grained pipeline parallelism in the form of wavefront computations to accelerate multi-layer RNNs running on multi-core CPUs. Wavefront computations have been widely applied in many scientific computing domains, such as stencil kernels and dynamic programming. W-Par divides RNN workloads across different parallel tasks by defining input and output dependencies for each RNN cell. Our experiments with different RNN models demonstrate that W-Par achieves up to a 6.6x speed-up for RNN inference and training compared to current state-of-the-art implementations on modern multi-core CPU architectures. Importantly, W-Par maximizes performance across a wide range of scenarios, including different core counts and memory hierarchy configurations, without requiring any change at the source code level.
Cited by: 1
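
The wavefront pattern W-Par exploits can be sketched as follows: an RNN cell at (layer l, timestep t) depends on the cell below it (l-1, t) and the cell before it (l, t-1), so all cells on one anti-diagonal are independent and can run in parallel. The thread-pool runtime and the toy cell function below are placeholders, not W-Par's implementation.

```python
# Illustrative sketch of the wavefront dependency pattern for a multi-layer RNN:
# cell (layer l, timestep t) needs (l-1, t) and (l, t-1); all cells on one
# anti-diagonal (l + t = const) are independent and can run in parallel.
from concurrent.futures import ThreadPoolExecutor

def run_wavefront(num_layers, num_steps, cell_fn, max_workers=8):
    done = {}                                    # (l, t) -> cell output
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for wave in range(num_layers + num_steps - 1):
            cells = [(l, wave - l) for l in range(num_layers) if 0 <= wave - l < num_steps]
            futures = {
                (l, t): pool.submit(cell_fn, l, t, done.get((l - 1, t)), done.get((l, t - 1)))
                for (l, t) in cells
            }
            for key, fut in futures.items():     # wave barrier: wait before the next diagonal
                done[key] = fut.result()
    return done

# Toy cell: combines the outputs of its two predecessors.
out = run_wavefront(3, 5, lambda l, t, below, prev: (below or 0) + (prev or 0) + 1)
print(out[(2, 4)])
```
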
Accelerating relax-ordered task-parallel workloads using multi-level dependency checking
Proceedings of the 34th ACM International Conference on Supercomputing. Pub Date: 2020-06-29. DOI: 10.1145/3392717.3392758
Masab Ahmad, Mohsin Shan, Akif Rehman, O. Khan
Abstract: Work-efficient task-parallel algorithms enforce ordered execution of tasks using priority schedulers. These algorithms suffer from limited parallelism due to data movement and synchronization bottlenecks. State-of-the-art priority schedulers relax the ordering of tasks to avoid false dependencies generated by strict queuing constraints, thus unlocking task parallelism. However, relaxing task dependencies results in shared data races among cores that lead to redundant task computations in concurrently executing threads. Although static algorithm optimizations have been shown to reduce redundant work, they do not exploit the tradeoff between parallelism and work efficiency that is only exposed at runtime. This paper proposes a task dependency checking mechanism that dynamically tracks the monotonic property of parent-child relationships across multiple levels from any given task. Since shared memory writes are known to be slower than concurrent reads, the multi-level checks effectively detect task dependency races and prune redundant tasks. Evaluation of relax-ordered algorithms on a 40-core Intel Xeon multicore shows an average 44% performance improvement over the Galois obim scheduler.
Cited by: 2
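
The dependency-checking idea can be illustrated on a relaxed-order, SSSP-style workload: before a dequeued task runs, it is checked against the current state of its own node (one level) and of the parent that generated it (a second level), and pruned if either has since improved. This is only a conceptual sketch in software; the paper's mechanism, number of levels, and runtime integration are not reproduced here.

```python
# Illustrative sketch of pruning stale tasks under a relaxed-order scheduler,
# using an SSSP-style workload. The "multi-level" idea is approximated by also
# checking the parent task's value; data structures here are placeholders.
import heapq

def relaxed_sssp(graph, src):
    """graph: {u: [(v, w), ...]}. Returns shortest distances from src."""
    dist = {u: float("inf") for u in graph}
    dist[src] = 0
    # task = (priority, node, dist_at_creation, parent, parent_dist_at_creation)
    pq = [(0, src, 0, None, 0)]
    while pq:
        _, u, d_u, parent, d_parent = heapq.heappop(pq)
        # Level-1 check: the task itself is stale if a better distance exists now.
        if d_u > dist[u]:
            continue
        # Level-2 check: the parent has improved since this task was created, so a
        # better version of this task has already been generated; prune as redundant.
        if parent is not None and dist[parent] < d_parent:
            continue
        for v, w in graph[u]:
            nd = d_u + w
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(pq, (nd, v, nd, u, d_u))
    return dist

print(relaxed_sssp({"a": [("b", 2), ("c", 5)], "b": [("c", 1)], "c": []}, "a"))
```
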
Tuning applications for efficient GPU offloading to in-memory processing
Proceedings of the 34th ACM International Conference on Supercomputing. Pub Date: 2020-06-29. DOI: 10.1145/3392717.3392760
Yudong Wu, Mingyao Shen, Yi-Hui Chen, Yuanyuan Zhou
Abstract: Data movement between processors and main memory is a critical bottleneck for data-intensive applications. This problem is even more severe for Graphics Processing Unit (GPU) applications because of their massively parallel data processing characteristics. Recent research has shown that in-memory processing can greatly alleviate this data movement bottleneck by reducing traffic between GPUs and memory devices: it offloads execution to in-memory processors and avoids transferring enormous amounts of data between memory devices and processors. However, while in-memory processing is promising, several issues must be solved to take full advantage of such an architecture. For example, conventional GPU application code that is highly optimized for locality, so that it executes efficiently on the GPU, does not necessarily exhibit good locality for in-memory processing; as a result, the GPU may mistakenly offload application routines that cannot benefit from in-memory processing. Additionally, workload balancing cannot simply treat in-memory processors as GPU processors, since their data transfer time can be significantly lower. Finally, how to offload application routines that access the shared memory inside GPUs remains an unsolved issue. In this paper, we explore four optimizations that let GPU applications take advantage of in-memory processors: application restructuring, run-time adaptation, aggressive loop offloading, and on-demand shared-memory transfer, which together mitigate the four unsolved issues in the GPU in-memory processing system. In experimental evaluations with 13 applications, our approach achieves a 2.23x offloading performance improvement.
Cited by: 5
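
A minimal sketch of what a run-time adaptation decision might look like: offload a routine to in-memory processors only when the avoided data movement outweighs the slower in-memory compute. The cost model and the constants are illustrative assumptions, not the paper's tuned model.

```python
# Illustrative sketch of a run-time offload decision between GPU execution and
# in-memory (PIM) execution. All throughput constants are assumptions.
def should_offload_to_pim(bytes_moved, flops,
                          gpu_flops_per_s=10e12, pim_flops_per_s=1e12,
                          link_bytes_per_s=500e9):
    gpu_time = flops / gpu_flops_per_s + bytes_moved / link_bytes_per_s   # compute + transfer
    pim_time = flops / pim_flops_per_s                                    # compute near memory, no transfer
    return pim_time < gpu_time

# Memory-bound routine: lots of traffic, little compute, so offloading wins.
print(should_offload_to_pim(bytes_moved=8e9, flops=2e9))    # True
# Compute-bound routine: keep it on the GPU.
print(should_offload_to_pim(bytes_moved=1e6, flops=5e12))   # False
```
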
CodeSeer
Proceedings of the 34th ACM International Conference on Supercomputing. Pub Date: 2020-06-29. DOI: 10.1145/3392717.3392741
Tao Wang, Nikhil Jain, David Boehme, D. Beckingsale, F. Mueller, T. Gamblin
Abstract: In high performance computing (HPC), scientific simulation codes are executed repeatedly with different inputs. The peak performance of these programs depends heavily on various compiler optimizations, which are often selected agnostically of program input or with sensitivity to just a single input. When such programs are subsequently executed, often with different inputs, performance may suffer for all inputs, or for all but the one input tested; in the latter case, performance can even fall below the -O3 baseline. This work proposes a new auto-tuning framework, CodeSeer, to assess and improve existing input-agnostic or single-input-centric rigid application tuning methods. Aided by CodeSeer, we observe that modern HPC programs expose different types of input sensitivities, which present a significant challenge for prior work. To tackle this problem, CodeSeer uses several machine learning models to predict the best per-input code variant on-the-fly. Our evaluation shows that CodeSeer incurs less than 0.01 seconds of overhead, predicts the best code variant with a geometric mean precision of 92%, and is capable of improving per-input peak performance to unprecedented levels.
Cited by: 1
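
A minimal sketch of the per-input variant-selection idea: profile training inputs offline under each precompiled variant, fit a small classifier from input features to the measured winner, and dispatch the predicted variant at run time. The features, model, and variant names below are placeholders; CodeSeer's actual features, models, and training pipeline are described in the paper.

```python
# Illustrative sketch of per-input code-variant selection: a small classifier maps
# lightweight input features to the best precompiled variant, which is then dispatched.
from sklearn.tree import DecisionTreeClassifier

# Offline: profile each training input under every variant and keep the winner (toy data).
train_features = [[1e3, 0.1], [1e6, 0.9], [1e4, 0.5], [1e7, 0.2]]   # e.g. [size, sparsity]
best_variant   = ["O3", "vectorized", "O3", "tiled"]                 # measured winners
model = DecisionTreeClassifier().fit(train_features, best_variant)

variants = {"O3": lambda x: x, "vectorized": lambda x: x, "tiled": lambda x: x}  # stand-in kernels

def run(input_features, data):
    choice = model.predict([input_features])[0]    # on-the-fly prediction per input
    return variants[choice](data)

print(model.predict([[5e6, 0.85]])[0])
```
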
SB-Fetch: synchronization aware hardware prefetching for chip multiprocessors
Proceedings of the 34th ACM International Conference on Supercomputing. Pub Date: 2020-06-29. DOI: 10.1145/3392717.3392735
Laith M. AlBarakat, Paul V. Gratz, Daniel A. Jiménez
Abstract: Shared-memory, multi-threaded applications often require programmers to insert thread synchronization primitives (i.e., locks, barriers, and condition variables) in critical sections to synchronize data access between processes. Scaling performance requires balanced per-thread workloads with little time spent in critical sections. In practice, however, threads often waste time waiting to acquire locks or barriers, leading to thread imbalance and poor performance scaling. Moreover, critical sections often stall data prefetchers that could mitigate the effects of waiting by ensuring data is preloaded in core caches when the critical section completes. This paper introduces a pure hardware technique to enable safe data prefetching beyond synchronization points in chip multiprocessors (CMPs). We show that successful prefetching beyond synchronization points requires overcoming two significant challenges in existing techniques. First, typical prefetchers are designed to trigger prefetches based on current misses; unlike a core in a single-threaded application, a multi-threaded core stalled on a synchronization point does not produce new references to trigger a prefetcher. Second, even if a prefetch were correctly directed to read beyond a synchronization point, it would likely prefetch shared data from another core before that data has been written. Such a prefetch would be considered "accurate" but is highly undesirable, because it leads to three extra "ping-pong" movements due to coherence, costing more latency and energy than not prefetching at all. We develop a new data prefetcher, Synchronization-aware B-Fetch (SB-Fetch), built as an extension to a previous single-threaded data prefetcher. SB-Fetch addresses both issues for shared-memory multi-threaded workloads. The novelty in SB-Fetch is that it explicitly issues prefetches for data beyond synchronization points and distinguishes between data likely and unlikely to incur cache coherence overhead. These two features are directly synergistic, since blindly prefetching beyond synchronization points is likely to incur coherence penalties; no prior work includes both. SB-Fetch is evaluated using a representative set of benchmarks from Parsec [4], Rodinia [7], and Parboil [39]. SB-Fetch improves execution time by 12.3% over the baseline and 4% over best-in-class prefetching.
Cited by: 2
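
A rough software model of the two ideas, triggering prefetches on a synchronization stall rather than on a miss, and skipping lines likely held dirty by another core, is sketched below. The prediction input and policies are placeholders; the actual SB-Fetch design is a hardware extension of an existing prefetcher.

```python
# Illustrative software model only: (1) issue prefetches when a core is stalled on a
# synchronization point rather than on a miss, and (2) skip lines recently written by
# another core to avoid coherence ping-pong. Not the hardware design.
def prefetch_on_sync_stall(core_id, predicted_lines, last_writer, prefetch_issue):
    """predicted_lines: cache lines expected to be touched after the sync point.
    last_writer: {line: core id of most recent writer} (directory-style info).
    prefetch_issue: callback that models issuing one prefetch."""
    issued = []
    for line in predicted_lines:
        writer = last_writer.get(line)
        if writer is not None and writer != core_id:
            continue          # likely still dirty in another core's cache: prefetching it
                              # now would just bounce the line back and forth
        prefetch_issue(line)
        issued.append(line)
    return issued

issued = prefetch_on_sync_stall(
    core_id=0,
    predicted_lines=[0x100, 0x140, 0x180],
    last_writer={0x140: 3},          # line 0x140 was last written by core 3
    prefetch_issue=lambda line: None,
)
print([hex(line) for line in issued])   # 0x140 is skipped
```
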
TensorSVM
Proceedings of the 34th ACM International Conference on Supercomputing. Pub Date: 2020-06-29. DOI: 10.1145/3392717.3392770
Shaoshuai Zhang, Ruchi Shah, Panruo Wu
Abstract: This paper explores the use of tensor engines to accelerate nonlinear and linear SVM training. The Support Vector Machine (SVM) is a classical machine learning model for classification and regression and remains the state-of-the-art model for some tasks, such as text classification and bioinformatics. However, large-scale SVM training is still challenging because of its high computational complexity; this is especially severe for nonlinear SVMs with kernel tricks. At the same time, the surging importance of neural networks has fueled the emergence of specialized processors called tensor units (TensorCores in GPUs and Google's Tensor Processing Unit), which are characterized by extreme efficiency and very limited precision and range. This paper proposes a TensorCore-GPU-based SVM algorithm and software system that is faster and more scalable than state-of-the-art SVM solvers. It includes a fast, accurate low-rank Gram matrix approximation that effectively utilizes the TensorCores in a GPU, and a primal-dual interior-point method that solves the quadratic program with a fast and predictable convergence rate. The random-projection-based Gram matrix approximation can be substantially accelerated by TensorCores on the GPU. This exploration ends up as a tale of randomized numerical linear algebra, convex optimization, and high performance computing on tensor engines. In particular, this paper suggests that emerging randomized numerical linear algebra algorithms and tensor engines are synergistic, opening up exciting new application areas that include statistical machine learning and wider scientific/engineering computing.
Cited by: 3
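
One standard way to realize a random-projection-based low-rank Gram approximation that maps well onto low-precision tensor hardware is random Fourier features: the Gram matrix is approximated by a single Z @ Z.T product, which below is emulated in float16 to mimic TensorCore-style arithmetic. This is an illustration of the general idea, not a claim about TensorSVM's exact scheme.

```python
# Illustrative sketch: random Fourier features Z approximate an RBF kernel's Gram
# matrix via one Z @ Z.T product, emulated here in float16 (TensorCore-like precision).
import numpy as np

def rff_gram(X, num_features=256, gamma=0.5, rng=np.random.default_rng(0)):
    n, d = X.shape
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, num_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
    Z = np.sqrt(2.0 / num_features) * np.cos(X @ W + b)       # n x r feature map
    Zh = Z.astype(np.float16)                                  # emulate low-precision inputs
    return (Zh @ Zh.T).astype(np.float32)                      # low-rank Gram approximation

X = np.random.default_rng(1).normal(size=(100, 8))
G_approx = rff_gram(X)
G_exact = np.exp(-0.5 * np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
print(np.abs(G_approx - G_exact).max())    # approximation error; shrinks as num_features grows
```
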
Global link arrangement for practical Dragonfly
Proceedings of the 34th ACM International Conference on Supercomputing. Pub Date: 2020-06-29. DOI: 10.1145/3392717.3392756
Zaid Alzaid, Saptarshi Bhowmik, Xin Yuan, M. Lang
Abstract: The Dragonfly network organizes routers into groups, with connectivity within each group provided by local links and connectivity between groups provided by global links. The Dragonfly specification leaves many options for arranging global links. In this work, we study global link arrangement for practical Dragonfly topologies where (1) there are multiple global links connecting each pair of groups, and (2) the global link bandwidth is similar to the local link bandwidth. We find that existing global link arrangement schemes, such as the absolute, relative, and circulant-based arrangements, do not specify an important component of global connectivity for practical Dragonfly, which we call the per-router arrangement. The per-router arrangement determines how the global links from each individual router are connected. We integrate per-router arrangement into existing schemes, develop a unified algorithm to compute a large class of global link arrangements for practical Dragonfly, and carry out an extensive simulation study to evaluate different global link arrangement schemes. Our results indicate that the existing understanding of global link arrangement does not apply to practical Dragonfly: contrary to the view that global link arrangement makes no significant performance difference when global links have bandwidth similar to local links, the per-router arrangement significantly impacts the network performance of practical Dragonfly. We identify the schemes that yield high performance for practical Dragonfly.
Cited by: 2
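
For context, one common formulation of the classic "absolute" arrangement is sketched below for a canonical Dragonfly with one global link per group pair: global port j of a group connects to group j, skipping the group itself. As the paper argues, this fixes only the group-level arrangement and leaves the per-router arrangement unspecified once groups are joined by multiple links; the parameter names here are illustrative.

```python
# Illustrative sketch of the "absolute" global-link arrangement for a Dragonfly with
# g groups, a routers per group, and h global ports per router (g = a*h + 1 in the
# canonical single-link-per-group-pair configuration). One common formulation only.
def absolute_arrangement(g, a, h):
    """Return {(group, router, port): peer_group} for the absolute arrangement."""
    assert g == a * h + 1, "canonical Dragonfly: one global link between each group pair"
    conn = {}
    for grp in range(g):
        for r in range(a):
            for p in range(h):
                j = r * h + p                         # global port index within the group
                peer = j if j < grp else j + 1        # skip the local group
                conn[(grp, r, p)] = peer
    return conn

conn = absolute_arrangement(g=9, a=4, h=2)
print(conn[(0, 0, 0)], conn[(3, 1, 1)])    # which groups these two ports reach
```
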
Fast, accurate, and scalable memory modeling of GPGPUs using reuse profiles
Proceedings of the 34th ACM International Conference on Supercomputing. Pub Date: 2020-06-29. DOI: 10.1145/3392717.3392761
Yehia Arafa, Abdel-Hameed A. Badawy, Gopinath Chennupati, Atanu Barai, N. Santhi, S. Eidenbenz
Abstract: In this paper, we introduce PPT-GPU-Mem, an accurate and scalable memory modeling framework for general-purpose graphics processing units (GPGPUs); the name stands for Performance Prediction Toolkit for GPU Cache Memories. PPT-GPU-Mem predicts the performance of different GPUs' cache memory hierarchies (L1 and L2) based on reuse profiles. We extract a memory trace for each GPU kernel once in its lifetime using the recently released binary instrumentation tool NVBIT. The memory trace extraction is architecture-independent and can be done on any available NVIDIA GPU. PPT-GPU-Mem can then model any NVIDIA GPU's caches given their parameters and the extracted memory trace. We model the Volta Tesla V100 and Turing TITAN RTX and validate our framework using different kernels from the Polybench and Rodinia benchmark suites, in addition to two deep learning applications from the Tango DNN benchmark suite. We provide two models, MBRDP (Multiple Block Reuse Distance Profile) and OBRDP (One Block Reuse Distance Profile), with varying assumptions, accuracy, and speed. Our accuracy ranges from 92% to 99% for the different cache levels compared to real hardware, while maintaining scalability in producing the results. Finally, we illustrate that PPT-GPU-Mem can be used for design space exploration and for predicting the cache performance of future GPUs.
Cited by: 11
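
The reuse-distance idea underlying trace-based cache modeling can be sketched briefly: for each access, the stack distance is the number of distinct cache lines touched since the previous access to the same line, and under a fully associative LRU cache of C lines the access hits iff that distance is below C. PPT-GPU-Mem's MBRDP/OBRDP models, set associativity, and L1/L2 handling are more involved; the trace and line size below are assumptions.

```python
# Illustrative sketch of reuse (stack) distance based hit-ratio estimation for a
# fully associative LRU cache. Trace and line size are assumed example values.
from collections import OrderedDict

def reuse_distance_hit_ratio(trace, cache_lines, line_bytes=128):
    stack = OrderedDict()          # most-recently-used line is last
    hits = 0
    for addr in trace:
        line = addr // line_bytes
        if line in stack:
            distance = len(stack) - 1 - list(stack).index(line)   # distinct lines since last use
            if distance < cache_lines:
                hits += 1
            stack.move_to_end(line)
        else:
            stack[line] = True
    return hits / len(trace)

trace = [0, 128, 256, 0, 4096, 128, 0]       # byte addresses from an (assumed) kernel trace
print(reuse_distance_hit_ratio(trace, cache_lines=2),
      reuse_distance_hit_ratio(trace, cache_lines=4))   # larger cache captures more reuses
```
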
What every scientific programmer should know about compiler optimizations?
Proceedings of the 34th ACM International Conference on Supercomputing. Pub Date: 2020-06-29. DOI: 10.1145/3392717.3392754
Jialiang Tan, Shuyin Jiao, Milind Chabbi, Xu Liu
Abstract: Compilers are an indispensable component of the software stack. Besides generating machine code, compilers perform multiple optimizations to improve code performance. Typically, scientific programmers treat compilers as a black box and expect them to optimize code thoroughly. However, optimizing compilers are not a performance panacea: they can miss optimization opportunities or even introduce inefficiencies that are not in the source code. There is a lack of tool infrastructures and datasets that can support such a study and help understand compiler optimizations. In this paper, we investigate an important compiler optimization, dead and redundant operation elimination. We first develop a tool, CIDetector, to analyze a large number of programs. From this analysis, we select 12 representative programs from different domains to form a dataset called CIBench. We use five compilers to optimize CIBench with the highest optimization options available and leverage CIDetector to study each generated binary. We provide insights into two aspects. First, we show that modern compilers miss several optimization opportunities and, in fact, even introduce some inefficiencies, which require programmers to refactor the source code. Second, we show how compilers have advanced both vertically (the same compiler across release versions) and horizontally (different compilers at their most recent releases). With these empirical studies, we provide insights for software engineers, compiler writers, and tool developers.
Cited by: 12
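
The two inefficiencies studied, dead writes (a store overwritten before it is read) and redundant loads (a re-read of a value that cannot have changed), can be illustrated with a toy trace-level detector. CIDetector itself analyzes real binaries; the sketch below only demonstrates the definitions on an assumed (operation, location) trace.

```python
# Conceptual toy detector for two kinds of inefficiency: a "dead write" stores to a
# location that is overwritten before being read, and a "redundant load" re-reads a
# location whose value cannot have changed (single-threaded view). Not CIDetector.
def find_dead_and_redundant(trace):
    dead_writes, redundant_loads = [], []
    last_write = {}            # loc -> index of most recent write not yet read
    clean_since_load = set()   # locs loaded and not written since
    for i, (op, loc) in enumerate(trace):
        if op == "write":
            if loc in last_write:
                dead_writes.append(last_write[loc])   # previous write was never read
            last_write[loc] = i
            clean_since_load.discard(loc)
        elif op == "read":
            if loc in clean_since_load:
                redundant_loads.append(i)             # value unchanged since the last read
            last_write.pop(loc, None)
            clean_since_load.add(loc)
    return dead_writes, redundant_loads

trace = [("write", "x"), ("write", "x"),   # first write to x is dead
         ("read", "y"), ("read", "y")]     # second read of y is redundant
print(find_dead_and_redundant(trace))      # ([0], [3])
```
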