Proceedings of the 2018 International Conference on Supercomputing: Latest Publications

PFault
Proceedings of the 2018 International Conference on Supercomputing. Pub Date: 2018-06-12. DOI: 10.1145/3205289.3205302
Jinrui Cao, Om Rameshwar Gatla, Mai Zheng, Dong Dai, Vidya Eswarappa, Yan Mu, Yong Chen
Abstract: High-performance parallel file systems (PFSes) are of prime importance today. However, despite the importance, their reliability is much less studied compared with that of local storage systems, largely due to the lack of an effective analysis methodology. In this paper, we introduce PFault, a general framework for analyzing the failure handling of PFSes. PFault automatically emulates the failure state of each storage device in the target PFS based on a set of well-defined fault models, and enables analyzing the recoverability of the PFS under faults systematically. To demonstrate the practicality, we apply PFault to study Lustre, one of the most widely used PFSes. Our analysis reveals a number of cases where Lustre's checking and repairing utility LFSCK fails with unexpected symptoms (e.g., I/O error, hang, reboot). Moreover, with the help of PFault, we are able to identify a resource leak problem where a portion of Lustre's internal namespace and storage space becomes unusable even after running LFSCK. On the other hand, we also verify that the latest Lustre has made noticeable improvements in failure handling compared with a previous version. We hope our study and framework can help improve PFSes for reliable high-performance computing.
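The failure-emulation loop at the heart of such a framework can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not PFault itself: zeroing a backing image is just one simple fault model, the device-image setup is hypothetical, and `run_checker` is a placeholder hook for a wrapper around a repair utility such as LFSCK.

```python
import os

def emulate_whole_device_failure(image_path: str) -> None:
    """Emulate a catastrophic device failure by overwriting the
    device's backing image with zeros. Real fault models also cover
    partial corruption, metadata-only faults, network partitions, etc."""
    size = os.path.getsize(image_path)
    with open(image_path, "r+b") as f:
        while size > 0:
            n = min(size, 1 << 20)
            f.write(b"\x00" * n)   # zero out in 1 MiB chunks
            size -= n

def check_recoverability(image_paths, run_checker):
    """Inject the fault into each device image in turn and invoke a
    user-supplied checker (e.g., a wrapper that runs the file system's
    repair utility) to record whether it recovers, errors, or hangs."""
    results = {}
    for path in image_paths:
        emulate_whole_device_failure(path)
        results[path] = run_checker(path)  # hypothetical hook
    return results
```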
Citations: 24
Classification-Driven Search for Effective SM Partitioning in Multitasking GPUs
Proceedings of the 2018 International Conference on Supercomputing. Pub Date: 2018-06-12. DOI: 10.1145/3205289.3205311
Xia Zhao, Zhiying Wang, L. Eeckhout
Abstract: Graphics processing units (GPUs) feature an increasing number of streaming multiprocessors (SMs) with each successive generation. At the same time, GPUs are increasingly widely adopted in cloud services and data centers to accelerate general-purpose workloads. Running multiple applications on a GPU in such environments requires effective multitasking support. Spatial multitasking, in which independent applications co-execute on different sets of SMs, is a promising solution for sharing GPU resources. Unfortunately, how to effectively partition SMs is an open problem. In this paper, we observe that, compared to widely used even partitioning, dynamic SM partitioning based on the characteristics of the co-executing applications can significantly improve performance and power efficiency. However, finding an effective SM partition is challenging because the number of possible combinations increases exponentially with the number of SMs and co-executing applications. Through offline analysis, we find that first classifying workloads and then searching for an effective SM partition based on the workload characteristics can significantly reduce the search space, making dynamic SM partitioning tractable. Based on these insights, we propose Classification-Driven search (CD-search) for low-overhead dynamic SM partitioning in multitasking GPUs. CD-search first classifies workloads using a novel off-SM bandwidth model, after which it enters the performance mode or power mode depending on the workload's characteristics. Both modes follow a specific search strategy to quickly determine the optimum SM partition. Our evaluation shows that CD-search improves system throughput by 10.4% on average (and up to 62.9%) over even partitioning for workloads classified for the performance mode. For workloads classified for the power mode, CD-search reduces power consumption by 25% on average (and up to 41.2%). CD-search incurs limited runtime overhead.
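The search-space reduction can be illustrated with a toy two-application version: classify first with a crude bandwidth-utilization proxy, then hill-climb over split points instead of enumerating all of them. The threshold, the `offsm_bw_util` field, and the `throughput` callback are assumptions for illustration only; the real classifier is the paper's off-SM bandwidth model, and the real system measures throughput online.

```python
def classify(workload, bw_threshold=0.6):
    """Crude stand-in for the off-SM bandwidth model: workloads that
    nearly saturate off-SM bandwidth gain little from extra SMs, so
    they become candidates for the power mode."""
    return "power" if workload["offsm_bw_util"] > bw_threshold else "performance"

def cd_search(num_sms, throughput):
    """Hill-climb over split points (SMs given to the first app).
    `throughput(k)` is a hypothetical measurement callback. Starting
    from an even split and stopping at the first downhill point keeps
    the number of trials far below the num_sms - 1 full enumeration."""
    k = num_sms // 2
    best_t, best_k = throughput(k), k
    for step in (1, -1):
        j = k + step
        while 0 < j < num_sms:
            t = throughput(j)
            if t <= best_t:
                break              # first downhill point in this direction
            best_t, best_k = t, j
            j += step
    return best_k

# Toy example: a concave throughput curve peaking at 20 of 32 SMs.
print(cd_search(32, lambda k: -(k - 20) ** 2))  # -> 20
```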
Citations: 27
Zwift
Proceedings of the 2018 International Conference on Supercomputing. Pub Date: 2018-06-12. DOI: 10.1145/3205289.3205325
Feng Zhang, Jidong Zhai, Xipeng Shen, Onur Mutlu, Wenguang Chen
Abstract: Today's rapidly growing document volumes pose pressing challenges to modern document analytics frameworks, in both space usage and processing time. Recently, a promising method, called text analytics directly on compressed data (TADOC), was proposed for improving both the time and space efficiency of text analytics. The main idea of the technique is to enable direct document analytics on compressed data. This paper focuses on the programming challenges for developing efficient TADOC programs. It presents Zwift, the first programming framework for TADOC, which consists of a Domain Specific Language, a compiler and runtime, and a utility library. Experiments show that Zwift significantly improves programming productivity, while effectively unleashing the power of TADOC, producing code that reduces storage usage by 90.8% and execution time by 41.0% on six text analytics problems.
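The TADOC idea underlying Zwift can be sketched independently of its DSL: run the analytics task as a memoized traversal of a compression grammar instead of over decompressed text. The toy grammar below is a hand-made stand-in for real Sequitur output, and word count stands in for the six analytics problems; Zwift's actual DSL, compiler, and runtime are not shown.

```python
from collections import Counter
from functools import lru_cache

# Toy Sequitur-style grammar: keys are rules, other symbols are words.
RULES = {
    "R0": ["R1", "R1", "banana"],
    "R1": ["apple", "pear"],
}

@lru_cache(maxsize=None)
def word_counts(sym):
    """Count words by walking the grammar DAG, visiting each rule once,
    instead of decompressing to flat text first: the essence of running
    analytics directly on the compressed representation."""
    if sym not in RULES:               # terminal: an actual word
        return Counter([sym])
    counts = Counter()
    for child in RULES[sym]:
        counts += word_counts(child)   # reuse per-rule results
    return counts

print(word_counts("R0"))  # Counter({'apple': 2, 'pear': 2, 'banana': 1})
```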
Citations: 3
PA-SSD: A Page-Type Aware TLC SSD for Improved Write/Read Performance and Storage Efficiency
Proceedings of the 2018 International Conference on Supercomputing. Pub Date: 2018-06-12. DOI: 10.1145/3205289.3205319
Wenhui Zhang, Q. Cao, Hong Jiang, Jie Yao
Abstract: TLC flash has three types of pages, accommodating the three bits in each TLC physical cell, that exhibit very different program latencies: LSB (fast), CSB (medium), and MSB (slow). Conventional TLC SSD designs allocate pages to write requests without taking page types and their latency differences into consideration, missing an important opportunity to exploit the potential of fast writes. This paper proposes PA-SSD, a page-type aware TLC SSD design, to effectively improve overall performance by judiciously and coordinately utilizing the three types of pages on TLC flash when serving user write requests. The main idea behind PA-SSD is to coordinately allocate the same type of pages to the sub-requests of any given user write request, to mitigate the potential program latency imbalance among the sub-requests. We achieve this design goal by addressing two key research problems: (1) how to properly determine the page type for each user write request, and (2) how to allocate a physical page for each sub-request with the page type assigned in (1). For the first problem, we propose seven page-type specifying schemes and investigate their effects under different workloads. For the second problem, we redesign the page allocation strategy in TLC SSDs to uniformly and sequentially determine pages for allocation following the programming process of TLC flash. Under a wide range of workloads, our experiments show that PA-SSD accelerates both write and read performance without any sacrifice of storage capacity. In particular, our queue-depth based page-type specifying scheme improves write performance by 2.4x and read performance by 1.5x over a conventional TLC SSD.
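The queue-depth based scheme, the best performer in the abstract, can be sketched as follows. The thresholds and latency numbers are illustrative assumptions, not values from the paper; the point is the shape of the policy: one page type per user write, chosen by how deep the device queue currently is.

```python
# Illustrative TLC program latencies in microseconds; real parts differ.
PROGRAM_LATENCY_US = {"LSB": 500, "CSB": 2000, "MSB": 5500}

def specify_page_type(queue_depth: int) -> str:
    """Queue-depth based page-type choice: with a shallow queue, write
    latency is visible to the user, so spend the fast LSB pages; deep
    queues hide slow programs behind queueing, so spend CSB/MSB pages
    there and save LSB pages for latency-sensitive moments."""
    if queue_depth <= 2:
        return "LSB"
    if queue_depth <= 8:
        return "CSB"
    return "MSB"

def allocate(sub_requests, queue_depth):
    """Give every sub-request of one user write the same page type, so
    that no sub-request straggles on a slower page program than its
    siblings (the latency-imbalance point made in the abstract)."""
    ptype = specify_page_type(queue_depth)
    return [(chunk, ptype, PROGRAM_LATENCY_US[ptype]) for chunk in sub_requests]

print(allocate(["4KiB#0", "4KiB#1"], queue_depth=1))  # both on fast LSB pages
```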
Citations: 13
Optimizing Data Aggregation by Leveraging the Deep Memory Hierarchy on Large-scale Systems
Proceedings of the 2018 International Conference on Supercomputing. Pub Date: 2018-06-12. DOI: 10.1145/3205289.3205316
François Tessier, P. Gressier, V. Vishwanath
Abstract: Effective data aggregation is of paramount importance for data-centric applications, both to improve data movement for I/O and to facilitate complex workflows such as in-situ analysis and the coupling of models and data for multi-physics. A key challenge for data aggregation on current and upcoming architectures is the heterogeneity of memory and storage systems (including DRAM, MCDRAM, NVRAM, and parallel file systems). One has to take advantage of this hierarchy and the characteristics of each tier to achieve improved performance at scale. In this paper, we present a topology- and memory-aware data movement library performing data aggregation on large-scale systems. We first detail our hardware abstraction layer, which accomplishes code and performance portability across platforms. Next, we present a cost model that takes the system interconnect and memory properties into account to determine an appropriate location for aggregating data. We also describe how we implement the data aggregation mechanism in the read algorithm. Finally, we show how our approach improves data movement on a visualization cluster and a leadership-class supercomputer at up to 16K processes, using a benchmark and two typical I/O kernels. In particular, we demonstrate how our approach can decrease the I/O time of a classic workflow by 26%.
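A minimal sketch of a topology- and memory-aware placement decision, assuming the classic latency-plus-bandwidth linear cost model. The `link` function and the numbers below are made up for illustration; the paper's actual model also weighs the characteristics of each memory tier, which this sketch folds into the per-hop parameters.

```python
def transfer_cost(nbytes, latency_s, bandwidth_bps):
    """Classic linear cost model: alpha + beta * message size."""
    return latency_s + nbytes / bandwidth_bps

def best_aggregator(producers, candidates, link):
    """Pick the candidate node minimizing total aggregation cost.
    `link(src, dst)` returns (latency_s, bandwidth_bps) for a hop; in a
    real library it would come from the topology and memory layers."""
    def total(agg):
        return sum(transfer_cost(n, *link(src, agg)) for src, n in producers)
    return min(candidates, key=total)

# Toy 2-node example: aggregating on node 1 avoids moving the big buffer.
producers = [(0, 1 << 20), (1, 8 << 20)]   # (node, bytes to send)
def link(src, dst):
    return (0.0, float("inf")) if src == dst else (2e-6, 5e9)
print(best_aggregator(producers, candidates=[0, 1], link=link))  # -> 1
```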
Citations: 2
ReGraph
Proceedings of the 2018 International Conference on Supercomputing. Pub Date: 2018-06-12. DOI: 10.1145/3205289.3205292
X. Li, Mingxing Zhang, Kang Chen, Yongwei Wu
Abstract: "Think Like a Sub-Graph (TLASG)" is a philosophy proposed for guiding the design of graph-oriented programming models. As TLASG-based models allow information to flow freely inside a partition, they usually require far fewer iterations to converge than "Think Like a Vertex (TLAV)"-based models. In this paper, we further explore the idea of TLASG by enabling users to 1) proactively repartition the graph, and 2) efficiently scale down the problem's size. With these methods, our novel TLASG-based distributed graph processing system ReGraph requires even fewer iterations (typically ≤ 6) to converge, and hence achieves better performance (up to 45.4X) and scalability than existing TLAV- and TLASG-based frameworks. Moreover, we show that these optimizations can be enabled without a large change to the programming model. We also implement our novel algorithm directly on top of Spark and compare it with other Spark-based implementations, which shows that our speedup is not bound to our own platform.
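The TLASG intuition, that letting information flow freely inside a partition cuts global iterations, can be sketched with label-propagation connected components. This is a generic sub-graph-centric illustration, not ReGraph's API; `partitions` and `edges_between` are assumed inputs describing a pre-partitioned graph.

```python
def tlasg_components(partitions, edges_between):
    """Sub-graph-centric label propagation: each partition propagates
    minimum labels internally until it converges, and only then are
    boundary labels exchanged. Global supersteps drop sharply compared
    with vertex-centric one-hop-per-iteration propagation."""
    label = {v: v for part in partitions for v in part["vertices"]}
    changed = True
    while changed:
        changed = False
        for part in partitions:
            local = True               # "think like a sub-graph":
            while local:               # converge inside the partition
                local = False
                for u, v in part["edges"]:
                    m = min(label[u], label[v])
                    if label[u] != m or label[v] != m:
                        label[u] = label[v] = m
                        local = changed = True
        for u, v in edges_between:     # boundary exchange between partitions
            m = min(label[u], label[v])
            if label[u] != m or label[v] != m:
                label[u] = label[v] = m
                changed = True
    return label

parts = [
    {"vertices": [0, 1, 2], "edges": [(0, 1), (1, 2)]},
    {"vertices": [3, 4, 5], "edges": [(3, 4), (4, 5)]},
]
print(tlasg_components(parts, edges_between=[(2, 3)]))  # all labeled 0
```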
Citations: 4
Sculptor: Flexible Approximation with Selective Dynamic Loop Perforation
Proceedings of the 2018 International Conference on Supercomputing. Pub Date: 2018-06-12. DOI: 10.1145/3205289.3205317
Shikai Li, Sunghyun Park, S. Mahlke
Abstract: Loop perforation is one of the most well-known software techniques in approximate computing. It transforms loops to periodically skip subsets of their iterations. It is general, simple, and effective. However, during analysis it considers only the number of instructions to skip, not the differences between instructions and between loop iterations. Based on our observations, these differences have considerable influence on performance and accuracy. To improve traditional perforation, we introduce selective dynamic loop perforation, a general approximation technique that automatically transforms loops to skip selected instructions in selected iterations. It provides the flexibility to craft approximation strategies at the dynamic instruction level. The main challenges in selective dynamic loop perforation are how to capture the characteristics of instructions, optimize perforation strategies based on these characteristics, and minimize additional runtime overhead. In this paper, we propose several compiler optimizations to resolve these challenges, including optimized instruction-level, load-based, and store-based selective perforation, and self-directed dynamic perforation with a dynamic start and dynamic perforation rates. Across 8 applications from various domains, selective dynamic loop perforation achieves average speedups of 2.89x and 4.07x with 5% and 10% error budgets, while traditional loop perforation achieves 1.47x and 1.93x, respectively, for the same error budgets.
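The difference between traditional and selective perforation is easy to see on a toy loop with one expensive and one cheap-but-accuracy-critical instruction. Everything below is an illustrative assumption (the stand-in functions, the reuse-last-value policy); the paper's transformations operate at the compiler level on real loops, not in Python.

```python
import math

def expensive(x):   # stand-in for a costly computation
    return math.sqrt(abs(math.sin(x)))

def cheap(x):       # stand-in for cheap, accuracy-critical work
    return 0.01 * x

def traditional_perforation(xs, rate=2):
    """Skip whole iterations and extrapolate: both instructions are
    dropped in skipped iterations, costing accuracy in cheap(x) too."""
    return rate * sum(expensive(x) + cheap(x) for x in xs[::rate])

def selective_perforation(xs, rate=2):
    """Skip only the expensive instruction in perforated iterations,
    reusing its last computed value, while cheap(x) runs every time."""
    total, last = 0.0, 0.0
    for i, x in enumerate(xs):
        if i % rate == 0:
            last = expensive(x)    # selected iterations do the full work
        total += last + cheap(x)   # the cheap part is never perforated
    return total

xs = [float(i) for i in range(1000)]
exact = sum(expensive(x) + cheap(x) for x in xs)
print(exact, traditional_perforation(xs), selective_perforation(xs))
```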
Citations: 14
Isometry
Proceedings of the 2018 International Conference on Supercomputing. Pub Date: 2018-06-12. DOI: 10.1145/3205289.3205301
Zhihao Jia, Sean Treichler, G. Shipman, Patricia McCormick, A. Aiken
Abstract: Data transfers in parallel systems have a significant impact on the performance of applications. Most existing systems generally support only data transfers between memories with a direct hardware connection and have limited facilities for handling transformations to the data's layout in memory. As a result, to move data between memories that are not directly connected, higher levels of the software stack must explicitly divide a multi-hop transfer into a sequence of single-hop transfers and decide how and where to perform data layout conversions if needed. This approach results in inefficiencies, as the higher levels lack enough information to plan transfers as a whole, while the lower level that does the transfer sees only the individual single-hop requests. We present Isometry, a path-based distributed data transfer system. The Isometry path planner selects an efficient path for a transfer and submits it to the Isometry runtime, which is optimized for managing and coordinating the direct data transfers. The Isometry runtime automatically pipelines sequential direct transfers within a path and can incorporate flexible scheduling policies, such as prioritizing one transfer over another. Our evaluation shows that Isometry can speed up data transfers by up to 2.2x and reduce the completion time of high-priority transfers by up to 95% compared to the baseline Realm data transfer system. We evaluate Isometry on three benchmarks and show that Isometry reduces transfer time by up to 80% and overall completion time by up to 60%.
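Path-based planning over a memory graph can be sketched with plain Dijkstra. The node names, costs, and graph below are assumptions for illustration; the real planner also has to account for layout conversions and the pipelining that the runtime applies along the chosen path, which this sketch omits.

```python
import heapq

def plan_path(graph, src, dst):
    """Dijkstra over a memory/node graph where edge weights model
    per-hop transfer cost; returns the hop sequence a runtime could
    then pipeline. `graph[u]` maps neighbor -> cost."""
    dist, prev = {src: 0.0}, {}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue                      # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    path, node = [dst], dst
    while node != src:                    # walk predecessors back to src
        node = prev[node]
        path.append(node)
    return path[::-1]

# GPU memory on node A to GPU memory on node B: no direct link, so the
# planner routes through host DRAM on each side.
g = {"gpuA": {"dramA": 1.0}, "dramA": {"gpuA": 1.0, "dramB": 3.0},
     "dramB": {"dramA": 3.0, "gpuB": 1.0}, "gpuB": {"dramB": 1.0}}
print(plan_path(g, "gpuA", "gpuB"))  # ['gpuA', 'dramA', 'dramB', 'gpuB']
```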
Citations: 1
Reducing Data Movement on Large Shared Memory Systems by Exploiting Computation Dependencies
Proceedings of the 2018 International Conference on Supercomputing. Pub Date: 2018-06-12. DOI: 10.1145/3205289.3205310
Isaac Sánchez Barrera, Miquel Moretó, E. Ayguadé, Jesús Labarta, M. Valero, Marc Casas
Abstract: Shared memory systems are becoming increasingly complex as they typically integrate several storage devices. That brings different access latencies or bandwidth rates depending on the proximity between the cores where memory accesses are issued and the storage devices containing the requested data. In this context, techniques to manage and mitigate non-uniform memory access (NUMA) effects consist in migrating threads, memory pages, or both, and are generally applied by the system software. We propose techniques at the runtime system level to further mitigate the impact of NUMA effects on parallel applications' performance. We leverage runtime system metadata expressed in terms of a task dependency graph, where nodes are pieces of serial code and edges are control or data dependencies between them, to efficiently reduce data transfers. Our approach, based on graph partitioning, adds negligible overhead and is able to provide performance improvements up to 1.52X and average improvements of 1.12X with respect to the best state-of-the-art approach when deployed on a 288-core shared-memory system. Our approach reduces the coherence traffic by 2.28X on average with respect to the state-of-the-art.
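A minimal sketch of the underlying idea: treat the task dependency graph as a weighted graph and place tasks so that heavily shared data stays on one NUMA node. The greedy heuristic below is a deliberately simple stand-in for the paper's graph-partitioning approach; the edge weights and balance cap are illustrative assumptions.

```python
def greedy_partition(tasks, edges, num_nodes):
    """Greedy affinity partitioner: place each task on the NUMA node
    holding the most data shared with its already-placed neighbors,
    subject to a loose balance cap. Minimizing the weight of cut edges
    is what reduces cross-node data movement."""
    nbr = {t: [] for t in tasks}
    for u, v, w in edges:                  # (task, task, bytes shared)
        nbr[u].append((v, w))
        nbr[v].append((u, w))
    place, load = {}, [0] * num_nodes
    cap = len(tasks) / num_nodes * 1.1     # allow ~10% load imbalance
    for t in tasks:                        # tasks in dependency order
        score = [0.0] * num_nodes
        for v, w in nbr[t]:
            if v in place:
                score[place[v]] += w
        open_nodes = [n for n in range(num_nodes) if load[n] < cap]
        best = max(open_nodes or range(num_nodes),
                   key=lambda n: (score[n], -load[n]))
        place[t] = best
        load[best] += 1
    return place

# Two chains of data-sharing tasks end up on separate NUMA nodes.
tasks = ["a1", "a2", "a3", "b1", "b2", "b3"]
edges = [("a1", "a2", 8), ("a2", "a3", 8), ("b1", "b2", 8), ("b2", "b3", 8)]
print(greedy_partition(tasks, edges, num_nodes=2))
```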
Citations: 19
GRU
Proceedings of the 2018 International Conference on Supercomputing. Pub Date: 2018-06-12. DOI: 10.1145/3205289.3205318
Husheng Zhou, Soroush Bateni, Cong Liu
Abstract: Graphics processing units (GPUs) have been widely adopted by major cloud vendors for better performance and energy efficiency. Recent research has observed a considerable degree of redundancy in managing computation and data in many datacenters, particularly for several important categories of GPU-accelerated applications such as log mining and machine learning. In this paper, we present GRU, an ecosystem that smartly manages and shares GPU resources by exploiting redundancy. GRU transparently interprets GPU-accelerated computing requests and memoizes results for potential future reuse. To enhance reusability, GRU implements a partial result reuse idea, where GPU computation requests with different input data and even different functionality may become reusable w.r.t. each other. To guarantee the correctness of partial reuse, GRU employs a compiler-assisted approach that analyzes general data-parallel patterns that are reliable for reuse, and is capable of smartly recognizing such reusable data-parallel patterns in incoming requests. We have fully implemented GRU and conducted extensive experiments running micro-benchmarks on local machines and real-world applications, including Spark-based use cases, in an AWS cluster. Evaluation results show that GRU is effective in identifying and eliminating redundant GPU computations, achieving up to 5x (2.5x) speedup for compute-intensive (data-intensive) benchmarks. In addition, GRU-managed Spark observes a reduction of 25.3% (39.8%) on average in turnaround time (GPU occupation time) over state-of-the-art solutions.
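Result memoization with a flavor of partial reuse can be sketched for the easiest case, a pure element-wise kernel, by caching per input chunk rather than per whole request. This is an assumption-laden toy: real GPU kernels are not Python lambdas, and GRU proves reuse safety through compiler analysis of data-parallel patterns rather than assuming purity as done here.

```python
import hashlib
import pickle

_cache = {}

def _digest(obj) -> str:
    """Content digest used as part of the memoization key."""
    return hashlib.sha256(pickle.dumps(obj)).hexdigest()

def memoized_map(kernel_id, fn, chunks):
    """Partial result reuse for a pure, element-wise kernel: results are
    cached per (kernel, input-chunk), so a later request sharing only
    some chunks with an earlier one still reuses those chunks' results.
    `fn` stands in for the accelerated computation."""
    out = []
    for chunk in chunks:
        key = (kernel_id, _digest(chunk))
        if key not in _cache:
            _cache[key] = [fn(x) for x in chunk]   # the "GPU" computation
        out.extend(_cache[key])
    return out

a = memoized_map("square", lambda x: x * x, [[1, 2], [3, 4]])
b = memoized_map("square", lambda x: x * x, [[3, 4], [5, 6]])
# The second call recomputes only the [5, 6] chunk; [3, 4] is reused.
print(a, b)
```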
Citations: 1