Proceedings of the 2018 International Conference on Supercomputing: Latest Publications

PFault
Proceedings of the 2018 International Conference on Supercomputing. Pub Date: 2018-06-12. DOI: 10.1145/3205289.3205302
Jinrui Cao, Om Rameshwar Gatla, Mai Zheng, Dong Dai, Vidya Eswarappa, Yan Mu, Yong Chen
Abstract: High-performance parallel file systems (PFSes) are of prime importance today. However, despite the importance, their reliability is much less studied compared with that of local storage systems, largely due to the lack of an effective analysis methodology. In this paper, we introduce PFault, a general framework for analyzing the failure handling of PFSes. PFault automatically emulates the failure state of each storage device in the target PFS based on a set of well-defined fault models, and enables analyzing the recoverability of the PFS under faults systematically. To demonstrate the practicality, we apply PFault to study Lustre, one of the most widely used PFSes. Our analysis reveals a number of cases where Lustre's checking and repairing utility LFSCK fails with unexpected symptoms (e.g., I/O error, hang, reboot). Moreover, with the help of PFault, we are able to identify a resource leak problem where a portion of Lustre's internal namespace and storage space becomes unusable even after running LFSCK. On the other hand, we also verify that the latest Lustre has made noticeable improvements in failure handling compared with a previous version. We hope our study and framework can help improve PFSes for reliable high-performance computing.
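The failure-emulation loop at the heart of such a framework can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not PFault itself: zeroing a backing image is just one simple fault model, the device-image setup is hypothetical, and `run_checker` is a placeholder hook for a wrapper around a repair utility such as LFSCK.

```python
import os

def emulate_whole_device_failure(image_path: str) -> None:
    """Emulate a catastrophic device failure by overwriting the
    device's backing image with zeros. Real fault models also cover
    partial corruption, metadata-only faults, network partitions, etc."""
    size = os.path.getsize(image_path)
    with open(image_path, "r+b") as f:
        while size > 0:
            n = min(size, 1 << 20)
            f.write(b"\x00" * n)   # zero out in 1 MiB chunks
            size -= n

def check_recoverability(image_paths, run_checker):
    """Inject the fault into each device image in turn and invoke a
    user-supplied checker (e.g., a wrapper that runs the file system's
    repair utility) to record whether it recovers, errors, or hangs."""
    results = {}
    for path in image_paths:
        emulate_whole_device_failure(path)
        results[path] = run_checker(path)  # hypothetical hook
    return results
```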
Citations: 24
Classification-Driven Search for Effective SM Partitioning in Multitasking GPUs
Proceedings of the 2018 International Conference on Supercomputing. Pub Date: 2018-06-12. DOI: 10.1145/3205289.3205311
Xia Zhao, Zhiying Wang, L. Eeckhout
Abstract: Graphics processing units (GPUs) feature an increasing number of streaming multiprocessors (SMs) with each successive generation. At the same time, GPUs are increasingly widely adopted in cloud services and data centers to accelerate general-purpose workloads. Running multiple applications on a GPU in such environments requires effective multitasking support. Spatial multitasking, in which independent applications co-execute on different sets of SMs, is a promising solution for sharing GPU resources. Unfortunately, how to effectively partition SMs is an open problem. In this paper, we observe that, compared to widely used even partitioning, dynamic SM partitioning based on the characteristics of the co-executing applications can significantly improve performance and power efficiency. However, finding an effective SM partition is challenging because the number of possible combinations increases exponentially with the number of SMs and co-executing applications. Through offline analysis, we find that first classifying workloads and then searching for an effective SM partition based on the workload characteristics can significantly reduce the search space, making dynamic SM partitioning tractable. Based on these insights, we propose Classification-Driven search (CD-search) for low-overhead dynamic SM partitioning in multitasking GPUs. CD-search first classifies workloads using a novel off-SM bandwidth model, after which it enters the performance mode or power mode depending on the workload's characteristics. Both modes follow a specific search strategy to quickly determine the optimum SM partition. Our evaluation shows that CD-search improves system throughput by 10.4% on average (and up to 62.9%) over even partitioning for workloads classified for the performance mode. For workloads classified for the power mode, CD-search reduces power consumption by 25% on average (and up to 41.2%). CD-search incurs limited runtime overhead.
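The search-space reduction can be illustrated with a toy two-application version: classify first with a crude bandwidth-utilization proxy, then hill-climb over split points instead of enumerating all of them. The threshold, the `offsm_bw_util` field, and the `throughput` callback are assumptions for illustration only; the real classifier is the paper's off-SM bandwidth model, and the real system measures throughput online.

```python
def classify(workload, bw_threshold=0.6):
    """Crude stand-in for the off-SM bandwidth model: workloads that
    nearly saturate off-SM bandwidth gain little from extra SMs, so
    they become candidates for the power mode."""
    return "power" if workload["offsm_bw_util"] > bw_threshold else "performance"

def cd_search(num_sms, throughput):
    """Hill-climb over split points (SMs given to the first app).
    `throughput(k)` is a hypothetical measurement callback. Starting
    from an even split and stopping at the first downhill point keeps
    the number of trials far below the num_sms - 1 full enumeration."""
    k = num_sms // 2
    best_t, best_k = throughput(k), k
    for step in (1, -1):
        j = k + step
        while 0 < j < num_sms:
            t = throughput(j)
            if t <= best_t:
                break              # first downhill point in this direction
            best_t, best_k = t, j
            j += step
    return best_k

# Toy example: a concave throughput curve peaking at 20 of 32 SMs.
print(cd_search(32, lambda k: -(k - 20) ** 2))  # -> 20
```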
Citations: 27
Zwift
Proceedings of the 2018 International Conference on Supercomputing. Pub Date: 2018-06-12. DOI: 10.1145/3205289.3205325
Feng Zhang, Jidong Zhai, Xipeng Shen, Onur Mutlu, Wenguang Chen
Abstract: Today's rapidly growing document volumes pose pressing challenges to modern document analytics frameworks, in both space usage and processing time. Recently, a promising method, called text analytics directly on compressed data (TADOC), was proposed for improving both the time and space efficiency of text analytics. The main idea of the technique is to enable direct document analytics on compressed data. This paper focuses on the programming challenges for developing efficient TADOC programs. It presents Zwift, the first programming framework for TADOC, which consists of a Domain Specific Language, a compiler and runtime, and a utility library. Experiments show that Zwift significantly improves programming productivity, while effectively unleashing the power of TADOC, producing code that reduces storage usage by 90.8% and execution time by 41.0% on six text analytics problems.
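The TADOC idea underlying Zwift can be sketched independently of its DSL: run the analytics task as a memoized traversal of a compression grammar instead of over decompressed text. The toy grammar below is a hand-made stand-in for real Sequitur output, and word count stands in for the six analytics problems; Zwift's actual DSL, compiler, and runtime are not shown.

```python
from collections import Counter
from functools import lru_cache

# Toy Sequitur-style grammar: keys are rules, other symbols are words.
RULES = {
    "R0": ["R1", "R1", "banana"],
    "R1": ["apple", "pear"],
}

@lru_cache(maxsize=None)
def word_counts(sym):
    """Count words by walking the grammar DAG, visiting each rule once,
    instead of decompressing to flat text first: the essence of running
    analytics directly on the compressed representation."""
    if sym not in RULES:               # terminal: an actual word
        return Counter([sym])
    counts = Counter()
    for child in RULES[sym]:
        counts += word_counts(child)   # reuse per-rule results
    return counts

print(word_counts("R0"))  # Counter({'apple': 2, 'pear': 2, 'banana': 1})
```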
Citations: 3
PA-SSD: A Page-Type Aware TLC SSD for Improved Write/Read Performance and Storage Efficiency
Proceedings of the 2018 International Conference on Supercomputing. Pub Date: 2018-06-12. DOI: 10.1145/3205289.3205319
Wenhui Zhang, Q. Cao, Hong Jiang, Jie Yao
Abstract: TLC flash has three types of pages, accommodating the three bits in each TLC physical cell, that exhibit very different program latencies: LSB (fast), CSB (medium), and MSB (slow). Conventional TLC SSD designs allocate pages to write requests without taking page types and their latency differences into consideration, missing an important opportunity to exploit the potential of fast writes. This paper proposes PA-SSD, a page-type aware TLC SSD design, to effectively improve overall performance by judiciously and coordinately utilizing the three types of pages on TLC flash when serving user write requests. The main idea behind PA-SSD is to coordinately allocate the same type of pages to the sub-requests of any given user write request, to mitigate the potential program latency imbalance among the sub-requests. We achieve this design goal by addressing two key research problems: (1) how to properly determine the page type for each user write request, and (2) how to allocate a physical page for each sub-request with the page type assigned in (1). For the first problem, we propose seven page-type specifying schemes and investigate their effects under different workloads. For the second problem, we redesign the page allocation strategy in TLC SSDs to uniformly and sequentially determine pages for allocation following the programming process of TLC flash. Under a wide range of workloads, our experiments show that PA-SSD accelerates both write and read performance without any sacrifice of storage capacity. In particular, our queue-depth based page-type specifying scheme improves write performance by 2.4x and read performance by 1.5x over a conventional TLC SSD.
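The queue-depth based scheme, the best performer in the abstract, can be sketched as follows. The thresholds and latency numbers are illustrative assumptions, not values from the paper; the point is the shape of the policy: one page type per user write, chosen by how deep the device queue currently is.

```python
# Illustrative TLC program latencies in microseconds; real parts differ.
PROGRAM_LATENCY_US = {"LSB": 500, "CSB": 2000, "MSB": 5500}

def specify_page_type(queue_depth: int) -> str:
    """Queue-depth based page-type choice: with a shallow queue, write
    latency is visible to the user, so spend the fast LSB pages; deep
    queues hide slow programs behind queueing, so spend CSB/MSB pages
    there and save LSB pages for latency-sensitive moments."""
    if queue_depth <= 2:
        return "LSB"
    if queue_depth <= 8:
        return "CSB"
    return "MSB"

def allocate(sub_requests, queue_depth):
    """Give every sub-request of one user write the same page type, so
    that no sub-request straggles on a slower page program than its
    siblings (the latency-imbalance point made in the abstract)."""
    ptype = specify_page_type(queue_depth)
    return [(chunk, ptype, PROGRAM_LATENCY_US[ptype]) for chunk in sub_requests]

print(allocate(["4KiB#0", "4KiB#1"], queue_depth=1))  # both on fast LSB pages
```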
Citations: 13
Optimizing Data Aggregation by Leveraging the Deep Memory Hierarchy on Large-scale Systems
Proceedings of the 2018 International Conference on Supercomputing. Pub Date: 2018-06-12. DOI: 10.1145/3205289.3205316
François Tessier, P. Gressier, V. Vishwanath
Abstract: Effective data aggregation is of paramount importance for data-centric applications, both to improve data movement for I/O and to facilitate complex workflows such as in-situ analysis and the coupling of models and data for multi-physics. A key challenge for data aggregation on current and upcoming architectures is the heterogeneity of memory and storage systems (including DRAM, MCDRAM, NVRAM, and parallel file systems). One has to take advantage of this hierarchy and the characteristics of each tier to achieve improved performance at scale. In this paper, we present a topology- and memory-aware data movement library performing data aggregation on large-scale systems. We first detail our hardware abstraction layer, which accomplishes code and performance portability across platforms. Next, we present a cost model that takes the system interconnect and memory properties into account to determine an appropriate location for aggregating data. We also describe how we implement the data aggregation mechanism in the read algorithm. Finally, we show how our approach improves data movement on a visualization cluster and a leadership-class supercomputer at up to 16K processes, using a benchmark and two typical I/O kernels. In particular, we demonstrate how our approach can decrease the I/O time of a classic workflow by 26%.
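A minimal sketch of a topology- and memory-aware placement decision, assuming the classic latency-plus-bandwidth linear cost model. The `link` function and the numbers below are made up for illustration; the paper's actual model also weighs the characteristics of each memory tier, which this sketch folds into the per-hop parameters.

```python
def transfer_cost(nbytes, latency_s, bandwidth_bps):
    """Classic linear cost model: alpha + beta * message size."""
    return latency_s + nbytes / bandwidth_bps

def best_aggregator(producers, candidates, link):
    """Pick the candidate node minimizing total aggregation cost.
    `link(src, dst)` returns (latency_s, bandwidth_bps) for a hop; in a
    real library it would come from the topology and memory layers."""
    def total(agg):
        return sum(transfer_cost(n, *link(src, agg)) for src, n in producers)
    return min(candidates, key=total)

# Toy 2-node example: aggregating on node 1 avoids moving the big buffer.
producers = [(0, 1 << 20), (1, 8 << 20)]   # (node, bytes to send)
def link(src, dst):
    return (0.0, float("inf")) if src == dst else (2e-6, 5e9)
print(best_aggregator(producers, candidates=[0, 1], link=link))  # -> 1
```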
Citations: 2
ReGraph
Proceedings of the 2018 International Conference on Supercomputing. Pub Date: 2018-06-12. DOI: 10.1145/3205289.3205292
X. Li, Mingxing Zhang, Kang Chen, Yongwei Wu
Abstract: "Think Like a Sub-Graph (TLASG)" is a philosophy proposed for guiding the design of graph-oriented programming models. As TLASG-based models allow information to flow freely inside a partition, they usually require far fewer iterations to converge than "Think Like a Vertex (TLAV)"-based models. In this paper, we further explore the idea of TLASG by enabling users to 1) proactively repartition the graph, and 2) efficiently scale down the problem's size. With these methods, our novel TLASG-based distributed graph processing system ReGraph requires even fewer iterations (typically ≤ 6) to converge, and hence achieves better performance (up to 45.4X) and scalability than existing TLAV- and TLASG-based frameworks. Moreover, we show that these optimizations can be enabled without a large change to the programming model. We also implement our novel algorithm directly on top of Spark and compare it with other Spark-based implementations, which shows that our speedup is not bound to our own platform.
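The TLASG intuition, that letting information flow freely inside a partition cuts global iterations, can be sketched with label-propagation connected components. This is a generic sub-graph-centric illustration, not ReGraph's API; `partitions` and `edges_between` are assumed inputs describing a pre-partitioned graph.

```python
def tlasg_components(partitions, edges_between):
    """Sub-graph-centric label propagation: each partition propagates
    minimum labels internally until it converges, and only then are
    boundary labels exchanged. Global supersteps drop sharply compared
    with vertex-centric one-hop-per-iteration propagation."""
    label = {v: v for part in partitions for v in part["vertices"]}
    changed = True
    while changed:
        changed = False
        for part in partitions:
            local = True               # "think like a sub-graph":
            while local:               # converge inside the partition
                local = False
                for u, v in part["edges"]:
                    m = min(label[u], label[v])
                    if label[u] != m or label[v] != m:
                        label[u] = label[v] = m
                        local = changed = True
        for u, v in edges_between:     # boundary exchange between partitions
            m = min(label[u], label[v])
            if label[u] != m or label[v] != m:
                label[u] = label[v] = m
                changed = True
    return label

parts = [
    {"vertices": [0, 1, 2], "edges": [(0, 1), (1, 2)]},
    {"vertices": [3, 4, 5], "edges": [(3, 4), (4, 5)]},
]
print(tlasg_components(parts, edges_between=[(2, 3)]))  # all labeled 0
```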
Citations: 4
Sculptor: Flexible Approximation with Selective Dynamic Loop Perforation
Proceedings of the 2018 International Conference on Supercomputing. Pub Date: 2018-06-12. DOI: 10.1145/3205289.3205317
Shikai Li, Sunghyun Park, S. Mahlke
Abstract: Loop perforation is one of the most well-known software techniques in approximate computing. It transforms loops to periodically skip subsets of their iterations. It is general, simple, and effective. However, during analysis it considers only the number of instructions to skip, not the differences between instructions and between loop iterations. Based on our observations, these differences have considerable influence on performance and accuracy. To improve traditional perforation, we introduce selective dynamic loop perforation, a general approximation technique that automatically transforms loops to skip selected instructions in selected iterations. It provides the flexibility to craft approximation strategies at the dynamic instruction level. The main challenges in selective dynamic loop perforation are how to capture the characteristics of instructions, optimize perforation strategies based on these characteristics, and minimize additional runtime overhead. In this paper, we propose several compiler optimizations to resolve these challenges, including optimized instruction-level, load-based, and store-based selective perforation, and self-directed dynamic perforation with a dynamic start and dynamic perforation rates. Across 8 applications from various domains, selective dynamic loop perforation achieves average speedups of 2.89x and 4.07x with 5% and 10% error budgets, while traditional loop perforation achieves 1.47x and 1.93x, respectively, for the same error budgets.
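The difference between traditional and selective perforation is easy to see on a toy loop with one expensive and one cheap-but-accuracy-critical instruction. Everything below is an illustrative assumption (the stand-in functions, the reuse-last-value policy); the paper's transformations operate at the compiler level on real loops, not in Python.

```python
import math

def expensive(x):   # stand-in for a costly computation
    return math.sqrt(abs(math.sin(x)))

def cheap(x):       # stand-in for cheap, accuracy-critical work
    return 0.01 * x

def traditional_perforation(xs, rate=2):
    """Skip whole iterations and extrapolate: both instructions are
    dropped in skipped iterations, costing accuracy in cheap(x) too."""
    return rate * sum(expensive(x) + cheap(x) for x in xs[::rate])

def selective_perforation(xs, rate=2):
    """Skip only the expensive instruction in perforated iterations,
    reusing its last computed value, while cheap(x) runs every time."""
    total, last = 0.0, 0.0
    for i, x in enumerate(xs):
        if i % rate == 0:
            last = expensive(x)    # selected iterations do the full work
        total += last + cheap(x)   # the cheap part is never perforated
    return total

xs = [float(i) for i in range(1000)]
exact = sum(expensive(x) + cheap(x) for x in xs)
print(exact, traditional_perforation(xs), selective_perforation(xs))
```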
Citations: 14
Isometry
Proceedings of the 2018 International Conference on Supercomputing. Pub Date: 2018-06-12. DOI: 10.1145/3205289.3205301
Zhihao Jia, Sean Treichler, G. Shipman, Patricia McCormick, A. Aiken
Abstract: Data transfers in parallel systems have a significant impact on the performance of applications. Most existing systems generally support only data transfers between memories with a direct hardware connection and have limited facilities for handling transformations to the data's layout in memory. As a result, to move data between memories that are not directly connected, higher levels of the software stack must explicitly divide a multi-hop transfer into a sequence of single-hop transfers and decide how and where to perform data layout conversions if needed. This approach results in inefficiencies, as the higher levels lack enough information to plan transfers as a whole, while the lower level that does the transfer sees only the individual single-hop requests. We present Isometry, a path-based distributed data transfer system. The Isometry path planner selects an efficient path for a transfer and submits it to the Isometry runtime, which is optimized for managing and coordinating the direct data transfers. The Isometry runtime automatically pipelines sequential direct transfers within a path and can incorporate flexible scheduling policies, such as prioritizing one transfer over another. Our evaluation shows that Isometry can speed up data transfers by up to 2.2x and reduce the completion time of high-priority transfers by up to 95% compared to the baseline Realm data transfer system. We evaluate Isometry on three benchmarks and show that Isometry reduces transfer time by up to 80% and overall completion time by up to 60%.
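Path-based planning over a memory graph can be sketched with plain Dijkstra. The node names, costs, and graph below are assumptions for illustration; the real planner also has to account for layout conversions and the pipelining that the runtime applies along the chosen path, which this sketch omits.

```python
import heapq

def plan_path(graph, src, dst):
    """Dijkstra over a memory/node graph where edge weights model
    per-hop transfer cost; returns the hop sequence a runtime could
    then pipeline. `graph[u]` maps neighbor -> cost."""
    dist, prev = {src: 0.0}, {}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue                      # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    path, node = [dst], dst
    while node != src:                    # walk predecessors back to src
        node = prev[node]
        path.append(node)
    return path[::-1]

# GPU memory on node A to GPU memory on node B: no direct link, so the
# planner routes through host DRAM on each side.
g = {"gpuA": {"dramA": 1.0}, "dramA": {"gpuA": 1.0, "dramB": 3.0},
     "dramB": {"dramA": 3.0, "gpuB": 1.0}, "gpuB": {"dramB": 1.0}}
print(plan_path(g, "gpuA", "gpuB"))  # ['gpuA', 'dramA', 'dramB', 'gpuB']
```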
Citations: 1
Reducing Data Movement on Large Shared Memory Systems by Exploiting Computation Dependencies
Proceedings of the 2018 International Conference on Supercomputing. Pub Date: 2018-06-12. DOI: 10.1145/3205289.3205310
Isaac Sánchez Barrera, Miquel Moretó, E. Ayguadé, Jesús Labarta, M. Valero, Marc Casas
Abstract: Shared memory systems are becoming increasingly complex as they typically integrate several storage devices. That brings different access latencies or bandwidth rates depending on the proximity between the cores where memory accesses are issued and the storage devices containing the requested data. In this context, techniques to manage and mitigate non-uniform memory access (NUMA) effects consist in migrating threads, memory pages, or both, and are generally applied by the system software. We propose techniques at the runtime system level to further mitigate the impact of NUMA effects on parallel applications' performance. We leverage runtime system metadata expressed in terms of a task dependency graph, where nodes are pieces of serial code and edges are control or data dependencies between them, to efficiently reduce data transfers. Our approach, based on graph partitioning, adds negligible overhead and is able to provide performance improvements up to 1.52X and average improvements of 1.12X with respect to the best state-of-the-art approach when deployed on a 288-core shared-memory system. Our approach reduces the coherence traffic by 2.28X on average with respect to the state-of-the-art.
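A minimal sketch of the underlying idea: treat the task dependency graph as a weighted graph and place tasks so that heavily shared data stays on one NUMA node. The greedy heuristic below is a deliberately simple stand-in for the paper's graph-partitioning approach; the edge weights and balance cap are illustrative assumptions.

```python
def greedy_partition(tasks, edges, num_nodes):
    """Greedy affinity partitioner: place each task on the NUMA node
    holding the most data shared with its already-placed neighbors,
    subject to a loose balance cap. Minimizing the weight of cut edges
    is what reduces cross-node data movement."""
    nbr = {t: [] for t in tasks}
    for u, v, w in edges:                  # (task, task, bytes shared)
        nbr[u].append((v, w))
        nbr[v].append((u, w))
    place, load = {}, [0] * num_nodes
    cap = len(tasks) / num_nodes * 1.1     # allow ~10% load imbalance
    for t in tasks:                        # tasks in dependency order
        score = [0.0] * num_nodes
        for v, w in nbr[t]:
            if v in place:
                score[place[v]] += w
        open_nodes = [n for n in range(num_nodes) if load[n] < cap]
        best = max(open_nodes or range(num_nodes),
                   key=lambda n: (score[n], -load[n]))
        place[t] = best
        load[best] += 1
    return place

# Two chains of data-sharing tasks end up on separate NUMA nodes.
tasks = ["a1", "a2", "a3", "b1", "b2", "b3"]
edges = [("a1", "a2", 8), ("a2", "a3", 8), ("b1", "b2", 8), ("b2", "b3", 8)]
print(greedy_partition(tasks, edges, num_nodes=2))
```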
Citations: 19
GRU
Proceedings of the 2018 International Conference on Supercomputing. Pub Date: 2018-06-12. DOI: 10.1145/3205289.3205318
Husheng Zhou, Soroush Bateni, Cong Liu
Abstract: Graphics processing units (GPUs) have been widely adopted by major cloud vendors for better performance and energy efficiency. Recent research has observed a considerable degree of redundancy in managing computation and data in many datacenters, particularly for several important categories of GPU-accelerated applications such as log mining and machine learning. In this paper, we present GRU, an ecosystem that smartly manages and shares GPU resources by exploiting redundancy. GRU transparently interprets GPU-accelerated computing requests and memoizes results for potential future reuse. To enhance reusability, GRU implements a partial result reuse idea, where GPU computation requests with different input data and even different functionality may become reusable w.r.t. each other. To guarantee the correctness of partial reuse, GRU employs a compiler-assisted approach that analyzes general data-parallel patterns that are reliable for reuse, and is capable of smartly recognizing such reusable data-parallel patterns in incoming requests. We have fully implemented GRU and conducted extensive experiments running micro-benchmarks on local machines and real-world applications, including Spark-based use cases, in an AWS cluster. Evaluation results show that GRU is effective in identifying and eliminating redundant GPU computations, achieving up to 5x (2.5x) speedup for compute-intensive (data-intensive) benchmarks. In addition, GRU-managed Spark observes a reduction of 25.3% (39.8%) on average in turnaround time (GPU occupation time) over state-of-the-art solutions.
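Result memoization with a flavor of partial reuse can be sketched for the easiest case, a pure element-wise kernel, by caching per input chunk rather than per whole request. This is an assumption-laden toy: real GPU kernels are not Python lambdas, and GRU proves reuse safety through compiler analysis of data-parallel patterns rather than assuming purity as done here.

```python
import hashlib
import pickle

_cache = {}

def _digest(obj) -> str:
    """Content digest used as part of the memoization key."""
    return hashlib.sha256(pickle.dumps(obj)).hexdigest()

def memoized_map(kernel_id, fn, chunks):
    """Partial result reuse for a pure, element-wise kernel: results are
    cached per (kernel, input-chunk), so a later request sharing only
    some chunks with an earlier one still reuses those chunks' results.
    `fn` stands in for the accelerated computation."""
    out = []
    for chunk in chunks:
        key = (kernel_id, _digest(chunk))
        if key not in _cache:
            _cache[key] = [fn(x) for x in chunk]   # the "GPU" computation
        out.extend(_cache[key])
    return out

a = memoized_map("square", lambda x: x * x, [[1, 2], [3, 4]])
b = memoized_map("square", lambda x: x * x, [[3, 4], [5, 6]])
# The second call recomputes only the [5, 6] chunk; [3, 4] is reused.
print(a, b)
```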
Citations: 1