{"title":"Model-Driven SIMD Code Generation for a Multi-resolution Tensor Kernel","authors":"Kevin Stock, Thomas Henretty, Iyyappa Murugandi, P. Sadayappan, R. Harrison","doi":"10.1109/IPDPS.2011.101","DOIUrl":"https://doi.org/10.1109/IPDPS.2011.101","url":null,"abstract":"In this paper, we describe a model-driven compile-time code generator that transforms a class of tensor contraction expressions into highly optimized short-vector SIMD code. We use as a case study a multi-resolution tensor kernel from the MADNESS quantum chemistry application. Performance of a C-based implementation is low, and because the dimensions of the tensors are small, performance using vendor optimized BLAS libraries is also sub optimal. We develop a model-driven code generator that determines the optimal loop permutation and placement of vector load/store, transpose, and splat operations in the generated code, enabling portable performance on short-vector SIMD architectures. Experimental results on an SSE-based platform demonstrate the efficiency of the vector-code synthesizer.","PeriodicalId":355100,"journal":{"name":"2011 IEEE International Parallel & Distributed Processing Symposium","volume":"96 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126396518","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Moving the Code to the Data - Dynamic Code Deployment Using ActiveSpaces","authors":"C. Docan, M. Parashar, J. Cummings, S. Klasky","doi":"10.1109/IPDPS.2011.120","DOIUrl":"https://doi.org/10.1109/IPDPS.2011.120","url":null,"abstract":"Managing the large volumes of data produced by emerging scientific and engineering simulations running on leadership-class resources has become a critical challenge. The data has to be extracted off the computing nodes and transported to consumer nodes so that it can be processed, analyzed, visualized, archived, etc. Several recent research efforts have addressed data-related challenges at different levels. One attractive approach is to offload expensive I/O operations to a smaller set of dedicated computing nodes known as a staging area. However, even using this approach, the data still has to be moved from the staging area to consumer nodes for processing, which continues to be a bottleneck. In this paper, we investigate an alternate approach, namely moving the data-processing code to the staging area rather than moving the data. Specifically, we present the Active Spaces framework, which provides (1) programming support for defining the data-processing routines to be downloaded to the staging area, and (2) run-time mechanisms for transporting binary codes associated with these routines to the staging area, executing the routines on the nodes of the staging area, and returning the results. We also present an experimental performance evaluation of Active Spaces using applications running on the Cray XT5 at Oak Ridge National Laboratory. Finally, we use a coupled fusion application workflow to explore the trade-offs between transporting data and transporting the code required for data processing during coupling, and we characterize the sweet spots for each option.","PeriodicalId":355100,"journal":{"name":"2011 IEEE International Parallel & Distributed Processing Symposium","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132210492","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HK-NUCA: Boosting Data Searches in Dynamic Non-Uniform Cache Architectures for Chip Multiprocessors","authors":"Javier Lira, Carlos Molina, Antonio González","doi":"10.1109/IPDPS.2011.48","DOIUrl":"https://doi.org/10.1109/IPDPS.2011.48","url":null,"abstract":"The exponential increase in the cache sizes of multicore processors (CMPs) accompanied by growing on-chip wire delays make it difficult to implement traditional caches with single and uniform access latencies. Non-Uniform Cache Architecture (NUCA) designs have been proposed to address this problem. NUCA divides the whole cache memory into smaller banks and allows nearer cache banks to have lower access latencies than farther banks, thus mitigating the effects of the cache's internal wires. Traditionally, NUCA organizations have been classified as static (S-NUCA) and dynamic (D-NUCA). While in S-NUCA a data block is mapped to a unique bank in the NUCA cache, D-NUCA allows a data block to be mapped in multiple banks. Besides, D-NUCA designs are dynamic in the sense that data blocks may migrate towards the cores that access them most frequently. Recent works consider D-NUCA as a promising design, however, in order to obtain significant performance benefits, they used a non-affordable access scheme mechanism to find data in the NUCA cache. In this paper, we propose a novel and implementable data search algorithm for D-NUCA designs in CMP architectures, which is called HK-NUCA (emph{Home Knows where to find data within the NUCA cache}). It exploits migration features by providing fast and power efficient accesses to data which is located close to the requesting core. Moreover, HK-NUCA implements an efficient and cost-effective search mechanism to reduce miss latency and on-chip network contention. We show that using HK-NUCA as data search mechanism in a D-NUCA design reduces about 40% energy consumed per each memory request, and achieves an average performance improvement of 6%.","PeriodicalId":355100,"journal":{"name":"2011 IEEE International Parallel & Distributed Processing Symposium","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130228987","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Tight Analysis of Relaxed Multi-organization Scheduling Algorithms","authors":"Daniel Cordeiro, P. Dutot, G. Mounié, D. Trystram","doi":"10.1109/IPDPS.2011.112","DOIUrl":"https://doi.org/10.1109/IPDPS.2011.112","url":null,"abstract":"The goal of this paper is to study how limited cooperation can impact the quality of the schedule obtained by multiple independent organizations in a typical grid computing platform. This relaxed version of the problem known as the Multi-Organization Scheduling Problem (MOSP) models an environment where organizations providing both resources and jobs tolerate a bounded degradation on the make span of their own jobs in order to minimize the make span over the entire platform. More precisely, the technical contributions are the following. First, we improve the existing in approximation bounds for this problem proving that what was previously though as not polynomially approximable ({it unless $P=NP$}) is actually not approximable at all. We achieve this using two families of instances whose Pareto optimal solutions are on par with the previous in aproximability bounds. Then, we present two algorithms that solve the problem with approximation ratios of (2, 3/2) and (3, 4/3) respectively. This means that when using the first (second) algorithm, if an organization tolerates that the completion time of its last job cannot exceed twice (three times) the time it would have obtained by itself, then the algorithm provides a solution that is a 3/2-approximation (4/3-approximation) for the optimal global make span. Both algorithms are efficient since their performance ratio correspond to the Pareto optimal solutions of the previously defined instances.","PeriodicalId":355100,"journal":{"name":"2011 IEEE International Parallel & Distributed Processing Symposium","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131759237","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Tutorial Statement","authors":"B. Palmer, M. Krishnan, Abhinav Vishnu","doi":"10.1109/IPDPS.2011.423","DOIUrl":"https://doi.org/10.1109/IPDPS.2011.423","url":null,"abstract":"This tutorial will provide an overview of the Global Arrays (GA) programming toolkit and describe its capabilities, performance, and the use of GA in high performance computing applications. The tutorial will review basic concepts in onesided communication and Global Address Space languages and will discuss basic setup and elementary communication using GA. More advanced topics, including global counters, non-blocking communication, and sparse data structures will also be discussed. If time permits, the tutorial will also present new/advanced capabilities and ongoing research activities.","PeriodicalId":355100,"journal":{"name":"2011 IEEE International Parallel & Distributed Processing Symposium","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134578553","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Offline Scheduling of Multi-threaded Request Streams on a Caching Server","authors":"V. Rehn-Sonigo, D. Trystram, Frédéric Wagner, Haifeng Xu, Guochuan Zhang","doi":"10.1109/IPDPS.2011.111","DOIUrl":"https://doi.org/10.1109/IPDPS.2011.111","url":null,"abstract":"In this work, we are interested in the problem of satisfying multiple concurrent requests submitted to a computing server. Informally, there are users each sending a sequence of requests to the server. The requests consist of tasks linked by precedence constraints. Tasks may occur several times in the same sequence as well as in a request sequence of another user. The computing server has to execute tasks with variable processing times. The server owns a cache of limited size where intermediate results of the processing may be stored. If an intermediate result for a task is stored into the cache, no processing cost has to be paid and the result can directly be fetched from the cache. The goal of this work is to determine a schedule of the tasks such that an optimization function is minimized (the only objective studied up to now is the make span). This problem is a variant of caching which considers only one sequence of requests. We then extend the study to the minimization of the mean completion time of the request sequences. Two models are considered. In the first model, caching is forced whereas in the second model caching is optional and one can choose whether an intermediate result is stored in the cache or not. All combinations turn out to be NP-hard for fixed cache sizes and we provide a formulation as dynamic program as well as bounds for in approximation. We propose polynomial time approximation algorithms for some variants and analyze their approximation ratios. Finally, we also devise some heuristics and present experimental results. Tasks may occur several times in the same sequence as well as in a request sequence of another user. The computing server has to execute tasks with variable processing times. The server owns a cache of limited size where intermediate results of the processing may be stored. If an intermediate result for a task is stored into the cache, no processing cost has to be paid and the result can directly be fetched from the cache. The goal of this work is to determine a schedule of the tasks such that an optimization function is minimized (the only objective studied up to now is the make span). This problem is a variant of caching which considers only one sequence of requests. We then extend the study to the minimization of the mean completion time of the request sequences. Two models are considered. In the first model, caching is forced whereas in the second model caching is optional and one can choose whether an intermediate result is stored in the cache or not. All combinations turn out to be NP-hard for fixed cache sizes and we provide a formulation as dynamic program as well as bounds for in approximation. We propose polynomial time approximation algorithms for some variants and analyze their approximation ratios. 
Finally, we also devise some heuristics and present experimental results.","PeriodicalId":355100,"journal":{"name":"2011 IEEE International Parallel & Distributed Processing Symposium","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126659237","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
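As a small concrete illustration of the cost model only (not the paper's algorithms), the sketch below evaluates the makespan of one fixed task order under the forced-caching model; the task names, processing times, cache size, and the LRU eviction choice are assumptions for illustration, whereas in the paper the schedule itself decides cache contents.

```c
/* Illustrative evaluation of one schedule under a forced-caching model: a task whose
 * result is already cached costs 0, otherwise it costs its processing time and its
 * result is inserted into the cache (LRU eviction is an assumed stand-in policy). */
#include <stdio.h>
#include <string.h>

#define CACHE_SIZE 2
#define NTASKS 6

int main(void)
{
    /* One interleaved request stream; repeated task ids model tasks shared
     * between users' request sequences. */
    const char *order[NTASKS] = { "A", "B", "A", "C", "B", "A" };
    const int   cost[NTASKS]  = {  5,   3,   5,   4,   3,   5  };  /* processing times */

    const char *cache[CACHE_SIZE] = { 0 };
    int makespan = 0;

    for (int t = 0; t < NTASKS; t++) {
        int hit = -1;
        for (int c = 0; c < CACHE_SIZE; c++)
            if (cache[c] && strcmp(cache[c], order[t]) == 0) hit = c;
        if (hit < 0) {
            makespan += cost[t];                      /* pay the processing time on a miss */
            for (int c = CACHE_SIZE - 1; c > 0; c--)  /* evict the LRU slot, shift the rest */
                cache[c] = cache[c - 1];
            cache[0] = order[t];
        } else {
            for (int c = hit; c > 0; c--) cache[c] = cache[c - 1];  /* move hit to MRU slot */
            cache[0] = order[t];
        }
    }
    /* With this order the misses are A, B, C, B, A: 5 + 3 + 4 + 3 + 5 = 20. */
    printf("makespan = %d\n", makespan);
    return 0;
}
```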
{"title":"Architecture-aware Algorithms and Software for Peta and Exascale Computing","authors":"J. Dongarra","doi":"10.1109/IPDPS.2011.419","DOIUrl":"https://doi.org/10.1109/IPDPS.2011.419","url":null,"abstract":"Summary form only given. In this talk we examine how high performance computing has changed over the last 10-years and look toward the future in terms of trends. These changes have had and will continue to have a major impact on our software. Some of the software and algorithm challenges have already been encountered, such as management of communication and memory hierarchies through a combination of compile-time and run-time techniques, but the increased scale of computation, depth of memory hierarchies, range of latencies, and increased run-time environment variability will make these problems much harder. We will look at five areas of research that will have an importance impact in the development of software and algorithms. We will focus on following themes: (1) Redesign of software to fit multicore and hybrid architectures (2) Automatically tuned application software (3) Exploiting mixed precision for performance (4) The importance of fault tolerance (5) Communication avoiding algorithms.","PeriodicalId":355100,"journal":{"name":"2011 IEEE International Parallel & Distributed Processing Symposium","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127382347","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"H-Code: A Hybrid MDS Array Code to Optimize Partial Stripe Writes in RAID-6","authors":"Chentao Wu, Shenggang Wan, Xubin He, Q. Cao, C. Xie","doi":"10.1109/IPDPS.2011.78","DOIUrl":"https://doi.org/10.1109/IPDPS.2011.78","url":null,"abstract":"RAID-6 is widely used to tolerate concurrent failures of any two disks to provide a higher level of reliability with the support of erasure codes. Among many implementations, one class of codes called {bfseries{M}}aximum {bfseries{D}}istance {bfseries{S}}eparable ({bfseries{MDS}}) codes aims to offer data protection against disk failures with optimal storage efficiency. Typical MDS codes contain horizontal and vertical codes. Due to the horizontal parity, in the case of emph{partial stripe write} (refers to I/O operations that write new data or update data to a subset of disks in an array) in a row, horizontal codes may get less I/O operations in most cases, but suffer from unbalanced I/O distribution. They also have limitation on high single write complexity. Vertical codes improve single write complexity compared to horizontal codes, while they still suffer from poor performance in partial stripe writes. In this paper, we propose a new XOR-based MDS array code, named Hybrid Code (H-Code), which optimizes partial stripe writes for RAID-6 by taking advantages of both horizontal and vertical codes. H-Code is a solution for an array of $(p+1)$ disks, where $p$ is a prime number. Unlike other codes taking a dedicated anti-diagonal parity strip, H-Code uses a special anti-diagonal parity layout and distributes the anti-diagonal parity elements among disks in the array, which achieves a more balanced I/O distribution. On the other hand, the horizontal parity of H-Code ensures a partial stripe write to continuous data elements in a row share the same row parity chain, which can achieve optimal partial stripe write performance. Not only within a row but also within a stripe, H-Code offers optimal partial stripe write complexity to two continuous data elements and optimal partial stripe write performance among all MDS codes to the best of our knowledge. Specifically, compared to RDP and EVENODD codes, H-Code reduces I/O cost by up to $15.54%$ and $22.17%$. Overall, H-code has optimal storage efficiency, optimal encoding/decoding computational complexity, optimal complexity of both single write and partial stripe write.","PeriodicalId":355100,"journal":{"name":"2011 IEEE International Parallel & Distributed Processing Symposium","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127490415","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Online Adaptive Code Generation and Tuning","authors":"Ananta Tiwari, J. Hollingsworth","doi":"10.1109/IPDPS.2011.86","DOIUrl":"https://doi.org/10.1109/IPDPS.2011.86","url":null,"abstract":"In this paper, we present a runtime compilation and tuning framework for parallel programs. We extend our prior work on our auto-tuner, Active Harmony, for tunable parameters that require code generation (for example, different unroll factors). For such parameters, our auto-tuner generates and compiles new code on-the-fly. Effectively, we merge traditional feedback directed optimization and just-in-time compilation. We show that our system can leverage available parallelism in today's HPC platforms by evaluating different code-variants on different nodes simultaneously. We evaluate our system on two parallel applications and show that our system can improve runtime execution by up to 46% compared to the original version of the program.","PeriodicalId":355100,"journal":{"name":"2011 IEEE International Parallel & Distributed Processing Symposium","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115593038","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Smith-Waterman Alignment of Huge Sequences with GPU in Linear Space","authors":"E. Sandes, A. Melo","doi":"10.1109/IPDPS.2011.114","DOIUrl":"https://doi.org/10.1109/IPDPS.2011.114","url":null,"abstract":"Cross-species chromosome alignments can reveal ancestral relationships and may be used to identify the peculiarities of the species. It is thus an important problem in Bioinformatics. So far, aligning huge sequences, such as whole chromosomes, with exact methods has been regarded as unfeasible, due to huge computing and memory requirements. However, high performance computing platforms such as GPUs are being able to change this scenario, making it possible to obtain the exact result for huge sequences in reasonable time. In this paper, we propose and evaluate a parallel algorithm that uses GPU to align huge sequences, executing the Smith-Waterman algorithm combined with Myers-Miller, with linear space complexity. In order to achieve that, we propose optimizations that are able to reduce significantly the amount of data processed and that enforce full parallelism most of the time. Using the GTX 285 Board, our algorithm was able to produce the optimal alignment between sequences composed of 33 Millions of Base Pairs (MBP) and 47 MBP in 18.5 hours.","PeriodicalId":355100,"journal":{"name":"2011 IEEE International Parallel & Distributed Processing Symposium","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114444801","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}