{"title":"Taming Single-Thread Program Performance on Many Distributed On-Chip L2 Caches","authors":"Lei Jin, Sangyeun Cho","doi":"10.1109/ICPP.2008.29","DOIUrl":"https://doi.org/10.1109/ICPP.2008.29","url":null,"abstract":"This paper presents a two-part study on managing distributed NUCA (non-uniform cache architecture) L2caches in a future many core processor to obtain high single thread program performance. The first part of our study is a limit study where we determine data to cache slice mappings at the memory page granularity based on detailed inter-page conflict information derived from program's memory reference trace. By considering cache access latency and cache miss rate simultaneously when mapping data to L2 cache slices, this \"oracle\" scheme outperforms the conventional shared caching scheme by up to 208% with an average of 45% on a sixteen-core processor. In the second part of the study, we propose and evaluate a dynamic cache management scheme that determines the home cache slice and cache bin for memory pages without any static program information. The dynamic scheme outperforms the shared caching scheme by up to 191% with an average of 32%, achieving much of the performance we observed in the limit study. 
We also find that the proposed dynamic scheme adapts to multiprogrammed workloads' behavior well and performs significantly better than both the private caching scheme and the shared caching scheme.","PeriodicalId":388408,"journal":{"name":"2008 37th International Conference on Parallel Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128969325","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalable Dynamic Load Balancing Using UPC","authors":"Stephen L. Olivier, J. Prins","doi":"10.1109/ICPP.2008.19","DOIUrl":"https://doi.org/10.1109/ICPP.2008.19","url":null,"abstract":"An asynchronous work-stealing implementation of dynamic load balance is implemented using Unified Parallel C (UPC) and evaluated using the Unbalanced Tree Search (UTS) benchmark [Olivier, S., et al., 2007]. The UTS benchmark presents a synthetic tree-structured search space that is highly imbalanced. Parallel implementation of the search requires continuous dynamic load balancing to keep all processors engaged in the search. Our implementation achieves better scaling and parallel efficiency in both shared memory and distributed memory settings than previous efforts using UPC [Olivier, S., et al., 2007] and MPI [Dinan, J., et al., 2007]. We observe parallel efficiency of 80% using 1024 processors performing over 85,000 total load balancing operations per second continuously. The UPC programming model provides substantial simplifications in the expression of the asynchronous work stealing protocol compared with MPI. However, to obtain performance portability with UPC in both shared memory and distributed memory settings requires the careful use of one sided reads and writes to minimize the impact of high latency communication. 
Additional protocol improvements are made to improve dissemination of available work and to decrease the cost of termination detection.","PeriodicalId":388408,"journal":{"name":"2008 37th International Conference on Parallel Processing","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129014359","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Overcoming Scalability Challenges for Tool Daemon Launching","authors":"D. Ahn, D. Arnold, B. Supinski, Gregory L. Lee, B. Miller, M. Schulz","doi":"10.1109/ICPP.2008.63","DOIUrl":"https://doi.org/10.1109/ICPP.2008.63","url":null,"abstract":"Many tools that target parallel and distributed environments must co-locate a set of daemons with the distributed processes of the target application. However, efficient and portable deployment of these daemons on large scale systems is an unsolved problem. We overcome this gap with LaunchMON, a scalable, robust, portable, secure, and general purpose infrastructure for launching tool daemons. Its API allows tool builders to identify all processes of a target job, launch daemons on the relevant nodes and control daemon interaction. Our results show that LaunchMON scales to very large daemon counts and substantially enhances performance over existing ad hoc mechanisms.","PeriodicalId":388408,"journal":{"name":"2008 37th International Conference on Parallel Processing","volume":"107 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127206925","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Impacts of Indirect Blocks on Buffer Cache Energy Efficiency","authors":"Jianhui Yue, Yifeng Zhu, Zhao Cai","doi":"10.1109/ICPP.2008.60","DOIUrl":"https://doi.org/10.1109/ICPP.2008.60","url":null,"abstract":"Indirect blocks, part of a file's metadata used for locating this file's data blocks, are typically treated indistinguishably from file's data blocks in buffer cache. This paper shows that this conventional approach will significantly detriment the overall energy efficiency of memory systems. Scattering small but frequently accessed indirected blocks over allmemory chips reduce the energy saving opportunities. We propose a new energy-efficient buffer cache management scheme, named MEEP, which separates indirect and datablocks into different memory chips. Our trace-driven simulation results show that our new scheme can save memory energy up to 16.8% and 15.4% in the I/O-intensive server workloads TPC-R and TPC-H, respectively.","PeriodicalId":388408,"journal":{"name":"2008 37th International Conference on Parallel Processing","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128714187","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GeWave: Geographically-Aware Wave for File Consistency Maintenance in P2P Systems","authors":"Haiying Shen","doi":"10.1109/ICPP.2008.52","DOIUrl":"https://doi.org/10.1109/ICPP.2008.52","url":null,"abstract":"File consistency maintenance in P2P systems is a technique for maintaining consistency between files and their replicas. Most traditional consistency maintenance methods depend on either message spreading or structure for update propagation by pushing. Message spreading generates high overhead due to redundant messages, and cannot guarantee that every replica node receives an update. Structure-based pushing methods reduce the overhead but cannot guarantee timely consistency in churn. Moreover, most methods are unable to consider physical proximity to improve efficiency. To further reduce update overhead, enhance guarantee of consistency, and take proximity into account, this paper presents a geographically-aware Wave method (GeWave). Depending on adaptive polling in a dynamic structure, GeWave avoids redundant file updates by dynamically adapting to time-varying file update and query rates, and ensures the consistency of query results even in churn. Furthermore, it conducts update propagation between geographically close nodes in a distributed manner. Simulation results demonstrate the efficiency of GeWave in comparison with other representative consistency maintenance schemes. 
It dramatically reduces the overhead and yields significant improvements on efficiency and scalability of file consistency maintenance schemes.","PeriodicalId":388408,"journal":{"name":"2008 37th International Conference on Parallel Processing","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123634491","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Utility-Based Distributed Routing in Intermittently Connected Networks","authors":"Ze Li, Haiying Shen","doi":"10.1109/ICPP.2008.77","DOIUrl":"https://doi.org/10.1109/ICPP.2008.77","url":null,"abstract":"Intermittently connected mobile networks don't have a complete path from a source to a destination at most of the time. Such an environment can be found in very sparse mobile networks where nodes meet only occasionally or in wireless sensor networks where nodes always sleep to conserve energy. Current transmission approaches in such networks are primarily based on: multi-copy flooding scheme and single-copy forwarding scheme. However, they incur either high overheads due to excessive transmissions or long delay due to possible incorrect choices during forwarding. In this paper, we propose a A utility-based distributed routing algorithm with multi-copies called UDM, in which a packet is initially replicated to a certain number of its neighbor nodes, which sequentially forward those packets to the destination node based on a probabilistic routing scheme. Some buffer management methods are also proposed to further improve its performance. 
Theoretical analyze and simulations show that compared to epidemic routing, spray and wait routing, UDM routing scheme provides a nearly optimal delay performance with a stable packet arrive rate in the community mobility model.","PeriodicalId":388408,"journal":{"name":"2008 37th International Conference on Parallel Processing","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124289899","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel Inferencing for OWL Knowledge Bases","authors":"R. Soma, V. Prasanna","doi":"10.1109/ICPP.2008.64","DOIUrl":"https://doi.org/10.1109/ICPP.2008.64","url":null,"abstract":"We examine the problem of parallelizing the inferencing process for OWL knowledge-bases. A key challenge in this problem is partitioning the computational workload of this process to minimize duplication of computation and the amount of data communicated among processors. We investigate two approaches to address this challenge. In the data partitioning approach, the data-set is partitioned into smaller units, which are then processed independently. In the rule partitioning approach the rule-base is partitioned and the smaller rule-bases are applied to the complete data set. We present various algorithms for the partitioning and analyze their advantages and disadvantages. A parallel inferencing algorithm is presented which uses the partitions that are created by the two approaches. We then present an implementation based on a popular open source OWL reasoner and on a networked cluster. Our experimental results show significant speedups for some popular benchmarks, thus making this a promising approach.","PeriodicalId":388408,"journal":{"name":"2008 37th International Conference on Parallel Processing","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126209792","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Memory Access Scheduling Schemes for Systems with Multi-Core Processors","authors":"Hongzhong Zheng, Jiang Lin, Zhao Zhang, Zhichun Zhu","doi":"10.1109/ICPP.2008.53","DOIUrl":"https://doi.org/10.1109/ICPP.2008.53","url":null,"abstract":"On systems with multi-core processors, the memory access scheduling scheme plays an important role not only in utilizing the limited memory bandwidth but also in balancing the program execution on all cores. In this study, we propose a scheme, called ME-LREQ, which considers the utilization of both processor cores and memory subsystem. It takes into consideration both the long-term and short-term gains of serving a memory request by prioritizing requests hitting on the row buffers and from the cores that can utilize memory more efficiently and have fewer pending requests. We have also thoroughly evaluated a set of memory scheduling schemes that differentiate and prioritize requests from different cores. Our simulation results show that for memory-intensive, multiprogramming workloads, the new policy improves the overall performance by 10.7% on average and up to 17.7% on a four-core processor, when compared with scheme that serves row buffers hit memory requests first and allows memory reads bypassing writes; and by up to 9.2% (6.4% on average) when compared with the scheme that serves requests from the core with the fewest pending requests first.","PeriodicalId":388408,"journal":{"name":"2008 37th International Conference on Parallel Processing","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-09","publicationTypes":"Journal 
Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127349072","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bandwidth-Efficient Continuous Query Processing over DHTs","authors":"Yingwu Zhu","doi":"10.1109/ICPP.2008.11","DOIUrl":"https://doi.org/10.1109/ICPP.2008.11","url":null,"abstract":"In this paper, we propose novel techniques to reduce bandwidth cost in a continuous keyword query processing system that is based on a distributed hash table. We argue that query indexing and document announcement are of significant importance towards this goal. Our detailed simulations show that our proposed techniques, combined together, effectively and greatly reduce bandwidth cost.","PeriodicalId":388408,"journal":{"name":"2008 37th International Conference on Parallel Processing","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121795720","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On-the-Fly Recovery of Job Input Data in Supercomputers","authors":"Chao Wang, Zhe Zhang, Sudharshan S. Vazhkudai, Xiaosong Ma, F. Mueller","doi":"10.1109/ICPP.2008.28","DOIUrl":"https://doi.org/10.1109/ICPP.2008.28","url":null,"abstract":"Storage system failure is a serious concern as we approach Petascale computing. Even at today's sub-Petascale levels, I/O failure is the leading cause of downtimes and job failures. We contribute a novel, on-the-fly recovery framework for job input data into supercomputer parallel file systems. The framework exploits key traits of the HPC I/O workload to reconstruct lost input data during job execution from remote, immutable copies. Each reconstructed data stripe is made immediately accessible in the client request order due to the delayed metadata update and fine-granular locking while unrelated access to the same file remains unaffected. We have implemented the recovery component within the Lustre parallel file system, thus building a novel application-transparent online recovery solution. Our solution is integrated into Lustre's two-level locking scheme using a two-phase blocking protocol. Combining parametric and simulation studies, our experiments demonstrate a significant improvement in HPC center service ability and user job turnaround time.","PeriodicalId":388408,"journal":{"name":"2008 37th International Conference on Parallel Processing","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133950733","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}