{"title":"Handling Conflicts with Compiler's Help in Software Transactional Memory Systems","authors":"Sandya Mannarswamy, R. Govindarajan","doi":"10.1109/ICPP.2010.56","DOIUrl":"https://doi.org/10.1109/ICPP.2010.56","url":null,"abstract":"Atomic sections are supported in software through the use of optimistic concurrency by using Software Transactional Memory (STM). However STM implementations incur high overheads which reduce the wide-spread use of this approach by programmers. Conflicts are a major source of overheads in STMs. The basic performance premise of a transactional memory system is the optimistic concurrency principle wherein data updates executed by the transactions are to disjoint objects/memory locations, referred to as Disjoint Access Parallel (DAP). Otherwise, the updates conflict, and all but one of the transactions are aborted. Such aborts result in wasted work and performance degradation. While contention management systems in STM implementations try to reduce conflicts by various runtime feedback control mechanisms, they are not aware of the application’s structure and data access patterns and hence typically act after the conflicts have occurred. In this paper we propose a scheme based on compiler analysis, which can identify static atomic sections whose instances, when executed concurrently by more than one thread always conflict. Such an atomic section is referred to as Always Conflicting Atomic Section (ACAS). We propose and evaluate two techniques Selective Pessimistic Concurrency Control (SPCC) and compiler inserted Early Conflict Checks (ECC) which can help reduce the STM overheads caused by ACAS. We show that these techniques help reduce the aborts in 4 of the STAMP benchmarks by up to 27.52% while improving performance by 1.24% to 19.31%.","PeriodicalId":180554,"journal":{"name":"2010 39th International Conference on Parallel Processing","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126684772","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Gray-Box Feedback Control Approach for System-Level Peak Power Management","authors":"Jiayu Gong, Chengzhong Xu","doi":"10.1109/ICPP.2010.63","DOIUrl":"https://doi.org/10.1109/ICPP.2010.63","url":null,"abstract":"Power consumption has become one of the most important design considerations for modern high density servers. To avoid system failures caused by power capacity overload or overheating, system-level power management is required. This kind of management needs to control power consumption precisely. Conventional solutions to this problem mostly rely on feedback controllers which only concern the power itself, known as black-box approaches. They may not respond to the variation of system quickly. This paper presents a gray-box strategy to design a model-predictive feedback controller based on a pre-built power model and a performance prediction model to constraint the peak power consumption of a server. In contrast to the existing strategies, this gray-box approach uses the performance events, which bring more insights of the behaviors and power consumption of a system, for the purpose of model prediction. We implemented a prototype of this controller and evaluated it using SPECweb2005 benchmark on a web server. This controller can settle the power consumption below the power cap within 2 control periods for more than 75% of the power overloading regardless of workload variations, outperforming black-box approaches. Meanwhile, the performance of application can be maximized with this controller.","PeriodicalId":180554,"journal":{"name":"2010 39th International Conference on Parallel Processing","volume":"545 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132446287","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Heterogeneous Mini-rank: Adaptive, Power-Efficient Memory Architecture","authors":"Kun Fang, Hongzhong Zheng, Zhichun Zhu","doi":"10.1109/ICPP.2010.11","DOIUrl":"https://doi.org/10.1109/ICPP.2010.11","url":null,"abstract":"Memory power consumption has become a big concern in server platforms. A recently proposed mini-rank architecture reduces the memory power consumption by breaking each DRAM rank into multiple narrow mini-ranks and activating fewer devices for each request. However, its fixed and uniform configuration may degrade performance significantly or lose power saving opportunities on some workloads. We propose a heterogeneous mini-rank design that sets the near-optimal configuration for each workload based on its memory access behavior and its memory bandwidth requirement. Compared with the original, homogeneous mini-rank design, the heterogeneous mini-rank design can balance between the performance and power saving and avoid large performance loss. For instance, for multiprogramming workloads with SPEC2000 application running on a quad-core system with two-channel DDR3-1066 memory, on average, the heterogeneous mini-rank can reduce the memory power by 53.1% (up to 60.8%) with the performance loss of 4.6% (up to 11.1%), compared with a conventional memory system. In comparison, the x32 homogeneous mini-rank can only save memory power by up to 29.8%; and the x8 homogeneous mini-rank will cause performance loss by up to 22.8%. Compared with x16 homogeneous mini-rank configuration, it can further reduce the EDP (energy-delay product) by up to 15.5% (10.0% on average).","PeriodicalId":180554,"journal":{"name":"2010 39th International Conference on Parallel Processing","volume":"50 5","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114027732","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Scalable Parallel Algorithm for Large-Scale Protein Sequence Homology Detection","authors":"Changjun Wu, A. Kalyanaraman, W. Cannon","doi":"10.1109/ICPP.2010.41","DOIUrl":"https://doi.org/10.1109/ICPP.2010.41","url":null,"abstract":"Protein sequence homology detection is a fundamental problem in computational molecular biology, with a pervasive application in nearly all analyses that aim to structurally and functionally characterize protein molecules. While detecting homology between two protein sequences is computationally inexpensive, detecting pairwise homology at a large-scale becomes prohibitive, requiring millions of CPU hours. Yet, there is currently no efficient method available to parallelize this kernel. In this paper, we present the key characteristics that make this problem particularly hard to parallelize, and then propose a new parallel algorithm that is suited for large-scale protein sequence data. Our method, called pGraph, is designed using a hierarchical multiple-master multiple-worker model, where the processor space is partitioned into subgroups and the hierarchy helps in ensuring the workload is load balanced fashion despite the inherent irregularity that may originate in the input. Experimental evaluation demonstrates that our method scales linearly on all input sizes tested (up to 640K sequences) on a 1,024 node supercomputer. In addition to demonstrating strong scaling, we present an extensive study of the various components of the system and related parametric studies.","PeriodicalId":180554,"journal":{"name":"2010 39th International Conference on Parallel Processing","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122318280","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Characterizing the Relation Between Apex-Map Synthetic Probes and Reuse Distance Distributions","authors":"K. Ibrahim, E. Strohmaier","doi":"10.1109/ICPP.2010.43","DOIUrl":"https://doi.org/10.1109/ICPP.2010.43","url":null,"abstract":"Characterizing a memory reference stream using reuse distance distribution can enable predicting the performance on a given architecture. Benchmarks can subject an architecture to a limited set of reuse distance distributions, but it cannot exhaustively test it. In contrast, Apex-Map, a synthetic memory probe with parameterized locality, can provide a better coverage of the machine use scenarios. Unfortunately, it requires a lot of expertise to relate an application memory behavior to an Apex-Map parameter set. In this work we present a mathematical formulation that describes the relation between Apex-Map and reuse distance distributions. We also introduce a process through which we can automate the estimation of Apex-Map locality parameters for a given application. This process finds the best parameters for Apex-Map probes that generate a reuse distance distribution similar to that of the original application. We tested this scheme on benchmarks from Scalable Synthetic Compact Applications and Unbalanced Tree Search, and we show that this scheme provides an accurate Apex-Map parameterization with a small percentage of mismatch in reuse distance distributions, about 3% in average and less than 8% in the worst case, on the tested applications.","PeriodicalId":180554,"journal":{"name":"2010 39th International Conference on Parallel Processing","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114721559","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Lightweight, GPU-Based Software RAID System","authors":"M. Curry, L. Ward, A. Skjellum, R. Brightwell","doi":"10.1109/ICPP.2010.64","DOIUrl":"https://doi.org/10.1109/ICPP.2010.64","url":null,"abstract":"While RAID is the prevailing method of creating reliable secondary storage infrastructure, many users desire more flexibility than offered by current implementations. Traditionally, RAID capabilities have been implemented largely in hardware in order to achieve the best performance possible, but hardware RAID has rigid designs that are costly to change. Software implementations are much more flexible, but software RAID has historically been viewed as much less capable of high throughput than hardware RAID controllers. This work presents a system, Gibraltar RAID, that attains high RAID performance by offloading the calculations related to error correcting codes to GPUs. This paper describes the architecture, performance, and qualities of the system. A comparison to a well-known software RAID implementation, the md driver included with the Linux operating system, is presented. While this work is presented in the context of high performance computing, these findings also apply to a general RAID market.","PeriodicalId":180554,"journal":{"name":"2010 39th International Conference on Parallel Processing","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121917656","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Building Efficient Content-Based Publish/Subscribe Systems over Structured P2P Overlays","authors":"S. Zhang, Ji Wang, Rui Shen, Jie Xu","doi":"10.1109/ICPP.2010.33","DOIUrl":"https://doi.org/10.1109/ICPP.2010.33","url":null,"abstract":"In this paper, we introduce a generic model to deal with the event matching problem of content-based publish/subscribe systems over structured P2P overlays. In this model, we claim that there are three methods (event-oriented, subscription-oriented and hybrid) to make all the matched pairs (event, subscription) meet in a system. By theoretically analyzing the inherent problem of both event-oriented and subscription-oriented methods, we propose PEM (Popularity-based Event Matching), a variant of hybrid method. PEM can achieve better trade-off between event processing load and subscription storage load of a system. PEM has been verified through both mathematical and simulation-based evaluation.","PeriodicalId":180554,"journal":{"name":"2010 39th International Conference on Parallel Processing","volume":"103 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124724712","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Power Management in Heterogeneous Multi-tier Web Clusters","authors":"Peijian Wang, Yongwei Qi, Xue Liu, Ying Chen, Xiao Zhong","doi":"10.1109/ICPP.2010.46","DOIUrl":"https://doi.org/10.1109/ICPP.2010.46","url":null,"abstract":"Complex web applications are usually served by multi-tier web clusters. With the growing cost of energy, the importance of reducing power consumption in server systems is now well-known and has become a major research topic. However, most of previous works focused solely on homogeneous clusters. This paper addresses the challenge of power management in Heterogeneous Multi-tier Web Clusters. We apply Generalized Benders Decomposition (GBD) to decompose the global optimization problem into small sub-problems. This algorithm achieves the optimal solution in an iterative fashion. The simulation results show that our algorithm achieve more energy conservation than the previous works.","PeriodicalId":180554,"journal":{"name":"2010 39th International Conference on Parallel Processing","volume":"1999 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116921254","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dual-Phase Just-in-Time Workflow Scheduling in P2P Grid Systems","authors":"S. Di, Cho-Li Wang","doi":"10.1109/ICPP.2010.31","DOIUrl":"https://doi.org/10.1109/ICPP.2010.31","url":null,"abstract":"This paper presents a fully decentralized just-in-time workflow scheduling method in a P2P Grid system. The proposed solution allows each peer node to autonomously dispatch inter-dependent tasks of workflows to run on geographically distributed computers. To reduce the workflow completion time and enhance the overall execution efficiency, not only does each node perform as a scheduler to distribute its tasks to execution nodes (or resource nodes), but the resource nodes will also set the execution priorities for the received tasks. By taking into account the unpredictability of tasks’ finish time, we devise an efficient task scheduling heuristic, namely dynamic shortest makespan first (DSMF), which could be applied at both scheduling phases for determining the priority of the workflow tasks. We compare the performance of the proposed algorithm against seven other heuristics by simulation. Our algorithm achieves 20%~60% reduction on the average completion time and 37.5%~90% improvement on the average workflow execution efficiency over other decentralized algorithms.","PeriodicalId":180554,"journal":{"name":"2010 39th International Conference on Parallel Processing","volume":"81 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121462855","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SAM: A Semantic-Aware Multi-tiered Source De-duplication Framework for Cloud Backup","authors":"Yujuan Tan, Hong Jiang, D. Feng, Lei Tian, Zhichao Yan, Guohui Zhou","doi":"10.1109/ICPP.2010.69","DOIUrl":"https://doi.org/10.1109/ICPP.2010.69","url":null,"abstract":"Existing de-duplication solutions in cloud backup environment either obtain high compression ratios at the cost of heavy de-duplication overheads in terms of increased latency and reduced throughput, or maintain small de-duplication overheads at the cost of low compression ratios causing high data transmission costs, which results in a large backup window. In this paper, we present SAM, a Semantic-Aware Multitiered source de-duplication framework that first combines the global file-level de-duplication and local chunk-level deduplication, and further exploits file semantics in each stage in the framework, to obtain an optimal tradeoff between the deduplication efficiency and de-duplication overhead and finally achieve a shorter backup window than existing approaches. Our experimental results with real world datasets show that SAM not only has a higher de-duplication efficiency/overhead ratio than existing solutions, but also shortens the backup window by an average of 38.7%.","PeriodicalId":180554,"journal":{"name":"2010 39th International Conference on Parallel Processing","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132196372","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}