L. Ceze, C. V. Praun, Calin Cascaval, Pablo Montesinos, J. Torrellas
{"title":"Concurrency control with data coloring","authors":"L. Ceze, C. V. Praun, Calin Cascaval, Pablo Montesinos, J. Torrellas","doi":"10.1145/1353522.1353525","DOIUrl":"https://doi.org/10.1145/1353522.1353525","url":null,"abstract":"Concurrency control is one of the main sources of error and complexity in shared memory parallel programming. While there are several techniques to handle concurrency control such as locks and transactional memory, simplifying concurrency control has proved elusive. In this paper we introduce the Data Coloring programming model, based on the principles of our previous work on architecture support for data-centric synchronization. The main idea is to group data structures into consistency domains and mark places in the control flow where data should be consistent. Based on these annotations, the system dynamically infers transaction boundaries. An important aspect of data coloring is that the occurrence of a synchronization defect is typically determinate and leads to a violation of liveness rather than to a safety violation. Finally, this paper includes empirical data that shows that most of the critical sections in large applications are used in a data-centric manner.","PeriodicalId":130040,"journal":{"name":"Workshop on Memory System Performance and Correctness","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126282401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The potential for variable-granularity access tracking for optimistic parallelism","authors":"Mihai Burcea, J. Gregory Steffan, C. Amza","doi":"10.1145/1353522.1353527","DOIUrl":"https://doi.org/10.1145/1353522.1353527","url":null,"abstract":"Support for optimistic parallelism such as thread-level speculation (TLS) and transactional memory (TM) has been proposed to ease the task of parallelizing software to exploit the new abundance of multicores. A key requirement for such support is the mechanism for tracking memory accesses so that conflicts between speculative threads or transactions can be detected; existing schemes mainly track accesses at a single fixed granularity---i.e., at the word level, cache-line level, or page level. In this paper we demonstrate, for a hardware implementation of TLS and corresponding speculatively-parallelized SpecINT benchmarks, that the coarsest access tracking granularity that does not incur false violations varies significantly across applications, within applications, and across ranges of memory---from word size to page size. These results motivate a variable-granularity approach to access tracking, and we show that such an approach can reduce the number of memory ranges that must be tracked and compared to detect conflicts by an order of magnitude compared to word-level tracking, without increasing false violations. We are currently developing variable-granularity implementations of both a hardware-based TLS system and an STM system.","PeriodicalId":130040,"journal":{"name":"Workshop on Memory System Performance and Correctness","volume":"1429 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132670514","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Y. Smaragdakis, Anthony Kay, R. Behrends, M. Young
{"title":"General and efficient locking without blocking","authors":"Y. Smaragdakis, Anthony Kay, R. Behrends, M. Young","doi":"10.1145/1353522.1353524","DOIUrl":"https://doi.org/10.1145/1353522.1353524","url":null,"abstract":"Standard concurrency control mechanisms offer a trade-off: Transactional memory approaches maximize concurrency, but suffer high overheads and cost for retrying in the case of actual contention. Locking offers lower overheads, but typically reduces concurrency due to the difficulty of associating locks with the exact data that need to be accessed. Moreover, locking allows irreversible operations, is ubiquitous in legacy software, and seems unlikely to ever be completely supplanted. We believe that the trade-off between transactions and (blocking) locks has not been sufficiently exploited to obtain a \"best of both worlds\" mechanism, although the main components have been identified. Mechanisms for converting locks to atomic sections (which can abort and retry) have already been proposed in the literature: Rajwar and Goodman's \"lock elision\" (at the hardware level) and Welc et al.'s hybrid monitors (at the software level) are the best known representatives. Nevertheless, these approaches admit improvements on both the generality and the performance front. In this position paper we present two ideas. First, we discuss an adaptive criterion for switching from a locking to a transactional implementation, and back to a locking implementation if the transactional one appears to be introducing overhead for no gain in concurrency. Second, we discuss the issues arising when locks are nested. Contrary to assertions in past work, transforming locks into transactions can be incorrect in the presence of nesting. We explain the problem and provide a precise condition for safety.","PeriodicalId":130040,"journal":{"name":"Workshop on Memory System Performance and Correctness","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127420130","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reliability-aware data placement for partial memory protection in embedded processors","authors":"M. Mehrara, T. Austin","doi":"10.1145/1178597.1178600","DOIUrl":"https://doi.org/10.1145/1178597.1178600","url":null,"abstract":"Low cost protection of embedded systems against soft errors has recently become a major concern. This issue is even more critical in memory elements that are inherently more prone to transient faults. In this paper, we propose a reliability aware data placement technique in order to partially protect embedded memory systems. We show that by adopting this method instead of traditional placement schemes with complete memory protection, an acceptable level of fault tolerance can be achieved while incurring less area and power overhead. In this approach, each variable in the program is placed in either protected or non-protected memory area according to the profile-driven liveness analysis of all memory variables. In order to measure the level of fault coverage, we inject faults into the memory during the course of program execution in a Monte Carlo simulation framework. Subsequently, we calculate the coverage of partial protection scheme based on the number of protected, failed and crashed runs during the fault injection experiment.","PeriodicalId":130040,"journal":{"name":"Workshop on Memory System Performance and Correctness","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123413034","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"What do high-level memory models mean for transactions?","authors":"D. Grossman, Jeremy Manson, W. Pugh","doi":"10.1145/1178597.1178609","DOIUrl":"https://doi.org/10.1145/1178597.1178609","url":null,"abstract":"Many people have proposed adding transactions, or atomic blocks, to type-safe high-level programming languages. However, researchers have not considered the semantics of transactions with respect to a memory model weaker than sequential consistency. The details of such semantics are more subtle than many people realize, and the interaction between compiler transformations and transactions could produce behaviors that many people find surprising. A language's memory model, which determines these interactions, must clearly indicate which behaviors are legal, and which are not. These design decisions affect both the idioms that are useful for designing concurrent software and the compiler transformations that are legal within the language. Cases where semantics are more subtle than people expect include the actual meaning of both strong and weak atomicity; correct idioms for thread-safe lazy initialization; compiler transformations of transactions that touch only thread-local memory; and whether there is a well-defined notion for transactions that corresponds to the notion of correct and incorrect use of synchronization in Java. Open questions for a high-level memory model that includes transactions involve both issues of isolation and ordering.","PeriodicalId":130040,"journal":{"name":"Workshop on Memory System Performance and Correctness","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126873471","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Keynote talk challenges in chip multiprocessor memory systems","authors":"D. Wood","doi":"10.1145/1178597.1178607","DOIUrl":"https://doi.org/10.1145/1178597.1178607","url":null,"abstract":"The semiconductor industry appears on the brink of an arms race, competing to see which company can cram the most cores on a single die. Yet early entries are hardly well-balanced, general-purpose computers: Sun's Niagara has feeble floating-point performance and IBM's Cell processor is a thinly disguised GPU. Worse, no company has announced anything that will help address the real problem: programming the multithreaded applications needed to exploit the ample computational resources. This talk will discuss several challenges facing the memory system designers of emerging computation-rich chip multiprocessors.","PeriodicalId":130040,"journal":{"name":"Workshop on Memory System Performance and Correctness","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125302378","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jinzhan Peng, Guei-Yuan Lueh, Gansha Wu, Xiaogang Gou, R. Rakvic
{"title":"A comprehensive study of hardware/software approaches to improve TLB performance for java applications on embedded systems","authors":"Jinzhan Peng, Guei-Yuan Lueh, Gansha Wu, Xiaogang Gou, R. Rakvic","doi":"10.1145/1178597.1178614","DOIUrl":"https://doi.org/10.1145/1178597.1178614","url":null,"abstract":"The working set size of Java applications on embedded systems has recently been increasing, causing the Translation Lookaside Buffer (TLB) to become a serious performance bottleneck. From a thorough analysis of the SPECjvm98 benchmark suite executing on a commodity embedded system, we find TLB misses account for 24% to 50% of the total execution time. We explore and evaluate a wide spectrum of TLB-enhancing techniques with different combinations of software/hardware approaches, namely superpage for reducing TLB miss rates, two-level TLB and TLB prefetching for reducing both TLB miss rates and TLB miss latency, and even a no-TLB design for removing TLB overhead completely. We adapt and then in a novel way extend these approaches to fit the design space of embedded systems executing Java code. We compare these approaches, discussing their performance behavior, software/hardware complexity and constraints, especially the design implications for the application, runtime and OS. We first conclude that even with the aggressive approaches presented, there remains a performance bottleneck with the TLB. Second, in addition to facing very different design considerations and constraints for embedded systems, proven hardware techniques, such as TLB prefetching, have different performance implications. Third, software-based solutions, no-TLB design and superpaging, appear to be more effective in improving Java application performance on embedded systems. Finally, beyond performance, these approaches have their respective pros and cons; it is left to the system designer to make the appropriate engineering tradeoff.","PeriodicalId":130040,"journal":{"name":"Workshop on Memory System Performance and Correctness","volume":"79 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129626911","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
S. Kamil, K. Datta, Samuel Williams, L. Oliker, J. Shalf, K. Yelick
{"title":"Implicit and explicit optimizations for stencil computations","authors":"S. Kamil, K. Datta, Samuel Williams, L. Oliker, J. Shalf, K. Yelick","doi":"10.1145/1178597.1178605","DOIUrl":"https://doi.org/10.1145/1178597.1178605","url":null,"abstract":"Stencil-based kernels constitute the core of many scientific applications on block-structured grids. Unfortunately, these codes achieve a low fraction of peak performance, due primarily to the disparity between processor and main memory speeds. We examine several optimizations on both the conventional cache-based memory systems of the Itanium 2, Opteron, and Power5, as well as the heterogeneous multicore design of the Cell processor. The optimizations target cache reuse across stencil sweeps, including both an implicit cache oblivious approach and a cache-aware algorithm blocked to match the cache structure. Finally, we consider stencil computations on a machine with an explicitly-managed memory hierarchy, the Cell processor. Overall, results show that a cache-aware approach is significantly faster than a cache oblivious approach and that the explicitly managed memory on Cell is more efficient: Relative to the Power5, it has almost 2x more memory bandwidth and is 3.7x faster.","PeriodicalId":130040,"journal":{"name":"Workshop on Memory System Performance and Correctness","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121990913","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Atomicity via source-to-source translation","authors":"Benjamin Hindman, D. Grossman","doi":"10.1145/1178597.1178611","DOIUrl":"https://doi.org/10.1145/1178597.1178611","url":null,"abstract":"We present an implementation and evaluation of atomicity (also known as software transactions) for a dialect of Java. Our implementation is fundamentally different from prior work in three respects: (1) It is entirely a source-to-source translation, producing Java source code that can be compiled by any Java compiler and run on any Java Virtual Machine. (2) It can enforce \"strong\" atomicity without assuming special hardware or a uniprocessor. (3) The implementation uses locks rather than optimistic concurrency, but it cannot deadlock and requires inter-thread communication only when there is data contention.","PeriodicalId":130040,"journal":{"name":"Workshop on Memory System Performance and Correctness","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123871248","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Memory models for open-nested transactions","authors":"Kunal Agrawal, C. Leiserson, Jim Sukha","doi":"10.1145/1178597.1178610","DOIUrl":"https://doi.org/10.1145/1178597.1178610","url":null,"abstract":"Open nesting provides a loophole in the strict model of atomic transactions. Moss and Hosking suggested adapting open nesting for transactional memory, and Moss and a group at Stanford have proposed hardware schemes to support open nesting. Since these researchers have described their schemes using only operational definitions, however, the semantics of these systems have not been specified in an implementation-independent way. This paper offers a framework for defining and exploring the memory semantics of open nesting in a transactional-memory setting. Our framework allows us to define the traditional model of serializability and two new transactional-memory models, race freedom and prefix race freedom. The weakest of these memory models, prefix race freedom, closely resembles the Stanford open-nesting model. We prove that these three memory models are equivalent for transactional-memory systems that support only closed nesting, as long as aborted transactions are \"ignored.\" We prove that for systems that support open nesting, however, the models of serializability, race freedom, and prefix race freedom are distinct. We show that the Stanford TM system implements a model at least as strong as prefix race freedom and strictly weaker than race freedom. Thus, their model compromises serializability, the property traditionally used to reason about the correctness of transactions.","PeriodicalId":130040,"journal":{"name":"Workshop on Memory System Performance and Correctness","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129537382","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}