{"title":"Deciphering Predictive Schedulers for Heterogeneous-ISA Multicore Architectures","authors":"A. Prodromou, A. Venkat, D. Tullsen","doi":"10.1145/3303084.3309492","DOIUrl":"https://doi.org/10.1145/3303084.3309492","url":null,"abstract":"Heterogeneous architectures have become increasingly common. From co-packaging small and large cores, to GPUs alongside CPUs, to general-purpose heterogeneous-ISA architectures with cores implementing different ISAs. As diversity of execution cores grows, predictive models become of paramount importance for scheduling and resource allocation. In this paper, we investigate the capabilities of performance predictors in a heterogeneous-ISA setting, as well as the predictors' effects on scheduler quality. We follow an unbiased feature selection methodology to identify the optimal set of features for this task, instead of pre-selecting features before training. We propose metrics that bridge the gap between traditional prediction accuracy metrics and a scheduler's performance. We further present our evaluation methodology, which was meticulously designed with this study in mind, and finally, we incorporate our findings in ML-based schedulers and evaluate their sensitivity to the underlying system's level of heterogeneity.","PeriodicalId":408167,"journal":{"name":"Proceedings of the 10th International Workshop on Programming Models and Applications for Multicores and Manycores","volume":"78 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123555887","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LiTM: A Lightweight Deterministic Software Transactional Memory System","authors":"Yuchong Xia, Xiangyao Yu, William S. Moses, Julian Shun, S. Devadas","doi":"10.1145/3303084.3309487","DOIUrl":"https://doi.org/10.1145/3303084.3309487","url":null,"abstract":"Deterministic software transactional memory (STM) is a useful programming model for writing parallel codes, as it improves programmability (by supporting transactions) and debuggability (by supporting determinism). This paper presents LiTM, a new deterministic STM system that achieves both simplicity and efficiency at the same time. LiTM implements the deterministic reservations framework of Blelloch et al., but without requiring the programmer to understand the internals of the algorithm. Instead, the programmer writes the program in a transactional fashion and LiTM manages all data conflicts and automatically achieves deterministic parallelism. Our experiments on six benchmarks show that LiTM outperforms the state-of-the-art framework Galois by up to 5.8× on a 40-core machine.","PeriodicalId":408167,"journal":{"name":"Proceedings of the 10th International Workshop on Programming Models and Applications for Multicores and Manycores","volume":"117 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124120564","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Task-DAG Support in Single-Source PHAST Library: Enabling Flexible Assignment of Tasks to CPUs and GPUs in Heterogeneous Architectures","authors":"Biagio Peccerillo, S. Bartolini","doi":"10.1145/3303084.3309496","DOIUrl":"https://doi.org/10.1145/3303084.3309496","url":null,"abstract":"Nowadays, the majority of desktop, mobile, and embedded devices in the consumer and industrial markets are heterogeneous, as they contain at least multi-core CPU and GPU resources in the same system. However, exploiting the performance and energy-efficiency of these diverse processing elements does not come for free from a software point of view: programmers need to a) code each activity through the specific approaches, libraries, and frameworks suitable for their target architecture (e.g., CPUs and GPUs) along with the orchestration of such heterogeneous execution, and b) decide the distribution of sequential and parallel activities towards the different parallel hardware resources available. Current frameworks typically provide either low-abstraction-level target-specific and/or generic but not high-performance interfaces, which complicate the exploration of different task assignments, with DAG1 precedence relationship, to the available heterogeneous resources. To enable this, tasks would typically need to be coded one time for each target architecture due to the profound differences in their programming. In this work, we include the support of tasks and DAGs of data-parallel tasks within the single-source PHAST library, which currently supports both multi-core CPUs and NVIDIA GPUs, so that tasks are coded in a target-agnostic fashion and their targeting to multi-core or GPU architectures is automatic and efficient. The integration of this coding approach with tasks can help to postpone the choice of the execution platform for each task up to the testing, or even to the runtime, phase. Finally, we demonstrate the effects of this approach in the case of a sample image pipeline benchmark from the computer vision domain. We compare our implementation to a SYCL implementation from a productivity point of view. Also, we show that various task assignments can be seamlessly explored by implementing both the PEFT2 mapping technique along with an exhaustive search in the mapping space.","PeriodicalId":408167,"journal":{"name":"Proceedings of the 10th International Workshop on Programming Models and Applications for Multicores and Manycores","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126501720","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Process Barrier for Predictable and Repeatable Concurrent Execution","authors":"Masataka Nishi","doi":"10.1145/3303084.3309494","DOIUrl":"https://doi.org/10.1145/3303084.3309494","url":null,"abstract":"We study on how to design, debug and verify and validate (V&V) safety-critical control software running on shared-memory many-core platforms. Managing concurrency in a verifiable way is a certification requirement. The presented process barrier is a simple concurrency control mechanism that guarantees deadlock-freedom by-design and temporal separation of tasks, while allowing non-conflicting tasks to run in parallel. It is placed in a lock-free task queue (LFTQ) and a group of processors are allocated to compete to dequeue and execute the tasks registered in the LFTQ. The process barrier consists of a checker and limiter pair. A process that dequeues the checker monitors for completion of preceding tasks in the LFTQ that conflicts with a subsequent task in the LFTQ. The process dequeues the paired limiter from the LFTQ upon completion. All other processes that find the limiter at the head of the LFTQ periodically checks if the head of the LFTQ points to subsequent tasks which happens after the process that took the checker task dequeues the limiter. The mechanism manages concurrent execution of the registered tasks that conflict on data, shared resources and execution order in a way that becomes conflict equivalent to sequential execution. The trace of the concurrent execution and the consequent program state is repeatable. We can reuse existing toolchains for single-core platforms for debugging, testing and V&V. The temporal behavior of the concurrent execution becomes predictable and the worst-case execution time (WCET) of it is bounded.","PeriodicalId":408167,"journal":{"name":"Proceedings of the 10th International Workshop on Programming Models and Applications for Multicores and Manycores","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125494862","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Wait-free Dynamic Transactions for Linked Data Structures","authors":"P. Laborde, Lance Lebanoff, Christina L. Peterson, Deli Zhang, D. Dechev","doi":"10.1145/3303084.3309491","DOIUrl":"https://doi.org/10.1145/3303084.3309491","url":null,"abstract":"Transactional data structures support threads executing a sequence of operations atomically. Dynamic transactions allow operands to be generated on the fly and allows threads to execute code in between the operations of a transaction, in contrast to static transactions which need to know the operands in advance. A framework called Lock-free Transactional Transformation (LFTT) allows data structures to run high-performance transactions, but it only supports static transactions. We extend LFTT to add support for dynamic transactions and wait-free progress while retaining its speed. The thread-helping scheme of LFTT presents a unique challenge to dynamic transactions. We overcome this challenge by changing the input of LFTT from a list of operations to a function, forcing helping threads to always start at the beginning of the transaction, and allowing threads to skip completed operations through the use of a list of return values. We thoroughly evaluate the performance impact of support for dynamic transactions and wait-free progress and find that these features do not hurt the performance of LFTT for our test cases.","PeriodicalId":408167,"journal":{"name":"Proceedings of the 10th International Workshop on Programming Models and Applications for Multicores and Manycores","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117248625","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Don't Forget About Synchronization!: A Case Study of K-Means on GPU","authors":"J. Nelson, R. Palmieri","doi":"10.1145/3303084.3309488","DOIUrl":"https://doi.org/10.1145/3303084.3309488","url":null,"abstract":"Heterogeneous devices are becoming necessary components of high performance computing infrastructures, and the graphics processing unit (GPU) plays an important role in this landscape. Given a problem, the established approach for exploiting the GPU is to design solutions that are parallel, without data or flow dependencies. These solutions are then offloaded to the GPU's massively parallel capability. This design principle (i.e., avoiding contention) often leads to developing applications that cannot maximize GPU hardware utilization. The goal of this paper is to challenge this common belief by empirically showing that allowing even simple forms of synchronization enables programmers to design parallel solutions that admit conflicts and achieve better utilization of hardware parallelism. Our experience shows that lock-based solutions to the k-means clustering problem outperform the well-engineered and parallel KMCUDA on both synthetic and real datasets; averaging 8.4x faster runtimes at high contention and 8.1x faster for low contention, with maximums of 25.4x and 74x, respectively. We summarize our findings by identifying two guidelines to help make concurrency effective when programming GPU applications.","PeriodicalId":408167,"journal":{"name":"Proceedings of the 10th International Workshop on Programming Models and Applications for Multicores and Manycores","volume":"80 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128831203","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Formal Verification through Combinatorial Topology: the CAS-Extended Model","authors":"Christina L. Peterson, D. Dechev","doi":"10.1145/3303084.3309493","DOIUrl":"https://doi.org/10.1145/3303084.3309493","url":null,"abstract":"Wait-freedom guarantees that all processes complete their operations in a finite number of steps regardless of the delay of any process. Combinatorial topology has been proposed in the literature as a formal verification technique to prove the wait-free computability of decision tasks. Wait-freedom is proved through the properties of a static topological structure that expresses all possible combinations of execution paths of the protocol solving the decision task. The practical application of combinatorial topology as a formal verification technique is limited because the existing theory only considers protocols in which the manner of communication between processes is through read-write memory. This research proposes an extension to the existing theory, called the CAS-extended model. The extended theory includes Compare-And-Swap (CAS) and Load-Linked/Store-Conditional (LL/SC) which are atomic primitives used to achieve wait-freedom in state-of-the-art protocols. The CAS-extended model theory can be used to formally verify wait-free algorithms used in practice, such as concurrent data structures. We present new definitions detailing the construction of a protocol complex in the CAS-extended model. As a proof-of-concept, we formally verify a wait-free queue with three processes using the CAS-extended combinatorial topology.","PeriodicalId":408167,"journal":{"name":"Proceedings of the 10th International Workshop on Programming Models and Applications for Multicores and Manycores","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115569594","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proceedings of the 10th International Workshop on Programming Models and Applications for Multicores and Manycores","authors":"","doi":"10.1145/3303084","DOIUrl":"https://doi.org/10.1145/3303084","url":null,"abstract":"","PeriodicalId":408167,"journal":{"name":"Proceedings of the 10th International Workshop on Programming Models and Applications for Multicores and Manycores","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117234256","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}