Intra-Task Parallelism in Automotive Real-Time Systems
Remko van Wagensveld, Tobias Wägemann, Niklas Hehenkamp, Ramin Tavakoli Kolagari, Ulrich Margull, Ralph Mader
DOI: 10.1145/3178442.3178449

Abstract: Many recent Engine Management Systems (EMSs) have multicore processors. This poses new challenges for the developers of those systems, as most of them are not familiar with multicore programming. Additionally, many EMSs have real-time requirements that must be met. This paper introduces embedded parallel design patterns (ePDPs), which help developers solve common problems encountered when parallelizing legacy code for EMSs or other embedded devices. We present a novel ePDP, the Supercore pattern, which reduces the overhead introduced by forking and joining control graphs. To show the effectiveness of this pattern, we simulated and executed it on a real-world EMS and show that it reduces the response time of tasks with real-time requirements. The paper also presents concrete extensions to AUTOSAR and EAST-ADL to enable modelling of the Supercore pattern in automotive modelling standards.
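The fork/join overhead targeted by the Supercore pattern can be illustrated with a minimal sketch (our hypothetical construction, not the paper's implementation): repeatedly creating and joining workers for every batch of work is costly, whereas a persistent pool kept alive across batches removes that per-batch cost.

```python
# Hypothetical illustration of the cost the Supercore pattern avoids:
# re-forking workers per batch vs. one persistent worker pool.
from concurrent.futures import ThreadPoolExecutor

def work(x):
    return x * x

def fork_join_per_batch(batches):
    # Naive: a fresh pool is forked and joined for every batch.
    results = []
    for batch in batches:
        with ThreadPoolExecutor(max_workers=2) as pool:
            results.extend(pool.map(work, batch))
    return results

def persistent_pool(batches):
    # Supercore-style idea: workers stay alive across batches, so the
    # coordinating core only hands out work instead of forking/joining.
    results = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        for batch in batches:
            results.extend(pool.map(work, batch))
    return results

batches = [[1, 2, 3], [4, 5], [6]]
assert fork_join_per_batch(batches) == persistent_pool(batches) == [1, 4, 9, 16, 25, 36]
```

Both variants compute the same results; the difference is only in how often worker threads are created and torn down.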
Understanding Parallelization Tradeoffs for Linear Pipelines
Aristeidis Mastoras, T. Gross
DOI: 10.1145/3178442.3178443

Abstract: Pipelining techniques execute some loops with cross-iteration dependences in parallel by partitioning the loop body into a sequence of stages such that the data dependences are not violated. Obtaining good performance for all kinds of loops is challenging, and current techniques, e.g., PS-DSWP and LBPP, have difficulty handling load-imbalanced loops. In particular, for loop iterations that differ substantially in execution time, these techniques achieve load balancing by assigning work to threads using round-robin scheduling. Algorithms that rely on work-stealing, e.g., Piper, handle load-imbalanced loops efficiently, but the high overhead of the scheduler implies poor performance for fine-grained loops. In this paper, we present Proteas, a programming model that allows tradeoffs between load balancing, partitioning, mapping, synchronization, chunking, and scheduling. Proteas provides a set of simple directives to express the different mappings of a loop's parallelism. A source-to-source compiler then generates parallel code to support experimentation with Proteas. The directives allow us to investigate various tradeoffs and achieve performance comparable to PS-DSWP and LBPP. In addition, the directives make a meaningful comparison to Piper possible. We present a performance evaluation on a 32-core system for a set of popular pipelined programs selected from three widely used benchmark suites. The results show the tradeoffs of the different techniques and their parameters. Moreover, they show that efficient handling of load-imbalanced fine-grained loops remains the main challenge.
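The pipelining scheme described above can be sketched in a toy form (our construction; Proteas' actual directives are not shown in the abstract): a sequential stage carries the cross-iteration dependence, and its results are handed round-robin to workers that execute the parallel stage.

```python
# Toy two-stage pipeline: stage S1 is sequential (it carries the
# cross-iteration dependence), stage S2 is independent per iteration
# and is distributed round-robin across worker threads.
import queue
import threading

def run_pipeline(data, n_workers=2):
    qs = [queue.Queue() for _ in range(n_workers)]
    out = [None] * len(data)

    def s2_worker(q):
        while True:
            item = q.get()
            if item is None:
                return
            i, v = item
            out[i] = v * 10                    # stage S2: independent work

    workers = [threading.Thread(target=s2_worker, args=(q,)) for q in qs]
    for w in workers:
        w.start()

    acc = 0
    for i, x in enumerate(data):
        acc += x                               # stage S1: running dependence
        qs[i % n_workers].put((i, acc))        # round-robin scheduling
    for q in qs:
        q.put(None)
    for w in workers:
        w.join()
    return out

assert run_pipeline([1, 2, 3, 4]) == [10, 30, 60, 100]
```

Round-robin assignment keeps scheduling overhead minimal but, as the paper notes, it load-balances poorly when iteration costs vary widely.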
Supporting Fine-grained Dataflow Parallelism in Big Data Systems
Sebastian Ertel, Justus Adam, J. Castrillón
DOI: 10.1145/3178442.3178447

Abstract: Big data systems scale with the number of cores in a cluster for the parts of an application that can be executed in a data-parallel fashion. It has recently been reported, however, that these systems fail to translate hardware improvements, such as increased network bandwidth, into higher throughput. This is particularly the case for applications that have inherently sequential, computationally intensive phases. In this paper, we analyze the data-processing cores of state-of-the-art big data systems to find the cause of these scalability problems. We identify design patterns in the code that are suitable for pipeline and task-level parallelism, potentially increasing application performance. As a proof of concept, we rewrite parts of the Hadoop MapReduce framework in an implicitly parallel language that exploits this parallelism without adding code complexity. Our experiments on a data analytics workload show throughput speedups of up to 3.5x.
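The kind of pipeline parallelism identified above can be sketched generically (a producer/consumer illustration of ours, not Hadoop code): deserializing one record overlaps with processing the previous one through a bounded queue.

```python
# Generic sketch of pipeline parallelism in a record-processing core:
# the "deserialize" stage and the "process" stage overlap in time.
import queue
import threading

def pipeline(records, process):
    q = queue.Queue(maxsize=4)   # bounded hand-off between the stages
    out = []

    def consumer():
        while True:
            rec = q.get()
            if rec is None:
                return
            out.append(process(rec))   # stage 2: process the record

    t = threading.Thread(target=consumer)
    t.start()
    for raw in records:
        q.put(int(raw))                # stage 1: "deserialize" the record
    q.put(None)                        # end-of-stream marker
    t.join()
    return out

assert pipeline(["1", "2", "3"], lambda x: x + 1) == [2, 3, 4]
```

A single consumer preserves record order; adding task-level parallelism would mean running further independent stages on their own threads.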
Reduction to Band Form for the Singular Value Decomposition on Graphics Accelerators
A. Tomás, Rafael Rodríguez-Sánchez, Sandra Catalán, E. S. Quintana-Ortí
DOI: 10.1145/3178442.3178448

Abstract: In this paper we show that two-stage algorithms for the singular value decomposition (SVD) benefit significantly from an alternative first stage that reduces the matrix to an intermediate band form with the same upper and lower bandwidth. This contrasts with the conventional approach, which produces an upper triangular band matrix. Our alternative easily accommodates a look-ahead strategy, with only minor constraints on the relation between the algorithmic block size and the bandwidth, yielding a high-performance implementation on current servers equipped with multicore technology and graphics processors.
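The two-stage structure the paper builds on can be summarized as follows (standard formulation; the notation here is ours, not the paper's):

```latex
% Stage 1 reduces the dense matrix A to a band matrix B via orthogonal
% transforms; stage 2 computes the SVD of B, and the factors combine:
\[
  A = U_1\, B\, V_1^{T}, \qquad
  B = U_2\, \Sigma\, V_2^{T}
  \;\Longrightarrow\;
  A = (U_1 U_2)\, \Sigma\, (V_1 V_2)^{T}.
\]
% The paper's contribution concerns the shape of B: a band matrix with
% equal upper and lower bandwidth, instead of the conventional upper
% triangular band matrix, which eases look-ahead in the first stage.
```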
Fast and Accurate Performance Analysis of Synchronization
Mario Badr, Natalie D. Enright Jerger
DOI: 10.1145/3178442.3178446

Abstract: Understanding parallel-program bottlenecks is critical to designing more efficient and performant parallel architectures. Synchronization is a prime example of a potential bottleneck, but it is a necessary evil when writing parallel programs: we must enforce correct access to shared data. Even the most expert programmers may find synchronization to be a significant overhead in their applications. Techniques to mitigate synchronization overhead include speculative lock elision, faster hardware barriers, and load balancing via dynamic voltage and frequency scaling and thread migration to asymmetric cores. A key insight is that the timing of synchronization events, which is affected not only by the progress of the current thread but also by that of others, is fundamental to an application's performance. To enable a better understanding of multithreaded applications, we propose an analytical model centered on the timing and ordering of synchronization events. Our model allows researchers across the stack to evaluate the performance of applications on future, not-yet-existing systems and architectures. Compared to real hardware, our model estimates performance with an average error of 7.2% across thirteen benchmarks and can generate per-thread performance characteristics in less than a minute on average for very large (native) inputs.
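Why event timing, not just per-thread totals, determines performance can be shown with a minimal model of our own (far simpler than the paper's): under barrier synchronization each interval costs as much as its slowest thread, so total time is the sum of per-interval maxima.

```python
# Minimal barrier-timing model: total runtime is the sum over barrier
# intervals of the slowest thread in that interval, and each thread's
# wait time is how long it idles at each barrier.
def barrier_model(phase_times):
    # phase_times[t][p]: compute time of thread t in barrier interval p
    n_phases = len(phase_times[0])
    total = 0.0
    waits = [0.0] * len(phase_times)
    for p in range(n_phases):
        slowest = max(times[p] for times in phase_times)
        total += slowest                     # all threads leave together
        for t, times in enumerate(phase_times):
            waits[t] += slowest - times[p]   # idle time at this barrier
    return total, waits

# Two threads, two intervals; each thread computes for 6 time units in
# total, yet the run takes 9 because the slow phases alternate.
total, waits = barrier_model([[4, 2], [1, 5]])
assert total == 9
assert waits == [3.0, 3.0]
```

The example makes the ordering effect explicit: summing per-thread work would predict 6 time units, but the interleaving of slow phases yields 9.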
VAIL: A Victim-Aware Cache Policy for Improving Lifetime of Hybrid Memory
Youchuang Jia, Fang Zhou, Xiang Gao, Song Wu, Hai Jin, Xiaofei Liao, Pingpeng Yuan
DOI: 10.1145/3178442.3178451

Abstract: Emerging Non-Volatile Memory (NVM) technologies have been introduced to remedy the shortcomings of current DRAM-based memory systems. However, NVM has limited write endurance, which can severely restrict the performance of the memory system. To relieve this limitation, we propose VAIL, a victim-aware cache policy for DRAM/NVM hybrid memory systems. VAIL takes the eviction locality of victims from the DRAM cache into consideration to reduce writebacks to NVM and improve the DRAM hit ratio at the same time. Our evaluation shows that VAIL reduces writebacks to NVM by 17.2% for single-core workloads and by 22% for multi-core workloads, while improving the DRAM hit ratio of the hybrid memory system.
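Why writebacks dominate NVM wear can be seen in a toy write-back cache model (our illustration; VAIL's actual policy is more involved): only dirty victims cost an NVM write, so any policy that keeps write-hot lines in DRAM longer directly reduces wear.

```python
# Toy write-back DRAM cache in front of NVM with LRU replacement:
# clean victims are simply dropped, dirty victims cost an NVM write.
from collections import OrderedDict

class DramCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()   # addr -> dirty flag, in LRU order
        self.nvm_writebacks = 0

    def access(self, addr, is_write):
        if addr in self.lines:
            self.lines.move_to_end(addr)                    # refresh LRU
            self.lines[addr] = self.lines[addr] or is_write
            return
        if len(self.lines) >= self.capacity:
            victim, dirty = self.lines.popitem(last=False)  # evict LRU
            if dirty:                # only dirty victims hit the NVM
                self.nvm_writebacks += 1
        self.lines[addr] = is_write

cache = DramCache(capacity=2)
for addr, is_write in [(1, True), (2, False), (3, False), (1, False)]:
    cache.access(addr, is_write)
assert cache.nvm_writebacks == 1     # only dirty line 1 was written back
```

In the trace, evicting the dirty line 1 costs an NVM write while the clean line 2 is dropped for free; a victim-aware policy aims to bias eviction toward such free drops.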
Extending ILUPACK with a Task-Parallel Version of BiCG for Dual-GPU Servers
J. Aliaga, M. Bollhöfer, Ernesto Dufrechu, P. Ezzatti, E. S. Quintana-Ortí
DOI: 10.1145/3178442.3178450

Abstract: We target the solution of sparse linear systems via iterative Krylov subspace methods enhanced with the ILUPACK preconditioner on graphics processing units (GPUs). Concretely, in this work we extend ILUPACK with an implementation of the BiCG solver capable of exploiting dual-GPU systems. We leverage the structure of BiCG to execute the main stages of the solver concurrently, and take advantage of the extended memory space to improve the data-access patterns. Experimental results on a server with two NVIDIA K40 GPUs show significant acceleration factors with respect to a previous single-GPU variant.
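The concurrency being exploited can be sketched abstractly (our illustration, not ILUPACK code; plain Python threads stand in for the two GPUs): in each BiCG iteration the products A*p and A^T*p_hat are mutually independent, so they can run on different devices.

```python
# BiCG needs both A*p and A^T*p_hat in every iteration; the two
# products are independent, which makes a dual-device split natural.
import threading

def matvec(A, x):
    # Dense mat-vec on a list-of-rows matrix (a stand-in for the
    # sparse, preconditioned operators in the real solver).
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def concurrent_products(A, p, p_hat):
    out = {}
    t1 = threading.Thread(target=lambda: out.update(Ap=matvec(A, p)))
    t2 = threading.Thread(
        target=lambda: out.update(Atp=matvec(transpose(A), p_hat)))
    t1.start(); t2.start()
    t1.join(); t2.join()
    return out["Ap"], out["Atp"]

A = [[2, 0], [1, 3]]
Ap, Atp = concurrent_products(A, [1, 1], [1, 0])
assert Ap == [2, 4] and Atp == [2, 0]
```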
An Evaluation of Vectorization and Cache Reuse Tradeoffs on Modern CPUs
Du Shen, Milind Chabbi, Xu Liu
DOI: 10.1145/3178442.3178445

Abstract: Emerging high-performance processor architectures show two key trends: longer vector units and deeper memory hierarchies. It is not always possible to exploit both vectorization and locality. Prior optimization techniques have focused on either vectorization for data parallelism or cache reuse for low latency, ignoring the interference between the two. For high performance, software needs either to exploit both or to choose the one that offers larger gains despite the losses incurred by the other. This paper demonstrates that the vectorization-vs-locality tradeoff can be influenced by code shape, working-set size, and architecture. We first devise metrics to precisely classify these tradeoffs. We then design representative microbenchmarks to study the tradeoffs between vectorization and different types of locality on multiple architectures. Based on the insights from our microbenchmark studies, we optimize several important HPC benchmarks on multiple CPU architectures.
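One facet of this tradeoff can be sketched with a tiny trace model (our construction, not the paper's metrics): counting cache-line transitions in an access trace distinguishes a unit-stride traversal, which is both cache- and vector-friendly, from a strided one.

```python
# Count transitions between cache lines along a memory-access trace:
# fewer switches mean better spatial locality, and contiguous runs
# within a line map naturally onto vector loads.
def line_switches(trace, line_size=4):
    lines = [addr // line_size for addr in trace]
    return sum(1 for a, b in zip(lines, lines[1:]) if a != b)

N = 8
row_major = [r * N + c for r in range(N) for c in range(N)]   # unit stride
col_major = [r * N + c for c in range(N) for r in range(N)]   # stride N
assert line_switches(row_major) == 15   # one switch per 4-element line
assert line_switches(col_major) == 63   # a new line on every access
```

The same 64 accesses touch a new cache line 63 times in column order but only 15 times in row order, which is the kind of code-shape effect the paper's metrics quantify.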
Combining PREM compilation and ILP scheduling for high-performance and predictable MPSoC execution
J. Matejka, Björn Forsberg, M. Sojka, Z. Hanzálek, L. Benini, A. Marongiu
DOI: 10.1145/3178442.3178444

Abstract: Many applications require both high performance and predictable timing. High performance can be provided by COTS multi-core systems-on-chip (MPSoCs); however, as the cores in these systems share the memory bandwidth, they are susceptible to interference from each other, which is a problem for timing predictability. We achieve predictability on multi-cores by employing the predictable execution model (PREM), which splits execution into a sequence of memory and compute phases and schedules these such that only a single core executes a memory phase at a time. We present a toolchain consisting of a compiler and an Integer Linear Programming (ILP) scheduling model. Our compiler uses loop analysis and tiling to transform application code into PREM-compliant binaries. Furthermore, we solve the problem of scheduling execution on multiple cores while preventing interference between memory phases. We evaluate our toolchain on an Advanced Driver Assistance Systems (ADAS)-like scenario containing matrix multiplications and FFT computations on an NVIDIA TX1. The results show that our approach maintains similar average performance and reduces the variance of completion times by a factor of 9.
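The phase-splitting idea can be sketched in a few lines (our simplification; the actual toolchain compiles and schedules binaries, not Python): each task first runs a memory phase under mutual exclusion, then a compute phase that touches only local data.

```python
# PREM-style phase split: the memory phase of each task runs under a
# global lock (at most one "core" accesses shared memory at a time),
# while compute phases run freely on private data.
import threading

mem_lock = threading.Lock()
results = {}

def prem_task(tid, data):
    with mem_lock:                       # memory phase: exclusive access
        local = list(data)               # "prefetch" inputs into local buffer
    results[tid] = sum(x * x for x in local)   # compute phase: local only

threads = [threading.Thread(target=prem_task, args=(t, range(t + 3)))
           for t in range(3)]
for th in threads:
    th.start()
for th in threads:
    th.join()
assert results == {0: 5, 1: 14, 2: 30}
```

Serializing only the memory phases is what removes memory interference while still letting the compute phases overlap across cores.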
Proceedings of the 9th International Workshop on Programming Models and Applications for Multicores and Manycores
DOI: 10.1145/3178442
Published: February 24, 2018