Proceedings of the 9th International Workshop on Programming Models and Applications for Multicores and Manycores: Latest Publications

Intra-Task Parallelism in Automotive Real-Time Systems
Remko van Wagensveld, Tobias Wägemann, Niklas Hehenkamp, Ramin Tavakoli Kolagari, Ulrich Margull, Ralph Mader
DOI: 10.1145/3178442.3178449 (published 2018-02-24)
Abstract: Many recent Engine Management Systems (EMSs) have multicore processors. This creates new challenges for the developers of those systems, as most of them are not familiar with multicore programming. Additionally, many EMSs have real-time requirements that must be met. This paper introduces embedded parallel design patterns (ePDPs), which help developers solve common problems encountered when parallelizing legacy code for EMSs or other embedded devices. We present a novel ePDP called the Supercore Pattern, which helps reduce the overhead introduced by forking and joining control graphs. To show the effectiveness of this pattern, we simulated and executed it on a real-world EMS and show that it reduces the response time of tasks with real-time requirements. The paper also presents concrete extensions to AUTOSAR and EAST-ADL to enable modelling of the Supercore Pattern in automotive modelling standards.
Citations: 2
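The fork/join overhead this abstract targets can be illustrated with a toy persistent worker: instead of spawning and joining a thread per parallel region, one long-lived worker is fed work items over a queue. The Supercore Pattern's actual structure is not described in the abstract, so the class name and queue protocol below are illustrative assumptions (in Python rather than embedded C).

```python
import threading
import queue

# A long-lived worker that replaces per-region fork/join with message
# passing. This is a generic illustration of the overhead being avoided,
# not the paper's Supercore Pattern.
class PersistentWorker:
    def __init__(self):
        self.inbox = queue.Queue()
        self.outbox = queue.Queue()
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def _run(self):
        while True:
            fn, arg = self.inbox.get()
            if fn is None:          # shutdown sentinel
                break
            self.outbox.put(fn(arg))

    def submit(self, fn, arg):
        self.inbox.put((fn, arg))
        return self.outbox.get()    # synchronous for simplicity

    def shutdown(self):
        self.inbox.put((None, None))
        self.thread.join()

w = PersistentWorker()
# No thread creation or join per call: the worker stays alive.
print([w.submit(lambda x: x + 1, i) for i in range(3)])  # → [1, 2, 3]
w.shutdown()
```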
Understanding Parallelization Tradeoffs for Linear Pipelines
Aristeidis Mastoras, T. Gross
DOI: 10.1145/3178442.3178443 (published 2018-02-24)
Abstract: Pipelining techniques execute some loops with cross-iteration dependences in parallel by partitioning the loop body into a sequence of stages such that the data dependences are not violated. Obtaining good performance for all kinds of loops is challenging, and current techniques, e.g., PS-DSWP and LBPP, have difficulty handling load-imbalanced loops. In particular, for loop iterations that differ substantially in execution time, these techniques achieve load balancing by assigning work to threads using round-robin scheduling. Algorithms that rely on work stealing, e.g., Piper, handle load-imbalanced loops efficiently, but the high overhead of the scheduler implies poor performance for fine-grained loops. In this paper, we present Proteas, a programming model that allows tradeoffs between load balancing, partitioning, mapping, synchronization, chunking, and scheduling. Proteas provides a set of simple directives to express the different mappings that handle a loop's parallelism. A source-to-source compiler then generates parallel code to support experimentation with Proteas. The directives allow us to investigate various tradeoffs and achieve performance comparable to PS-DSWP and LBPP. In addition, the directives make a meaningful comparison with Piper possible. We present a performance evaluation on a 32-core system for a set of popular pipelined programs selected from three widely used benchmark suites. The results show the tradeoffs of the different techniques and their parameters. Moreover, they show that efficient handling of load-imbalanced fine-grained loops remains the main challenge.
Citations: 6
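The stage partitioning described above can be sketched as a minimal two-stage linear pipeline: a stage with a cross-iteration dependence runs sequentially on one worker, while a dependence-free stage consumes its output. This is a generic illustration, not Proteas; the stage bodies are invented.

```python
import threading
import queue

def pipeline(items):
    q = queue.Queue()
    results = []

    def stage1():
        # Sequential stage: the running sum carries a cross-iteration
        # dependence, so it cannot be split across workers.
        acc = 0
        for x in items:
            acc += x
            q.put(acc)
        q.put(None)                 # poison pill ends the pipeline

    def stage2():
        # Independent per-iteration work; could itself be replicated
        # across threads since it has no cross-iteration dependence.
        while True:
            v = q.get()
            if v is None:
                break
            results.append(v * v)

    t1 = threading.Thread(target=stage1)
    t2 = threading.Thread(target=stage2)
    t1.start(); t2.start()
    t1.join(); t2.join()
    return results

print(pipeline([1, 2, 3]))  # squares of the prefix sums: [1, 9, 36]
```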
Supporting Fine-grained Dataflow Parallelism in Big Data Systems
Sebastian Ertel, Justus Adam, J. Castrillón
DOI: 10.1145/3178442.3178447 (published 2018-02-24)
Abstract: Big data systems scale with the number of cores in a cluster for the parts of an application that can be executed in a data-parallel fashion. It has recently been reported, however, that these systems fail to translate hardware improvements, such as increased network bandwidth, into higher throughput. This is particularly the case for applications that have inherently sequential, computationally intensive phases. In this paper, we analyze the data processing cores of state-of-the-art big data systems to find the cause of these scalability problems. We identify design patterns in the code that are suitable for pipeline and task-level parallelism, potentially increasing application performance. As a proof of concept, we rewrite parts of the Hadoop MapReduce framework in an implicit parallel language that exploits this parallelism without adding code complexity. Our experiments on a data analytics workload show throughput speedups of up to 3.5x.
Citations: 6
Reduction to Band Form for the Singular Value Decomposition on Graphics Accelerators
A. Tomás, Rafael Rodríguez-Sánchez, Sandra Catalán, E. S. Quintana‐Ortí
DOI: 10.1145/3178442.3178448 (published 2018-02-24)
Abstract: In this paper we show that two-stage algorithms for the singular value decomposition (SVD) benefit significantly from an alternative reduction that, after the first stage, produces an intermediate band matrix with the same upper and lower bandwidth. This is in contrast with the conventional approach, which produces an upper triangular band matrix. In comparison, our alternative easily accommodates a look-ahead strategy, with minor constraints on the relation between the algorithmic block size and the bandwidth, yielding a high-performance implementation on current servers equipped with multicore technology and graphics processors.
Citations: 1
Fast and Accurate Performance Analysis of Synchronization
Mario Badr, Natalie D. Enright Jerger
DOI: 10.1145/3178442.3178446 (published 2018-02-24)
Abstract: Understanding parallel program bottlenecks is critical to designing more efficient and performant parallel architectures. Synchronization is a prime example of a potential bottleneck, but it is a necessary evil when writing parallel programs: we must enforce correct access to shared data. Even the most expert programmers may find synchronization to be a significant overhead in their applications. Techniques to mitigate synchronization overhead include speculative lock elision, faster hardware barriers, and load balancing via dynamic voltage and frequency scaling and thread migration to asymmetric cores. A key insight is that the timing of synchronization events, impacted not only by the progress of the current thread but also by that of others, is fundamental to an application's performance. To enable a better understanding of multithreaded applications, we propose an analytical model centered around the timing and ordering of synchronization events. Our model allows researchers across the stack to evaluate the performance of applications on future, nonexistent systems and architectures. Compared to real hardware, our model estimates performance with an average error of 7.2% across thirteen benchmarks and can generate per-thread performance characteristics in less than a minute on average for very large (native) inputs.
Citations: 2
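The key insight above, that the timing of synchronization events drives performance, can be sketched for the barrier case: each interval between barriers lasts as long as its slowest thread. The function below is a generic illustration of that idea, not the authors' model.

```python
# Toy analytical estimate for barrier-synchronized code: the runtime of
# each inter-barrier interval is the max over threads, and intervals are
# serialized by the barriers, so the estimate is the sum of those maxima.
def barrier_runtime(per_thread_intervals):
    """per_thread_intervals[t][i] = compute time of thread t in interval i."""
    n_intervals = len(per_thread_intervals[0])
    return sum(max(thread[i] for thread in per_thread_intervals)
               for i in range(n_intervals))

# Two threads, two barrier intervals. The per-thread totals are 6 and 6,
# but imbalance makes each interval as slow as its laggard: 5 + 4 = 9.
print(barrier_runtime([[5, 1], [2, 4]]))  # → 9
```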
VAIL: A Victim-Aware Cache Policy for Improving Lifetime of Hybrid Memory
Youchuang Jia, Fang Zhou, Xiang Gao, Song Wu, Hai Jin, Xiaofei Liao, Pingpeng Yuan
DOI: 10.1145/3178442.3178451 (published 2018-02-24)
Abstract: Emerging Non-Volatile Memory (NVM) technologies are being introduced to remedy the shortcomings of current DRAM-based memory systems. However, NVM has limited write endurance, which can severely restrict the performance of the memory system. To relieve this limitation, we propose VAIL, a victim-aware cache policy for DRAM/NVM hybrid memory systems. VAIL takes the eviction locality of victims from the DRAM cache into consideration to reduce writebacks to NVM and improve the DRAM hit ratio at the same time. Our evaluation shows that VAIL reduces writebacks to NVM by 17.2% for single-core workloads and by 22% for multi-core workloads while improving the DRAM hit ratio of the hybrid memory system.
Citations: 2
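The abstract does not specify VAIL's eviction policy, but the metric it optimizes, dirty-victim writebacks from the DRAM cache to NVM, can be shown with a plain LRU toy model. All names and the trace below are illustrative; this is the baseline being improved, not VAIL itself.

```python
from collections import OrderedDict

# Toy model of a DRAM cache in front of NVM under plain LRU. It counts
# the two quantities VAIL trades off: DRAM hits and writebacks of dirty
# victims to NVM (each such writeback consumes NVM write endurance).
def run_trace(trace, capacity):
    cache = OrderedDict()              # addr -> dirty flag, in LRU order
    hits = writebacks = 0
    for addr, is_write in trace:
        if addr in cache:
            hits += 1
            cache[addr] = cache[addr] or is_write
            cache.move_to_end(addr)    # refresh LRU position
        else:
            if len(cache) >= capacity:
                _, dirty = cache.popitem(last=False)  # evict LRU victim
                writebacks += dirty    # dirty victim -> one NVM write
            cache[addr] = is_write
    return hits, writebacks

# Writing A, reading B, then touching C evicts dirty A: one NVM writeback.
print(run_trace([("A", True), ("B", False), ("C", False)], capacity=2))
```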
Extending ILUPACK with a Task-Parallel Version of BiCG for Dual-GPU Servers
J. Aliaga, M. Bollhöfer, Ernesto Dufrechu, P. Ezzatti, E. S. Quintana‐Ortí
DOI: 10.1145/3178442.3178450 (published 2018-02-24)
Abstract: We target the solution of sparse linear systems via iterative Krylov subspace methods enhanced with the ILUPACK preconditioner on graphics processing units (GPUs). Concretely, in this work we extend ILUPACK with an implementation of the BiCG solver capable of exploiting dual-GPU systems. We leverage the structure of BiCG to execute the main stages of the solver concurrently, and take advantage of the extended memory space to improve data access patterns. Experimental results on a server with two NVIDIA K40 GPUs show significant acceleration factors with respect to a previous single-GPU variant.
Citations: 1
An Evaluation of Vectorization and Cache Reuse Tradeoffs on Modern CPUs
Du Shen, Milind Chabbi, Xu Liu
DOI: 10.1145/3178442.3178445 (published 2018-02-24)
Abstract: Emerging high-performance processor architectures show two key trends: longer vector units and deeper memory hierarchies. It is not always possible to exploit both vectorization and locality. Prior optimization techniques have focused either on vectorization for data parallelism or on cache reuse for low latency, ignoring the interference between the two. For high performance, software needs to either exploit both or choose the one that offers larger gains despite the losses incurred by the other. This paper demonstrates that the vectorization vs. locality tradeoff can be influenced by code shape, working-set size, and architecture. We first devise metrics to precisely classify these tradeoffs. We then design representative microbenchmarks to study the tradeoffs between vectorization and different types of locality on multiple architectures. Based on what we learn from the microbenchmark studies, we optimize several important HPC benchmarks on multiple CPU architectures.
Citations: 2
Combining PREM compilation and ILP scheduling for high-performance and predictable MPSoC execution
J. Matejka, Björn Forsberg, M. Sojka, Z. Hanzálek, L. Benini, A. Marongiu
DOI: 10.1145/3178442.3178444 (published 2018-02-24)
Abstract: Many applications require both high performance and predictable timing. High performance can be provided by COTS Multi-Core Systems on Chip (MPSoCs); however, as cores in these systems share the memory bandwidth, they are susceptible to interference from each other, which is a problem for timing predictability. We achieve predictability on multicores by employing the predictable execution model (PREM), which splits execution into a sequence of memory and compute phases and schedules these such that only a single core executes a memory phase at a time. We present a toolchain consisting of a compiler and an Integer Linear Programming (ILP) scheduling model. Our compiler uses loop analysis and tiling to transform application code into PREM-compliant binaries. Furthermore, we solve the problem of scheduling execution on multiple cores while preventing interference between memory phases. We evaluate our toolchain on an Advanced-Driver-Assistance-Systems-like scenario containing matrix multiplications and FFT computations on an NVIDIA TX1. The results show that our approach maintains similar average performance and improves the variance of completion times by a factor of 9.
Citations: 25
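The PREM phase discipline described above, at most one core in a memory phase at a time, can be sketched with a mutex standing in for exclusive access to the shared memory bus. This is a loose illustration of the execution model only, not the paper's compiler or ILP scheduler; all names are invented.

```python
import threading

mem_bus = threading.Lock()   # stands in for exclusive memory-bus access
log = []
log_lock = threading.Lock()

def prem_task(name, data):
    # Memory phase: fetch the working set while holding the bus, so no
    # other core's memory phase can interfere.
    with mem_bus:
        local = list(data)
        with log_lock:
            log.append(("mem", name))
    # Compute phase: operates only on local data, generating no bus
    # traffic, so all cores may compute concurrently.
    result = sum(x * x for x in local)
    with log_lock:
        log.append(("compute", name, result))

threads = [threading.Thread(target=prem_task, args=(f"t{i}", range(i + 3)))
           for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(e for e in log if e[0] == "compute"))
```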
Proceedings of the 9th International Workshop on Programming Models and Applications for Multicores and Manycores (front matter)
DOI: 10.1145/3178442
Citations: 0