Proceedings of the 7th International Workshop on Programming Models and Applications for Multicores and Manycores: Latest Articles

Software-managed Cache Coherence for fast One-Sided Communication

Proceedings of the 7th International Workshop on Programming Models and Applications for Multicores and Manycores Pub Date : 2016-03-12 DOI: 10.1145/2883404.2883409

Steffen Christgau, Bettina Schnor

Citations: 5

Accelerating Dynamic Data Race Detection Using Static Thread Interference Analysis

Proceedings of the 7th International Workshop on Programming Models and Applications for Multicores and Manycores Pub Date : 2016-03-12 DOI: 10.1145/2883404.2883405

Peng Di, Yulei Sui

Abstract: Precise dynamic race detectors report an error if and only if more than one thread concurrently exhibits a conflict on a memory access. They insert instrumentation at compile time to perform runtime checks on all memory accesses, ensuring that all races are captured and no spurious warnings are generated. However, a dynamic race check for a particular memory access statement is guaranteed to be redundant if the statement can be statically identified as thread interference-free. Despite significant recent advances in dynamic detection techniques, redundant checks remain a critical factor behind the prohibitive overhead of dynamic race detection for multithreaded programs. In this paper, we present a new framework that eliminates redundant race checks and speeds up dynamic race detection by performing static optimizations on top of a series of thread interference analysis phases. Our framework is implemented on top of LLVM 3.5.0 and evaluated with TSAN, an industrial dynamic race detector available as part of the LLVM tool chain. Eleven benchmarks from SPLASH2 are used to evaluate the effectiveness of our approach in accelerating TSAN by eliminating redundant interference-free checks. The experimental results show that our approach achieves speedups of 1.4x to 4.0x (2.4x on average) over the original TSAN with 4 threads, and of 1.3x to 4.6x (2.6x on average) with 16 threads.

Citations: 8
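The idea in the abstract of "Accelerating Dynamic Data Race Detection Using Static Thread Interference Analysis", skipping runtime checks for accesses that can be shown statically to be interference-free, can be illustrated with a toy model. The sketch below is not the paper's analysis: the `Access` record, the two safety rules (a location touched by a single thread, or a location that is never written) and the sample data are all assumptions made for illustration. A real analysis would also have to reason about aliasing, locks and happens-before ordering.

```python
from collections import defaultdict
from typing import NamedTuple


class Access(NamedTuple):
    """Hypothetical summary of one memory access statement."""
    thread: str      # thread that may execute the statement
    var: str         # memory location it touches
    is_write: bool   # True for a store, False for a load


def interference_free_vars(accesses):
    """Return locations that cannot race under the two toy rules."""
    threads = defaultdict(set)
    written = set()
    for a in accesses:
        threads[a.var].add(a.thread)
        if a.is_write:
            written.add(a.var)
    # Safe if only one thread touches it, or if nobody ever writes it.
    return {v for v, ts in threads.items() if len(ts) == 1 or v not in written}


def accesses_to_instrument(accesses):
    """Keep only the accesses that still need a dynamic race check."""
    safe = interference_free_vars(accesses)
    return [a for a in accesses if a.var not in safe]


accesses = [
    Access("T1", "local_sum", True),
    Access("T1", "local_sum", False),
    Access("T1", "table", False),
    Access("T2", "table", False),
    Access("T1", "counter", True),
    Access("T2", "counter", True),
]
kept = accesses_to_instrument(accesses)
print(f"{len(kept)} of {len(accesses)} accesses still need a runtime check")
```

In this toy input only the two writes to `counter` remain instrumented; the other four accesses are the kind of redundant check the abstract refers to.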

Flow Driven GPGPU Programming combining Textual and Graphical Programming

Proceedings of the 7th International Workshop on Programming Models and Applications for Multicores and Manycores Pub Date : 2016-03-12 DOI: 10.1145/2883404.2883412

Thomas Hoegg, G. Fiedler, C. Koehler, A. Kolb

Abstract: GPGPUs (general-purpose computation on graphics processing units) have become the most important invention of recent years in computer graphics and the vision domain. Despite improvements to the two main programming platforms, CUDA (Compute Unified Device Architecture) and OpenCL (Open Computing Language), GPGPU programming and development is still a complex, time-consuming and error-prone task. To overcome such problems in general software engineering, the graphical modeling language UML (Unified Modeling Language) was introduced and became the first choice for designing software systems. However, its generic design causes representations of algorithmic problem descriptions to be either limited or too complicated. We present GU-DSL, a novel domain-specific language (DSL) based on Eclipse Xtext and GMF that includes novel modeling concepts (new activity-diagram node types and special language constructs) and adopts and extends class and activity diagrams in textual and graphical form. Furthermore, we present a C++ and OpenCL code generation framework, combined with a heterogeneous C++ GPGPU computing framework, that allows for a smooth connection with our DSL and graphical editors.

Citations: 3

Discovering Pipeline Parallel Patterns in Sequential Legacy C++ Codes

Proceedings of the 7th International Workshop on Programming Models and Applications for Multicores and Manycores Pub Date : 2016-03-12 DOI: 10.1145/2883404.2883411

David del Rio Astorga, M. F. Dolz, Luis Miguel Sánchez, José Daniel García Sánchez

Abstract: Since the free performance lunch of processors is over, parallelism has become the new trend in hardware and architecture design. However, parallel resources deployed in data centers are underused in many cases, given that sequential programming is still deeply rooted in current software development. To address this problem, new methodologies and techniques for parallel programming have been progressively developed. For instance, parallel frameworks offer programming skeletons that allow expressing parallelism and concurrency in applications to better exploit concurrent hardware. Nevertheless, a large portion of production software, coming from a broad range of scientific and industrial areas, still executes sequential legacy code. Given that these software modules contain thousands, or even millions, of lines of code, the effort needed to identify parallel regions is extremely high. To pave the way in this area, this paper presents the Parallel Pattern Analyzer Tool (PPAT), a software component that aids in discovering and annotating parallel patterns in source code, thereby facilitating the transformation of sequential code into parallel code. We evaluate this tool for the special case of parallel pipelines using a series of well-known sequential benchmark suites.

Citations: 6

Multitasking Real-time Embedded GPU Computing Tasks

Proceedings of the 7th International Workshop on Programming Models and Applications for Multicores and Manycores Pub Date : 2016-03-12 DOI: 10.1145/2883404.2883408

Pınar Muyan-Özçelik, John Douglas Owens

Citations: 8

Enhancing Metaheuristic-based Virtual Screening Methods on Massively Parallel and Heterogeneous Systems

Proceedings of the 7th International Workshop on Programming Models and Applications for Multicores and Manycores Pub Date : 2016-03-12 DOI: 10.1145/2883404.2883413

Baldomero Imbernón, J. Cecilia, D. Giménez

Abstract: Molecular docking through Virtual Screening is an optimization problem which can be approached with metaheuristic methods. The interaction between two chemical compounds (typically a protein or receptor and a small molecule or ligand) is measured with computationally very demanding scoring functions and can, moreover, be measured at several spots throughout the receptor. For the simulation of large molecules, it is necessary to scale to large clusters to deal with memory and computational requirements. In this paper, we analyze the current landscape of computation, where massive parallelism and heterogeneity are today the main ingredients of large-scale computing systems, to enhance metaheuristic-based virtual screening methods and thus facilitate the analysis of large molecules. We provide a parallelization strategy aimed at leveraging these features. Our solution finds a good workload balance via dynamic assignment of jobs to heterogeneous resources, which perform independent metaheuristic executions under different molecular interactions. A cooperative scheduling of jobs optimizes the quality of the solution and the overall performance of the simulation, opening a new path for further developments of Virtual Screening methods on contemporary high-performance heterogeneous platforms.

Citations: 8
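The abstract of "Enhancing Metaheuristic-based Virtual Screening Methods on Massively Parallel and Heterogeneous Systems" attributes its workload balance to dynamic assignment of jobs to heterogeneous resources. The following sketch shows why that matters, using a simple simulation rather than the authors' scheduler. The function names, the job costs and the two worker speeds (a slow device and one four times faster) are hypothetical values chosen for illustration.

```python
import heapq


def dynamic_makespan(job_costs, speeds):
    """Simulate workers that pull the next job as soon as they are idle."""
    # Heap of (time at which the worker becomes idle, worker index).
    idle_at = [(0.0, w) for w in range(len(speeds))]
    heapq.heapify(idle_at)
    for cost in job_costs:
        t, w = heapq.heappop(idle_at)          # earliest idle worker
        heapq.heappush(idle_at, (t + cost / speeds[w], w))
    return max(t for t, _ in idle_at)


def static_makespan(job_costs, speeds):
    """Round-robin split decided up front, ignoring worker speed."""
    finish = [0.0] * len(speeds)
    for i, cost in enumerate(job_costs):
        w = i % len(speeds)
        finish[w] += cost / speeds[w]
    return max(finish)


jobs = [4.0] * 12        # twelve equally expensive jobs (assumed)
speeds = [1.0, 4.0]      # slow and fast device (assumed)

print("static :", static_makespan(jobs, speeds))
print("dynamic:", dynamic_makespan(jobs, speeds))
```

With these assumed numbers the static split finishes when the slow worker completes its six jobs, whereas the dynamic policy lets the fast worker absorb most of the queue and halves the total time.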

Embedding Semantics of the Single-Producer/Single-Consumer Lock-Free Queue into a Race Detection Tool

Proceedings of the 7th International Workshop on Programming Models and Applications for Multicores and Manycores Pub Date : 2016-03-12 DOI: 10.1145/2883404.2883406

M. F. Dolz, David del Rio Astorga, Javier Fernández, José Daniel García Sánchez, Félix García Carballeira, M. Danelutto, M. Torquati

Abstract: The rapid progress of multi-/many-core architectures has left data-intensive parallel applications not yet fully suited to obtaining maximum performance. The advent of parallel programming frameworks offering structured patterns has alleviated developers' burden of adapting such applications to parallel platforms. For example, the use of synchronization mechanisms in multithreaded applications is essential on shared-cache multi-core architectures. However, ensuring an appropriate use of their interfaces can be challenging, since different memory models plus instruction reordering at the compiler/processor levels may influence the occurrence of data races. The benefits of race detectors are formidable in this sense; nevertheless, if lock-free data structures with no high-level atomics are used, they may emit false positives. In this paper, we extend the ThreadSanitizer race detection tool to support the semantics of the general Single-Producer/Single-Consumer (SPSC) lock-free parallel queue and to detect benign data races where it was correctly used. To perform our analysis, we leverage the FastFlow SPSC bounded lock-free queue implementation to test our extensions over a set of μ-benchmarks and real applications on a dual-socket Intel Xeon CPU E5-2695 platform. We demonstrate that this approach can reduce the number of data race warning messages by 30% on average.

Citations: 3

Efficient Parallelization of Complex Automotive Systems

Proceedings of the 7th International Workshop on Programming Models and Applications for Multicores and Manycores Pub Date : 2016-03-12 DOI: 10.1145/2883404.2883421

Julian Kienberger, Christian Saad, Stefan Kuntz, B. Bauer

Abstract: As the automotive industry seeks to include more and more features in its vehicles while simultaneously attempting to reduce the number of "Electronic Control Units" (ECUs) that execute the corresponding embedded software, the necessary policy shift towards multi-core technology is in full swing. To eventually exploit the extra processing power, much additional effort is needed to cope with the tremendously increased complexity of such systems. This is largely due to the elaborate parallelization process (partitioning, mapping and scheduling software parts as tasks on different cores), which results in a combinatorial explosion and thus spans a vast search space. Mastering this challenge requires innovative methods and appropriate tools that are specifically designed for the creation of embedded multi-core applications or the migration of legacy software [16]. On the basis of the concept presented in [25], we use the results of its data dependency analysis performed on an "AUTOSAR" model (AUTOSAR system descriptions) to determine advantageous partitions as well as initial task-to-core mappings. Afterwards, the extracted information serves as input for the simulation within an embedded multi-core timing tool suite. Here, the initial solution is evaluated with respect to the fulfillment of basic timing requirements and metrics like cross-core communication rates, average latencies or core workloads. A subsequent optimization process improves the initial solution and enables a comparative assessment. To demonstrate the benefit of this approach, we apply it to two models, a fictional mid-sized one and a real-life complex one, and show the advantage compared to a parallelization process without preceding dependency analysis and initial partition/mapping suggestions.

Citations: 6

An Evaluation of Emerging Many-Core Parallel Programming Models

Proceedings of the 7th International Workshop on Programming Models and Applications for Multicores and Manycores Pub Date : 2016-03-12 DOI: 10.1145/2883404.2883420

Matt Martineau, Simon McIntosh-Smith, M. Boulton, W. Gaudin

Citations: 43

On Guided Installation of Basic Linear Algebra Routines in Nodes with Manycore Components

Proceedings of the 7th International Workshop on Programming Models and Applications for Multicores and Manycores Pub Date : 2016-03-12 DOI: 10.1145/2883404.2883422

Luis-Pedro García, J. Cuenca, Francisco-José Herrera, D. Giménez

Abstract: Computational systems are nowadays composed of basic computational components which share multiprocessors and coprocessors of different types, typically several GPUs or MICs. Software previously developed and optimized for simpler systems needs to be redesigned and re-optimized for these new, more complex systems. The adaptation of auto-tuning techniques for basic linear algebra routines to hybrid multicore+multiGPU and multicore+multiMIC systems is analyzed. The matrix-matrix multiplication kernel, which is optimized for different computational system components through guided experimentation, is studied. The basic matrix-matrix multiplication is, in turn, used inside higher-level routines, which delegate their efficient execution to the optimization of the lower-level routine. Experimental results are satisfactory in different multicore+multiGPU and multicore+multiMIC systems. Thus, the guided search of execution configurations for satisfactory execution times proves to be a useful tool for heterogeneous systems, where the complexity of the system makes correct use of highly efficient routines and libraries difficult.

Citations: 2
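The abstract of "On Guided Installation of Basic Linear Algebra Routines in Nodes with Manycore Components" mentions a guided search of execution configurations but does not describe the search itself. One common way to guide such a search is a local hill climb that times only the neighbours of the current configuration instead of every candidate, and the sketch below shows that technique as a stand-in. The function `guided_search`, the configuration space (percentage of a matrix multiplication sent to an accelerator) and the cost model `modelled_time` are all assumptions. In a real installation `measure` would run and time the kernel, and noisy or non-unimodal timings could trap a plain hill climb in a local minimum.

```python
def guided_search(measure, candidates, start):
    """Hill-climb over an ordered list of configurations.

    measure(c) returns the execution time of configuration c.
    Returns (best configuration, its time, number of measurements).
    """
    cache = {}

    def timed(i):
        # Measure each configuration at most once.
        if i not in cache:
            cache[i] = measure(candidates[i])
        return cache[i]

    i = start
    while True:
        neighbours = [j for j in (i - 1, i + 1) if 0 <= j < len(candidates)]
        if not neighbours:
            return candidates[i], timed(i), len(cache)
        j = min(neighbours, key=timed)
        if timed(j) >= timed(i):        # no neighbour improves: stop
            return candidates[i], timed(i), len(cache)
        i = j


def modelled_time(gpu_share_percent):
    """Assumed cost model: CPU and accelerator work in parallel."""
    cpu = (100 - gpu_share_percent) * 1.0
    # Accelerator is 4x faster per unit but pays a fixed transfer cost.
    gpu = (gpu_share_percent * 0.25 + 5.0) if gpu_share_percent else 0.0
    return max(cpu, gpu)


shares = list(range(0, 101, 10))        # 0%, 10%, ..., 100% on the accelerator
best_share, best_time, measured = guided_search(modelled_time, shares, start=5)
print(best_share, best_time, f"({measured} of {len(shares)} configurations timed)")
```

Under this model the search settles on an 80% accelerator share after timing six of the eleven candidates, which is the saving a guided search offers over exhaustive experimentation.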