{"title":"Software-managed Cache Coherence for fast One-Sided Communication","authors":"Steffen Christgau, Bettina Schnor","doi":"10.1145/2883404.2883409","DOIUrl":"https://doi.org/10.1145/2883404.2883409","url":null,"abstract":"The ongoing many-core design aims at core counts where cache coherence becomes a serious challenge. Therefore, this paper discusses how one-sided communication can be implemented on a non-cache coherent many-core CPU. The Intel SCC serves as an exemplary hardware architecture. The presented approach is based on software-managed cache coherence for MPI one-sided communication. The prototype implementation delivers a PUT performance of up to five times faster than the default message-based approach and reveals a reduction of the communication costs for the NPB 3D FFT by a factor of five. Further, the paper identifies drawbacks of the SCC's architecture and derives conclusions for future architectures.","PeriodicalId":185841,"journal":{"name":"Proceedings of the 7th International Workshop on Programming Models and Applications for Multicores and Manycores","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134282722","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accelerating Dynamic Data Race Detection Using Static Thread Interference Analysis","authors":"Peng Di, Yulei Sui","doi":"10.1145/2883404.2883405","DOIUrl":"https://doi.org/10.1145/2883404.2883405","url":null,"abstract":"Precise dynamic race detectors report an error if and only if more than one thread concurrently exhibits conflict on a memory access. They insert instrumentations at compile-time to perform runtime checks on all memory accesses to ensure that all races are captured and no spurious warnings are generated. However, a dynamic race check for a particular memory access statement is guaranteed to be redundant if the statement can be statically identified as thread interference-free. Despite significant recent advances in dynamic detection techniques, the redundant check remains a critical factor that leads to prohibitive overhead of dynamic race detection for multithreaded programs. In this paper, we present a new framework that eliminates redundant race checks and boosts dynamic race detection by performing static optimizations on top of a series of thread interference analysis phases. Our framework is implemented on top of LLVM 3.5.0 and evaluated with the industrial dynamic race detector TSAN, which is available as part of the LLVM tool chain. Eleven benchmarks from SPLASH2 are used to evaluate the effectiveness of our approach in accelerating TSAN by eliminating redundant interference-free checks. The experimental results demonstrate that our new approach achieves speedups of 1.4x to 4.0x (2.4x on average) over the original TSAN with 4 threads, and of 1.3x to 4.6x (2.6x on average) with 16 threads.","PeriodicalId":185841,"journal":{"name":"Proceedings of the 7th International Workshop on Programming Models and Applications for Multicores and Manycores","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117289976","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Flow Driven GPGPU Programming combining Textual and Graphical Programming","authors":"Thomas Hoegg, G. Fiedler, C. Koehler, A. Kolb","doi":"10.1145/2883404.2883412","DOIUrl":"https://doi.org/10.1145/2883404.2883412","url":null,"abstract":"GPGPU (General-Purpose Computation on Graphics Processing Units) has become the most important invention of recent years in the computer graphics and vision domains. Despite improvements to the two main programming platforms, CUDA (Compute Unified Device Architecture) and OpenCL (Open Computing Language), GPGPU programming and development is still a complex, time-consuming, and error-prone task. To overcome such problems in general software engineering, the graphical modeling language UML (Unified Modeling Language) was introduced and became the first choice for designing software systems. However, its generic design causes representations of algorithmic problem descriptions to be either limited or too complicated. We present GU-DSL, a novel domain-specific language (DSL), including novel modeling concepts (new activity-diagram node types and special language constructs), based on Eclipse Xtext and GMF, adopting and extending class and activity diagrams in textual and graphical form. Furthermore, we present a C++ and OpenCL code generation framework in combination with a heterogeneous C++ GPGPU computing framework, allowing for a smooth connection with our DSL and graphical editors.","PeriodicalId":185841,"journal":{"name":"Proceedings of the 7th International Workshop on Programming Models and Applications for Multicores and Manycores","volume":"191 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121733863","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Discovering Pipeline Parallel Patterns in Sequential Legacy C++ Codes","authors":"David del Rio Astorga, M. F. Dolz, Luis Miguel Sánchez, José Daniel García Sánchez","doi":"10.1145/2883404.2883411","DOIUrl":"https://doi.org/10.1145/2883404.2883411","url":null,"abstract":"Since the free performance lunch of processors is over, parallelism has become the new trend in hardware and architecture design. However, parallel resources deployed in data centers are underused in many cases, given that sequential programming is still deeply rooted in current software development. To face this problem, new methodologies and techniques for parallel programming have been progressively developed. For instance, parallel frameworks offer programming skeletons that allow expressing parallelism and concurrency in applications to better exploit concurrent hardware. Nevertheless, there remains a large portion of production software, coming from a broad range of scientific and industrial areas, that still executes sequential legacy code. Taking into account that these software modules contain thousands, or even millions, of lines of code, the effort needed to identify parallel regions is extremely high. To pave the way in this area, this paper presents the Parallel Pattern Analyzer Tool (PPAT), a software component that aids in discovering and annotating parallel patterns in source code, hence facilitating the transformation of sequential code into parallel code. We evaluate this tool for the special case of parallel pipelines using a series of well-known sequential benchmark suites.","PeriodicalId":185841,"journal":{"name":"Proceedings of the 7th International Workshop on Programming Models and Applications for Multicores and Manycores","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129164492","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multitasking Real-time Embedded GPU Computing Tasks","authors":"Pınar Muyan-Özçelik, John Douglas Owens","doi":"10.1145/2883404.2883408","DOIUrl":"https://doi.org/10.1145/2883404.2883408","url":null,"abstract":"In this study, we consider the specific characteristics of workloads that involve multiple real-time embedded GPU computing tasks and design several schedulers that use alternative approaches. Then, we compare the performance of schedulers and determine which scheduling approach is more effective for a given workload and why. The major conclusions of this study include: (a) Small kernels benefit from running kernels concurrently. (b) The combination of small kernels, high-priority kernels with longer runtimes, and lower-priority kernels with shorter runtimes benefits from a CPU scheduler that dynamically changes kernel order on the Fermi architecture. (c) Due to limitations of existing GPU architectures, currently CPU schedulers outperform their GPU counterparts. We also highlight the shortcomings of current GPU architectures with regard to running multiple real-time tasks, and recommend new features that would improve scheduling, including hardware priorities, preemption, programmable scheduling, and a common time concept and atomics across the CPU and GPU.","PeriodicalId":185841,"journal":{"name":"Proceedings of the 7th International Workshop on Programming Models and Applications for Multicores and Manycores","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124776939","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing Metaheuristic-based Virtual Screening Methods on Massively Parallel and Heterogeneous Systems","authors":"Baldomero Imbernón, J. Cecilia, D. Giménez","doi":"10.1145/2883404.2883413","DOIUrl":"https://doi.org/10.1145/2883404.2883413","url":null,"abstract":"Molecular docking through Virtual Screening is an optimization problem which can be approached with metaheuristic methods. The interaction between two chemical compounds (typically a protein or receptor and small molecule or ligand) is measured with computationally very demanding scoring functions and can, moreover, be measured at several spots throughout the receptor. For the simulation of large molecules, it is necessary to scale to large clusters to deal with memory and computational requirements. In this paper, we analyze the current landscape of computation, where massive parallelism and heterogeneity are today the main ingredients in large-scale computing systems, to enhance metaheuristic-based virtual screening methods, and thus facilitate the analysis of large molecules. We provide a parallelization strategy aimed at leveraging these features. Our solution finds a good workload balance via dynamic assignment of jobs to heterogeneous resources which perform independent metaheuristic executions under different molecular interactions. A cooperative scheduling of jobs optimizes the quality of the solution and the overall performance of the simulation, so opening a new path for further developments of Virtual Screening methods on high-performance contemporary heterogeneous platforms.","PeriodicalId":185841,"journal":{"name":"Proceedings of the 7th International Workshop on Programming Models and Applications for Multicores and Manycores","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128559508","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Embedding Semantics of the Single-Producer/Single-Consumer Lock-Free Queue into a Race Detection Tool","authors":"M. F. Dolz, David del Rio Astorga, Javier Fernández, José Daniel García Sánchez, Félix García Carballeira, M. Danelutto, M. Torquati","doi":"10.1145/2883404.2883406","DOIUrl":"https://doi.org/10.1145/2883404.2883406","url":null,"abstract":"The rapid progress of multi-/many-core architectures has left data-intensive parallel applications not yet fully suited to achieving maximum performance. The advent of parallel programming frameworks offering structured patterns has alleviated developers' burden of adapting such applications to parallel platforms. For example, the use of synchronization mechanisms in multithreaded applications is essential on shared-cache multi-core architectures. However, ensuring an appropriate use of their interfaces can be challenging, since different memory models plus instruction reordering at compiler/processor levels may influence the occurrence of data races. The benefits of race detectors are formidable in this sense; nevertheless, if lock-free data structures with no high-level atomics are used, they may emit false positives. In this paper, we extend the ThreadSanitizer race detection tool in order to support the semantics of the general Single-Producer/Single-Consumer (SPSC) lock-free parallel queue and to detect benign data races where the queue is correctly used. To perform our analysis, we leverage the FastFlow SPSC bounded lock-free queue implementation to test our extensions over a set of μ-benchmarks and real applications on a dual-socket Intel Xeon CPU E5-2695 platform. We demonstrate that this approach can reduce the number of data race warning messages by 30% on average.","PeriodicalId":185841,"journal":{"name":"Proceedings of the 7th International Workshop on Programming Models and Applications for Multicores and Manycores","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128852891","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Parallelization of Complex Automotive Systems","authors":"Julian Kienberger, Christian Saad, Stefan Kuntz, B. Bauer","doi":"10.1145/2883404.2883421","DOIUrl":"https://doi.org/10.1145/2883404.2883421","url":null,"abstract":"As the automotive industry seeks to include more and more features in its vehicles while simultaneously attempting to reduce the number of \"Electronic Control Units\" (ECUs) that execute the corresponding embedded software, the necessary policy shift towards multi-core technology is in full swing. In order to eventually exploit the extra processing power, there is much additional effort needed for coping with the tremendously increased complexity of such systems. This is largely due to the elaborate parallelization process (partitioning, mapping and scheduling software parts as tasks on different cores) that results in a combinatorial explosion and thus spans a vast search space. Mastering this challenge requires innovative methods and appropriate tools that are specifically designed for the creation of embedded multi-core applications or the migration of legacy software [16]. On the basis of the concept presented in [25], we use the results of its data dependency analysis performed on an \"AUTOSAR\" model (AUTOSAR system descriptions) to determine advantageous partitions as well as initial task-to-core mappings. Afterwards, the extracted information serves as input for the simulation within an embedded multi-core timing tool suite. Here, the initial solution is evaluated with respect to the fulfillment of basic timing requirements and metrics like cross-core communication rates, average latencies or core workloads. A subsequent optimization process improves the initial solution and enables a comparative assessment. In order to demonstrate the benefit of this approach, we apply it to two models -- a fictional mid-sized and a real-life complex one -- and show the advantage compared to a parallelization process without preceding dependency analysis and initial partition/mapping suggestions.","PeriodicalId":185841,"journal":{"name":"Proceedings of the 7th International Workshop on Programming Models and Applications for Multicores and Manycores","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130467595","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Evaluation of Emerging Many-Core Parallel Programming Models","authors":"Matt Martineau, Simon McIntosh-Smith, M. Boulton, W. Gaudin","doi":"10.1145/2883404.2883420","DOIUrl":"https://doi.org/10.1145/2883404.2883420","url":null,"abstract":"In this work we directly evaluate several emerging parallel programming models: Kokkos, RAJA, OpenACC, and OpenMP 4.0, against the mature CUDA and OpenCL APIs. Each model has been used to port TeaLeaf, a miniature proxy application, or mini-app, that solves the heat conduction equation, and belongs to the Mantevo suite of applications. We find that the best performance is achieved with device-tuned implementations but that, in many cases, the performance portable models are able to solve the same problems to within a 5-20% performance penalty. The models expose varying levels of complexity to the developer, and they all present reasonable performance. We believe that complexity will become the major influencer in the long-term adoption of such models.","PeriodicalId":185841,"journal":{"name":"Proceedings of the 7th International Workshop on Programming Models and Applications for Multicores and Manycores","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130086151","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On Guided Installation of Basic Linear Algebra Routines in Nodes with Manycore Components","authors":"Luis-Pedro García, J. Cuenca, Francisco-José Herrera, D. Giménez","doi":"10.1145/2883404.2883422","DOIUrl":"https://doi.org/10.1145/2883404.2883422","url":null,"abstract":"Computational systems are nowadays composed of basic computational components which share multiprocessors and coprocessors of different types, typically several GPUs or MICs. The software previously developed and optimized for simpler systems needs to be redesigned and re-optimized for these new, more complex systems. The adaptation to hybrid multicore+multiGPU and multicore+multiMIC of auto-tuning techniques for basic linear algebra routines is analyzed. The matrix-matrix multiplication kernel, which is optimized for different computational system components through guided experimentation, is studied. The basic matrix-matrix multiplication is, in turn, used inside higher level routines, which delegate their efficient execution to the optimization of the lower level routine. Experimental results are satisfactory in different multicore+multiGPU and multicore+multiMIC systems. So, the guided search of execution configurations for satisfactory execution times proves to be a useful tool for heterogeneous systems, where the complexity of the system means a correct use of highly efficient routines and libraries is difficult.","PeriodicalId":185841,"journal":{"name":"Proceedings of the 7th International Workshop on Programming Models and Applications for Multicores and Manycores","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134040222","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}