{"title":"Symbiotic scheduling of concurrent GPU kernels for performance and energy optimizations","authors":"Teng Li, Vikram K. Narayana, T. El-Ghazawi","doi":"10.1145/2597917.2597925","DOIUrl":"https://doi.org/10.1145/2597917.2597925","url":null,"abstract":"The incorporation of GPUs as co-processors has brought forth significant performance improvements for High-Performance Computing (HPC). Efficient utilization of the GPU resources is thus an important consideration for computer scientists. In order to obtain the required performance while limiting the energy consumption, researchers and vendors alike are seeking to apply traditional CPU approaches to the GPU computing domain. For instance, newer NVIDIA GPUs now support concurrent execution of independent kernels as well as Dynamic Voltage and Frequency Scaling (DVFS). Amidst these new developments, we are faced with new opportunities for efficiently scheduling GPU computational kernels under performance and energy constraints. In this paper, we carry out performance and energy optimizations geared towards the execution phases of concurrent kernels in GPU-based computing. When multiple GPU kernels are enqueued for concurrent execution, the sequence in which they are initiated can significantly affect the total execution time and the energy consumption. We attribute this behavior to the relative synergy among kernels that are launched within close proximity of each other. Accordingly, we define metrics for computing the extent to which kernels are symbiotic, by modeling their complementary resource requirements and execution characteristics. We then propose a symbiotic scheduling algorithm to obtain the best possible kernel launch sequence for concurrent execution. Experimental results on the latest NVIDIA K20 GPU demonstrate the efficacy of our proposed algorithm-based approach, by showing near-optimal results within the solution space of both performance and energy consumption. Our further experimental study on DVFS finds that increasing the GPU frequency generally improves both performance and energy savings; the proposed approach thus reduces the need for over-clocking and can be readily adopted by programmers with minimal programming effort and risk.","PeriodicalId":194910,"journal":{"name":"Proceedings of the 11th ACM Conference on Computing Frontiers","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121360905","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic transaction coalescing","authors":"Srdjan Stipic, Vasileios Karakostas, Vesna Smiljkovic, Vladimir Gajinov, O. Unsal, A. Cristal, M. Valero","doi":"10.1145/2597917.2597930","DOIUrl":"https://doi.org/10.1145/2597917.2597930","url":null,"abstract":"Prior work in Software Transactional Memory has identified high overheads related to starting and committing transactions that may degrade application performance. To amortize these overheads, transaction coalescing techniques have been proposed that coalesce two or more small transactions into one large transaction. However, these techniques either coalesce transactions statically at compile time, or lack on-line profiling mechanisms that would allow coalescing transactions dynamically. Thus, such approaches lead to sub-optimal execution, or may even degrade performance. In this paper, we introduce Dynamic Transaction Coalescing (DTC), a compile-time and run-time technique that improves transactional throughput. DTC reduces the overheads of starting and committing a transaction. At compile time, DTC generates several code paths, each with a different number of coalesced transactions. At runtime, DTC performs low-overhead online profiling and dynamically selects the code path that improves throughput. Compared to coalescing transactions statically, DTC provides two main improvements. First, DTC implements online profiling, which removes the dependency on a pre-compilation profiling step. Second, DTC dynamically selects the best transaction granularity to improve transaction throughput, taking the abort rate into consideration. We evaluate DTC using common TM benchmarks and micro-benchmarks. Our findings show that: (i) DTC performs like static transaction coalescing in the common case, (ii) DTC does not suffer from performance degradation, and (iii) DTC outperforms static transaction coalescing when an application exhibits phased behavior.","PeriodicalId":194910,"journal":{"name":"Proceedings of the 11th ACM Conference on Computing Frontiers","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131249044","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TFluxSCC: a case study for exploiting performance in future many-core systems","authors":"Andreas Diavastos, Giannos Stylianou, P. Trancoso","doi":"10.1145/2597917.2597953","DOIUrl":"https://doi.org/10.1145/2597917.2597953","url":null,"abstract":"The number of computational units integrated in a single processor is rapidly increasing. This suggests that applications will require efficient and effective ways to exploit parallelism in order to achieve the performance offered by large-scale multicore processors. Efficient parallelization of applications relies on the programming and execution models. On the one hand, the programming model must address the effort needed to extract parallelism for such processors. On the other hand, the execution model must handle the high levels of parallelism exposed by applications while efficiently exploiting the resources of the processors. In this work we use the Data-Flow model to achieve high levels of parallelism in an effort to scale performance on the 48-core Intel Single-chip Cloud Computing (SCC) processor. We propose TFluxSCC, a software platform for the execution of Data-Flow applications on the Intel SCC processor. TFluxSCC is based on the TFlux Data-Driven Multithreading (DDM) platform that was developed for commodity multicore systems. What we propose in this work is an efficient implementation of the DDM model on a clustered many-core, used as a case study to achieve a high degree of parallelism. With TFluxSCC we achieve scalable performance on a cluster of many simple cores using a global address space, without the need for cache-coherency support. Our scalability study shows that applications can scale, with speedup results ranging from 30x to 48x for 48 cores. The findings of this work provide insight into what a Data-Flow implementation requires from many-core processors, and what it can offer them, in order to scale performance.","PeriodicalId":194910,"journal":{"name":"Proceedings of the 11th ACM Conference on Computing Frontiers","volume":"96 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132332739","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Power availability provisioning in large data centers","authors":"S. Sankar, D. Gauthier, S. Gurumurthi","doi":"10.1145/2597917.2597920","DOIUrl":"https://doi.org/10.1145/2597917.2597920","url":null,"abstract":"Enterprise data centers are provisioned with conservative redundancies built into their power infrastructures to handle failures. Conservative over-provisioning of power capacity for availability reasons results in significant capital investment for large enterprises, because this capacity is designed for failure conditions that do not happen often. On the other hand, under-provisioning this capacity runs the risk of affecting data center performance when failures do happen, through either service unavailability or degraded service performance. Hence, there are interactions and tradeoffs between power capacity utilization, power redundancy, and data center performance that are often overlooked. Our work proposes a provisioning methodology for the power delivery infrastructure, called power availability provisioning, that addresses this challenge. We provide observations on power infrastructure design based on industry experience operating large data centers. We characterize power availability events, motivate the need for workload-driven power availability provisioning, and describe a methodology to estimate the performance impact of power availability events. We then present an unconventional redundancy technique (N-M redundancy) that reduces redundant power equipment, leveraging observations from our study.","PeriodicalId":194910,"journal":{"name":"Proceedings of the 11th ACM Conference on Computing Frontiers","volume":"86 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124960440","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Supporting localized OpenVX kernel execution for efficient computer vision application development on STHORM many-core platform","authors":"Giuseppe Tagliavini, Germain Haugou, L. Benini","doi":"10.1145/2597917.2597947","DOIUrl":"https://doi.org/10.1145/2597917.2597947","url":null,"abstract":"Nowadays, Embedded Computer Vision (ECV) is considered a technology enabler for next-generation killer apps, and the scientific and industrial communities are showing a growing interest in developing applications on high-end embedded systems. Modern many-core accelerators are a promising target for running common ECV algorithms, since their architectural features are particularly suitable in terms of data access patterns and program control flow. In this work we propose a set of software optimization techniques, mainly based on data tiling and local buffering policies, which are specifically targeted at accelerating the execution of OpenVX-based ECV applications by exploiting the memory hierarchy of the STHORM many-core accelerator.","PeriodicalId":194910,"journal":{"name":"Proceedings of the 11th ACM Conference on Computing Frontiers","volume":"284 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122969563","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Object-centric bank partition for reducing memory interference in CMP systems","authors":"Qi Zhong, Jing Wang, Keyi Wang","doi":"10.1145/2597917.2597949","DOIUrl":"https://doi.org/10.1145/2597917.2597949","url":null,"abstract":"This work introduces a novel object-centric bank partition (OBP) to mitigate both inter-thread and intra-thread interference. The key idea is to break the bank-sharing relationship among simultaneously accessed data objects, instead of only focusing on the co-running threads. At sampling runs, we profile each thread to identify the simultaneously accessed objects. At actual runs, using the profiling information, the operating system partitions banks at both the thread and object levels. We have implemented OBP in the Linux 2.6.32 kernel and evaluated its benefits on real machines. Experimental results show that OBP achieves encouraging performance improvements.","PeriodicalId":194910,"journal":{"name":"Proceedings of the 11th ACM Conference on Computing Frontiers","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121600846","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Concurrent page migration for mobile systems with OS-managed hybrid memory","authors":"S. Bock, B. Childers, R. Melhem, D. Mossé","doi":"10.1145/2597917.2597924","DOIUrl":"https://doi.org/10.1145/2597917.2597924","url":null,"abstract":"Mobile systems are executing applications with increasingly large memory footprints on more processor cores. New execution paradigms for quickly suspending and resuming an application have also become common. Energy consumption remains a paramount concern. Consequently, phase-change memory (PCM) has been suggested for main memory to increase capacity, provide non-volatility for suspend/resume and decrease energy consumption. Because it has limitations for writes, a large PCM is often used along with a small DRAM for good performance. The two memory types may be managed by the operating system, which selects where to allocate pages and schedules background migrations between memory types to move data. To ensure correctness, an application that writes to a migrating page must be paused until the migration completes. Because PCM has long write latency, this situation happens frequently in hybrid memory, leading to long pauses that hurt application responsiveness and performance. This paper describes concurrent page migration (CPM) to alleviate the pauses by buffering writes to migrating pages through the last-level cache. CPM improves performance by up to 22% for single-programmed workloads (17% average) and 13% for multi-programmed workloads (8% average). The technique also preserves the energy and non-volatility benefits of hybrid main memory.","PeriodicalId":194910,"journal":{"name":"Proceedings of the 11th ACM Conference on Computing Frontiers","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123032038","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Ultra-low-latency lightweight DMA for tightly coupled multi-core clusters","authors":"D. Rossi, Igor Loi, Germain Haugou, L. Benini","doi":"10.1145/2597917.2597922","DOIUrl":"https://doi.org/10.1145/2597917.2597922","url":null,"abstract":"The evolution of multi- and many-core platforms is rapidly increasing the available on-chip computational capabilities of embedded computing devices, while memory access is dominated by on-chip and off-chip interconnect delays which do not scale well. For this reason, the bottleneck of many applications is rapidly moving from computation to communication. More precisely, performance is often bound by the large latency of direct memory accesses. In this scenario, the challenge is to provide embedded multi- and many-core systems with a powerful, low-latency, energy-efficient and flexible way to move data through the memory hierarchy levels. In this paper, a DMA engine optimized for clustered, tightly coupled many-core systems is presented. The IP features a simple micro-coded programming interface and lock-free per-core command queues to improve flexibility while reducing programming latency. Moreover, it dramatically reduces area and improves energy efficiency with respect to conventional DMAs by exploiting the cluster shared memory as a local repository for data buffers. The proposed DMA engine improves access and programming latency by one order of magnitude, and reduces IP area by 4x and power by 5x with respect to a conventional DMA, while providing full bandwidth to 16 independent logical channels.","PeriodicalId":194910,"journal":{"name":"Proceedings of the 11th ACM Conference on Computing Frontiers","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129147138","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"VALib and SimpleVector: tools for rapid initial research on vector architectures","authors":"Milan Stanic, Oscar Palomar, Ivan Ratković, M. Duric, O. Unsal, A. Cristal","doi":"10.1145/2597917.2597919","DOIUrl":"https://doi.org/10.1145/2597917.2597919","url":null,"abstract":"Vector architectures have been traditionally applied to the supercomputing domain with many successful incarnations. The energy efficiency and high performance of vector processors, as well as their applicability in other emerging domains, encourage pursuing further research on vector architectures. However, there is a lack of appropriate tools to perform this research. This paper presents two tools for measuring and analyzing an application's suitability for vector microarchitectures. The first tool is VALib, a library that enables hand-crafted vectorization of applications; its main purpose is to collect data for detailed instruction-level characterization and to generate input traces for the second tool. The second tool is SimpleVector, a fast trace-driven simulator that is used to estimate the execution time of a vectorized application on a candidate vector microarchitecture. The potential of the tools is demonstrated using six applications from emerging application domains such as speech and face recognition, video encoding, bioinformatics, machine learning and graph search. The results indicate that 63.2% to 91.1% of these contemporary applications are vectorizable. Then, over multiple use cases, we demonstrate that the tools can facilitate rapid evaluation of various vector architecture designs.","PeriodicalId":194910,"journal":{"name":"Proceedings of the 11th ACM Conference on Computing Frontiers","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115962977","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DaSH: a benchmark suite for hybrid dataflow and shared memory programming models: with comparative evaluation of three hybrid dataflow models","authors":"Vladimir Gajinov, Srdjan Stipic, Igor Eric, O. Unsal, E. Ayguadé, A. Cristal","doi":"10.1145/2597917.2597942","DOIUrl":"https://doi.org/10.1145/2597917.2597942","url":null,"abstract":"The current trend in the development of parallel programming models is to combine different well-established models into a single programming model in order to support efficient implementation of a wide range of real-world applications. The dataflow model in particular has recaptured the interest of the research community due to its ability to express parallelism efficiently. Thus, a number of recently proposed hybrid parallel programming models combine dataflow and traditional shared memory. Their findings have influenced the introduction of task dependency in the recently published OpenMP 4.0 standard. In this paper, we present DaSH - the first comprehensive benchmark suite for hybrid dataflow and shared memory programming models. DaSH features 11 benchmarks, each representing one of the Berkeley dwarfs that capture patterns of communication and computation common to a wide range of emerging applications. We also include sequential and shared-memory implementations based on OpenMP and TBB to facilitate easy comparison between hybrid dataflow implementations and traditional shared memory implementations based on work-sharing and/or tasks. Finally, we use DaSH to evaluate three different hybrid dataflow models, identify their advantages and shortcomings, and motivate further research on their characteristics.","PeriodicalId":194910,"journal":{"name":"Proceedings of the 11th ACM Conference on Computing Frontiers","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125876615","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}