{"title":"Symbiotic scheduling of concurrent GPU kernels for performance and energy optimizations","authors":"Teng Li, Vikram K. Narayana, T. El-Ghazawi","doi":"10.1145/2597917.2597925","DOIUrl":"https://doi.org/10.1145/2597917.2597925","url":null,"abstract":"The incorporation of GPUs as co-processors has brought forth significant performance improvements for High-Performance Computing (HPC). Efficient utilization of the GPU resources is thus an important consideration for computer scientists. In order to obtain the required performance while limiting the energy consumption, researchers and vendors alike are seeking to apply traditional CPU approaches to the GPU computing domain. For instance, newer NVIDIA GPUs now support concurrent execution of independent kernels as well as Dynamic Voltage and Frequency Scaling (DVFS). Amidst these new developments, we are faced with new opportunities for efficiently scheduling GPU computational kernels under performance and energy constraints. In this paper, we carry out performance and energy optimizations geared towards the execution phases of concurrent kernels in GPU-based computing. When multiple GPU kernels are enqueued for concurrent execution, the sequence in which they are initiated can significantly affect the total execution time and the energy consumption. We attribute this behavior to the relative synergy among kernels that are launched within close proximity of each other. Accordingly, we define metrics for computing the extent to which kernels are symbiotic, by modeling their complementary resource requirements and execution characteristics. We then propose a symbiotic scheduling algorithm to obtain the best possible kernel launch sequence for concurrent execution. Experimental results on the latest NVIDIA K20 GPU demonstrate the efficacy of our proposed algorithm-based approach, by showing near-optimal results within the solution space of both performance and energy consumption. Our further experimental study on DVFS finds that increasing the GPU frequency generally improves both performance and energy savings; the proposed approach thus reduces the need for over-clocking and can be readily adopted by programmers with minimal programming effort and risk.","PeriodicalId":194910,"journal":{"name":"Proceedings of the 11th ACM Conference on Computing Frontiers","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121360905","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic transaction coalescing","authors":"Srdjan Stipic, Vasileios Karakostas, Vesna Smiljkovic, Vladimir Gajinov, O. Unsal, A. Cristal, M. Valero","doi":"10.1145/2597917.2597930","DOIUrl":"https://doi.org/10.1145/2597917.2597930","url":null,"abstract":"Prior work in Software Transactional Memory has identified high overheads related to starting and committing transactions that may degrade application performance. To amortize these overheads, transaction coalescing techniques have been proposed that coalesce two or more small transactions into one large transaction. However, these techniques either coalesce transactions statically at compile time, or lack on-line profiling mechanisms that would allow coalescing transactions dynamically. Thus, such approaches lead to sub-optimal execution, or may even degrade performance. In this paper, we introduce Dynamic Transaction Coalescing (DTC), a compile-time and run-time technique that improves transactional throughput. DTC reduces the overheads of starting and committing a transaction. At compile time, DTC generates several code paths, each with a different number of coalesced transactions. At runtime, DTC performs low-overhead online profiling and dynamically selects the code path that improves throughput. Compared to coalescing transactions statically, DTC provides two main improvements. First, DTC implements online profiling, which removes the dependency on a pre-compilation profiling step. Second, DTC dynamically selects the best transaction granularity to improve transaction throughput, taking the abort rate into consideration. We evaluate DTC using common TM benchmarks and micro-benchmarks. Our findings show that: (i) DTC performs like static transaction coalescing in the common case, (ii) DTC does not suffer from performance degradation, and (iii) DTC outperforms static transaction coalescing when an application exhibits phased behavior.","PeriodicalId":194910,"journal":{"name":"Proceedings of the 11th ACM Conference on Computing Frontiers","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131249044","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TFluxSCC: a case study for exploiting performance in future many-core systems","authors":"Andreas Diavastos, Giannos Stylianou, P. Trancoso","doi":"10.1145/2597917.2597953","DOIUrl":"https://doi.org/10.1145/2597917.2597953","url":null,"abstract":"The number of computational units integrated in a single processor is rapidly increasing. This suggests that applications will require efficient and effective ways to exploit parallelism in order to achieve the performance offered by large-scale multicore processors. Efficient parallelization of applications relies on the programming and execution models. On the one hand, the programming model must address the effort needed to extract parallelism for such processors. On the other hand, the execution model must handle the high levels of parallelism exposed by applications while efficiently exploiting the resources of the processors. In this work we use the Data-Flow model to achieve high levels of parallelism in an effort to scale performance on the 48-core Intel Single-chip Cloud Computing (SCC) processor. We propose TFluxSCC, a software platform for the execution of Data-Flow applications on the Intel SCC processor. TFluxSCC is based on the TFlux Data-Driven Multithreading (DDM) platform that was developed for commodity multicore systems. What we propose in this work is an efficient implementation of the DDM model on a clustered many-core, used as a case study to achieve a high degree of parallelism. With TFluxSCC we achieve scalable performance on a cluster of many simple cores using a global address space, without the need for cache-coherency support. Our scalability study shows that applications can scale, with speedup results ranging from 30x to 48x for 48 cores. The findings of this work provide insight into what a Data-Flow implementation requires from many-core processors, and what it can offer them, in order to scale performance.","PeriodicalId":194910,"journal":{"name":"Proceedings of the 11th ACM Conference on Computing Frontiers","volume":"96 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132332739","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Power availability provisioning in large data centers","authors":"S. Sankar, D. Gauthier, S. Gurumurthi","doi":"10.1145/2597917.2597920","DOIUrl":"https://doi.org/10.1145/2597917.2597920","url":null,"abstract":"Enterprise data centers are provisioned with conservative redundancies built into their power infrastructures to handle failures. Conservative over-provisioning of power capacity for availability reasons results in significant capital investment for large enterprises, because this capacity is designed for failure conditions that do not happen often. On the other hand, under-provisioning this capacity runs the risk of affecting data center performance when failures do happen, through either service unavailability or degraded service performance. Hence, there are interactions and tradeoffs between power capacity utilization, power redundancy, and data center performance that are often overlooked. Our work proposes a provisioning methodology for the power delivery infrastructure, called power availability provisioning, that addresses this challenge. We provide observations on power infrastructure design based on industry experience operating large data centers. We characterize power availability events, motivate the need for workload-driven power availability provisioning, and describe a methodology to estimate the performance impact of power availability events. We then present an unconventional redundancy technique (N-M redundancy) that reduces redundant power equipment, leveraging observations from our study.","PeriodicalId":194910,"journal":{"name":"Proceedings of the 11th ACM Conference on Computing Frontiers","volume":"86 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124960440","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Supporting localized OpenVX kernel execution for efficient computer vision application development on STHORM many-core platform","authors":"Giuseppe Tagliavini, Germain Haugou, L. Benini","doi":"10.1145/2597917.2597947","DOIUrl":"https://doi.org/10.1145/2597917.2597947","url":null,"abstract":"Nowadays, Embedded Computer Vision (ECV) is considered a technology enabler for next-generation killer apps, and the scientific and industrial communities are showing a growing interest in developing applications on high-end embedded systems. Modern many-core accelerators are a promising target for running common ECV algorithms, since their architectural features are particularly suitable in terms of data access patterns and program control flow. In this work we propose a set of software optimization techniques, mainly based on data tiling and local buffering policies, which are specifically targeted at accelerating the execution of OpenVX-based ECV applications by exploiting the memory hierarchy of the STHORM many-core accelerator.","PeriodicalId":194910,"journal":{"name":"Proceedings of the 11th ACM Conference on Computing Frontiers","volume":"284 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122969563","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Object-centric bank partition for reducing memory interference in CMP systems","authors":"Qi Zhong, Jing Wang, Keyi Wang","doi":"10.1145/2597917.2597949","DOIUrl":"https://doi.org/10.1145/2597917.2597949","url":null,"abstract":"This work introduces a novel object-centric bank partition (OBP) to mitigate both inter-thread and intra-thread interference. The key idea is to break the bank-sharing relationship among simultaneously accessed data objects, instead of only focusing on the co-running threads. At sampling runs, we profile each thread to identify the simultaneously accessed objects. At actual runs, using the profiling information, the operating system partitions banks at both the thread and object levels. We have implemented OBP in the Linux 2.6.32 kernel and evaluated its benefits on real machines. Experimental results show that OBP achieves encouraging performance improvements.","PeriodicalId":194910,"journal":{"name":"Proceedings of the 11th ACM Conference on Computing Frontiers","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121600846","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Concurrent page migration for mobile systems with OS-managed hybrid memory","authors":"S. Bock, B. Childers, R. Melhem, D. Mossé","doi":"10.1145/2597917.2597924","DOIUrl":"https://doi.org/10.1145/2597917.2597924","url":null,"abstract":"Mobile systems are executing applications with increasingly large memory footprints on more processor cores. New execution paradigms for quickly suspending and resuming an application have also become common. Energy consumption remains a paramount concern. Consequently, phase-change memory (PCM) has been suggested for main memory to increase capacity, provide non-volatility for suspend/resume and decrease energy consumption. Because it has limitations for writes, a large PCM is often used along with a small DRAM for good performance. The two memory types may be managed by the operating system, which selects where to allocate pages and schedules background migrations between memory types to move data. To ensure correctness, an application that writes to a migrating page must be paused until the migration completes. Because PCM has long write latency, this situation happens frequently in hybrid memory, leading to long pauses that hurt application responsiveness and performance. This paper describes concurrent page migration (CPM) to alleviate the pauses by buffering writes to migrating pages through the last-level cache. CPM improves performance by up to 22% for single-programmed workloads (17% average) and 13% for multi-programmed workloads (8% average). The technique also preserves the energy and non-volatility benefits of hybrid main memory.","PeriodicalId":194910,"journal":{"name":"Proceedings of the 11th ACM Conference on Computing Frontiers","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123032038","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Ultra-low-latency lightweight DMA for tightly coupled multi-core clusters","authors":"D. Rossi, Igor Loi, Germain Haugou, L. Benini","doi":"10.1145/2597917.2597922","DOIUrl":"https://doi.org/10.1145/2597917.2597922","url":null,"abstract":"The evolution of multi- and many-core platforms is rapidly increasing the available on-chip computational capabilities of embedded computing devices, while memory access is dominated by on-chip and off-chip interconnect delays which do not scale well. For this reason, the bottleneck of many applications is rapidly moving from computation to communication. More precisely, performance is often bound by the large latency of direct memory accesses. In this scenario, the challenge is to provide embedded multi- and many-core systems with a powerful, low-latency, energy-efficient and flexible way to move data through the memory hierarchy levels. In this paper, a DMA engine optimized for clustered, tightly coupled many-core systems is presented. The IP features a simple micro-coded programming interface and lock-free per-core command queues to improve flexibility while reducing programming latency. Moreover, it dramatically reduces area and improves energy efficiency with respect to conventional DMAs by exploiting the cluster shared memory as a local repository for data buffers. The proposed DMA engine improves access and programming latency by one order of magnitude, and reduces IP area by 4x and power by 5x with respect to a conventional DMA, while providing full bandwidth to 16 independent logical channels.","PeriodicalId":194910,"journal":{"name":"Proceedings of the 11th ACM Conference on Computing Frontiers","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129147138","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"VALib and SimpleVector: tools for rapid initial research on vector architectures","authors":"Milan Stanic, Oscar Palomar, Ivan Ratković, M. Duric, O. Unsal, A. Cristal","doi":"10.1145/2597917.2597919","DOIUrl":"https://doi.org/10.1145/2597917.2597919","url":null,"abstract":"Vector architectures have been traditionally applied to the supercomputing domain with many successful incarnations. The energy efficiency and high performance of vector processors, as well as their applicability in other emerging domains, encourage pursuing further research on vector architectures. However, there is a lack of appropriate tools to perform this research. This paper presents two tools for measuring and analyzing an application's suitability for vector microarchitectures. The first tool is VALib, a library that enables hand-crafted vectorization of applications; its main purpose is to collect data for detailed instruction-level characterization and to generate input traces for the second tool. The second tool is SimpleVector, a fast trace-driven simulator that is used to estimate the execution time of a vectorized application on a candidate vector microarchitecture. The potential of the tools is demonstrated using six applications from emerging application domains such as speech and face recognition, video encoding, bioinformatics, machine learning and graph search. The results indicate that 63.2% to 91.1% of these contemporary applications are vectorizable. Then, over multiple use cases, we demonstrate that the tools can facilitate rapid evaluation of various vector architecture designs.","PeriodicalId":194910,"journal":{"name":"Proceedings of the 11th ACM Conference on Computing Frontiers","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115962977","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DaSH: a benchmark suite for hybrid dataflow and shared memory programming models: with comparative evaluation of three hybrid dataflow models","authors":"Vladimir Gajinov, Srdjan Stipic, Igor Eric, O. Unsal, E. Ayguadé, A. Cristal","doi":"10.1145/2597917.2597942","DOIUrl":"https://doi.org/10.1145/2597917.2597942","url":null,"abstract":"The current trend in the development of parallel programming models is to combine different well-established models into a single programming model in order to support efficient implementation of a wide range of real-world applications. The dataflow model in particular has recaptured the interest of the research community due to its ability to express parallelism efficiently. Thus, a number of recently proposed hybrid parallel programming models combine dataflow and traditional shared memory. Their findings have influenced the introduction of task dependency in the recently published OpenMP 4.0 standard. In this paper, we present DaSH - the first comprehensive benchmark suite for hybrid dataflow and shared memory programming models. DaSH features 11 benchmarks, each representing one of the Berkeley dwarfs that capture patterns of communication and computation common to a wide range of emerging applications. We also include sequential and shared-memory implementations based on OpenMP and TBB to facilitate easy comparison between hybrid dataflow implementations and traditional shared memory implementations based on work-sharing and/or tasks. Finally, we use DaSH to evaluate three different hybrid dataflow models, identify their advantages and shortcomings, and motivate further research on their characteristics.","PeriodicalId":194910,"journal":{"name":"Proceedings of the 11th ACM Conference on Computing Frontiers","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125876615","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}