2017 IEEE International Symposium on Workload Characterization (IISWC): Latest Publications

Evaluating energy storage for a multitude of uses in the datacenter
2017 IEEE International Symposium on Workload Characterization (IISWC) Pub Date: 2017-10-01 DOI: 10.1109/IISWC.2017.8167752
Iyswarya Narayanan, Di Wang, A. Mamun, A. Sivasubramaniam, H. Fathy, Sean James
{"title":"Evaluating energy storage for a multitude of uses in the datacenter","authors":"Iyswarya Narayanan, Di Wang, A. Mamun, A. Sivasubramaniam, H. Fathy, Sean James","doi":"10.1109/IISWC.2017.8167752","DOIUrl":"https://doi.org/10.1109/IISWC.2017.8167752","url":null,"abstract":"Datacenters often are a power utility's largest consumers, and are expected to participate in several power management scenarios with diverse characteristics in which Energy Storage Devices (ESDs) are expected to play important roles. Different ESD technologies exist, including little explored technologies such as flow batteries, that offer different performance characteristics in cost, size, and environmental impact. While prior works in datacenter ESD literature have considered one of usage aspect, technology, performance metric (typically cost), the whole three-dimensional space is little explored. Towards understanding this design space, this paper presents first such study towards joint characterization of ESD usages based on their provisioning and operating demands, under ideal and realistic ESD technologies, and quantify their impact on datacenter performance. We expect our work can help datacenter operators to characterize this three-dimensional space in a systematic manner, and make design decisions targeted towards cost-effective and environmental impact aware datacenter energy management.","PeriodicalId":110094,"journal":{"name":"2017 IEEE International Symposium on Workload Characterization (IISWC)","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126058798","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 7
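The (usage, technology, metric) design space discussed in the ESD paper above can be illustrated with a toy enumeration. The sketch below is a minimal illustration under invented assumptions; the usage scenarios, technology parameters, and cost model are placeholders, not data or methodology from the paper.

```python
# Toy sweep of the (usage, ESD technology) space under a simple cost model.
# All numbers are illustrative placeholders, not data from the paper.

USAGES = {               # peak power to shave (kW), energy to buffer (kWh)
    "peak_shaving":        {"power_kw": 100, "energy_kwh": 50},
    "demand_response":     {"power_kw": 60,  "energy_kwh": 120},
    "outage_ride_through": {"power_kw": 200, "energy_kwh": 30},
}

TECHNOLOGIES = {         # cost per kW, cost per kWh, round-trip efficiency
    "lead_acid":    {"usd_per_kw": 300, "usd_per_kwh": 200, "efficiency": 0.80},
    "li_ion":       {"usd_per_kw": 400, "usd_per_kwh": 350, "efficiency": 0.92},
    "flow_battery": {"usd_per_kw": 600, "usd_per_kwh": 150, "efficiency": 0.75},
}

def provisioning_cost(usage, tech):
    """Capital cost = power-rated cost + energy-rated cost, inflated by efficiency losses."""
    energy_needed = usage["energy_kwh"] / tech["efficiency"]
    return usage["power_kw"] * tech["usd_per_kw"] + energy_needed * tech["usd_per_kwh"]

for usage_name, usage in USAGES.items():
    best = min(TECHNOLOGIES.items(), key=lambda kv: provisioning_cost(usage, kv[1]))
    print(f"{usage_name:>20}: cheapest technology = {best[0]} "
          f"(${provisioning_cost(usage, best[1]):,.0f})")
```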
Determining work partitioning on closely coupled heterogeneous computing systems using statistical design of experiments
2017 IEEE International Symposium on Workload Characterization (IISWC) Pub Date: 2017-10-01 DOI: 10.1109/IISWC.2017.8167766
Yectli A. Huerta, Brent A. Swartz, D. Lilja
{"title":"Determining work partitioning on closely coupled heterogeneous computing systems using statistical design of experiments","authors":"Yectli A. Huerta, Brent A. Swartz, D. Lilja","doi":"10.1109/IISWC.2017.8167766","DOIUrl":"https://doi.org/10.1109/IISWC.2017.8167766","url":null,"abstract":"In a closely coupled heterogeneous computing system the work is shared amongst all available computing resources. One challenge is to find an optimal division of work between the two or more very different kinds of processing units, each with their own optimal settings. We show that through the use of statistical techniques, a systematic search of the parameter space can be conducted. These techniques can be applied to variables that are categorical or continuous in nature and do not rely on the standard assumptions of linear models, mainly that the response variable can be described as a linear combination of the regression coefficients. Our search technique, when applied to the HPL benchmark, resulted in a performance gain of 14.5% over previously reported results.","PeriodicalId":110094,"journal":{"name":"2017 IEEE International Symposium on Workload Characterization (IISWC)","volume":"221 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130508671","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 4
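As a rough illustration of the design-of-experiments idea in the entry above, the sketch below runs a full-factorial sweep over a CPU/accelerator work split and a block-size factor, then picks the best-performing combination. The split values, block sizes, and the synthetic performance model are placeholder assumptions, not the paper's factors or results.

```python
# Full-factorial design over two factors: work split ratio and block size.
# The measure_gflops() model is synthetic; in practice each design point
# would be an actual HPL run on the heterogeneous system.
import itertools

SPLITS = [0.5, 0.6, 0.7, 0.8, 0.9]   # fraction of work given to the accelerator
BLOCK_SIZES = [128, 192, 256, 384]   # categorical tuning factor

def measure_gflops(split, block):
    # Placeholder response surface with an interior optimum (illustration only).
    return 1000 * (1 - (split - 0.75) ** 2) - 0.5 * abs(block - 256)

results = {(s, b): measure_gflops(s, b)
           for s, b in itertools.product(SPLITS, BLOCK_SIZES)}
(best_split, best_block), best_perf = max(results.items(), key=lambda kv: kv[1])
print(f"best split={best_split}, block={best_block}: {best_perf:.1f} GFLOP/s")
```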
Congestion-aware memory management on NUMA platforms: A VMware ESXi case study
2017 IEEE International Symposium on Workload Characterization (IISWC) Pub Date: 2017-10-01 DOI: 10.1109/IISWC.2017.8167772
Jagadish B. Kotra, Seongbeom Kim, Kamesh Madduri, M. Kandemir
{"title":"Congestion-aware memory management on NUMA platforms: A VMware ESXi case study","authors":"Jagadish B. Kotra, Seongbeom Kim, Kamesh Madduri, M. Kandemir","doi":"10.1109/IISWC.2017.8167772","DOIUrl":"https://doi.org/10.1109/IISWC.2017.8167772","url":null,"abstract":"He VMware ESXi hypervisor attracts a wide range of customers and is deployed in domains ranging from desktop computing to server computing. While the software systems are increasingly moving towards consolidation, hardware has already transitioned into multi-socket Non-Uniform Memory Access (NUMA)-based systems. The marriage of increasing consolidation and the multi-socket based systems warrants low-overhead, simple and practical mechanisms to detect and address performance bottlenecks, without causing additional contention for shared resources such as performance counters. In this paper, we propose a simple, practical and highly accurate, dynamic memory latency probing mechanism to detect memory congestion in a NUMA system. Using these dynamic probed latencies, we propose congestion-aware memory allocation, congestion-aware memory migration, and a combination of these two techniques. These proposals, evaluated on Intel Westmere (8 nodes) and Intel Haswell (2 nodes) using various workloads, improve the overall performance on an average by 7.2% and 9.5% respectively.","PeriodicalId":110094,"journal":{"name":"2017 IEEE International Symposium on Workload Characterization (IISWC)","volume":"84 19","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120824824","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 6
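A minimal sketch of the congestion-aware placement idea described above: periodically probe per-node memory latency and steer new allocations (or trigger migration) toward the least-congested node. The probe values, threshold, and policy structure are illustrative assumptions, not ESXi internals.

```python
# Toy congestion-aware placement policy driven by probed per-node latencies.
# Latencies come from a stub; a real system would measure them with timed
# loads to per-node probe buffers.
import random

NODES = [0, 1, 2, 3]
MIGRATION_THRESHOLD_NS = 40   # assumed gap that justifies migrating pages

def probe_latency_ns(node):
    # Stand-in for a timed pointer-chase probe of memory on `node`.
    base = {0: 90, 1: 110, 2: 250, 3: 95}[node]   # node 2 is "congested"
    return base + random.uniform(-5, 5)

def pick_allocation_node(latencies):
    """Allocate new memory on the node with the lowest probed latency."""
    return min(latencies, key=latencies.get)

def should_migrate(current_node, latencies):
    """Migrate pages away if the current node is much slower than the best node."""
    best = min(latencies.values())
    return latencies[current_node] - best > MIGRATION_THRESHOLD_NS

latencies = {n: probe_latency_ns(n) for n in NODES}
print("probed latencies (ns):", {n: round(v, 1) for n, v in latencies.items()})
print("allocate on node:", pick_allocation_node(latencies),
      "| migrate away from node 2:", should_migrate(2, latencies))
```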
Work as a team or individual: Characterizing the system-level impacts of main memory partitioning
2017 IEEE International Symposium on Workload Characterization (IISWC) Pub Date: 2017-10-01 DOI: 10.1109/IISWC.2017.8167773
Eojin Lee, Jongwook Chung, Daejin Jung, Sukhan Lee, Sheng Li, Jung Ho Ahn
{"title":"Work as a team or individual: Characterizing the system-level impacts of main memory partitioning","authors":"Eojin Lee, Jongwook Chung, Daejin Jung, Sukhan Lee, Sheng Li, Jung Ho Ahn","doi":"10.1109/IISWC.2017.8167773","DOIUrl":"https://doi.org/10.1109/IISWC.2017.8167773","url":null,"abstract":"Modern multi-core systems employ shared memory architecture, entailing problems related to the main memory such as row-buffer conflicts, time-varying hot-spots across memory channels, and superfluous switches between reads and writes originating from different cores. There have been proposals to solve these problems by partitioning main memory across banks and/or channels such that a DRAM bank is dedicated to a single core, being free from inter-thread row-buffer conflicts. However, those studies either focused on only multi-programmed workloads on which cores operate independently, not cooperatively, or specific hardware configurations with a limited number of degrees of freedom in the number of main memory banks, ranks, and channels. We analyze the influence of memory partitioning on systems with various degrees of banks, ranks, and channels using multi-threaded and multi-programmed workloads, making the following key observations. Bank partitioning is beneficial when memory-intensive applications in a multi-programmed workload have similar characteristics in bank-level parallelism, bandwidth, and capacity demands. Any diversity in these demands with a limited memory capacity greatly diminishes the bank partitioning benefits. As memory access/usage patterns across cores are more easily manageable on multi-threaded workloads, bank partitioning is more often effective with memory intensive multithreaded applications. Channel partitioning becomes effective when the reduction of the negative impacts of time-varying hotspots across memory channels outweighs the load imbalance due to partitioning. We also demonstrate the benefits of rank partitioning with regard to minimizing read-write switches on multi-threaded applications where cores can coordinate memory accesses.","PeriodicalId":110094,"journal":{"name":"2017 IEEE International Symposium on Workload Characterization (IISWC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133604141","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
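To make the bank-partitioning idea above concrete, the sketch below shows one way a physical-address-to-bank mapping could be restricted so that each core owns a disjoint set of banks. The bit layout, frame granularity, and bank/core counts are assumptions for illustration, not the configurations evaluated in the paper.

```python
# Illustrative bank-partitioning address mapping: each core is confined to
# a private subset of DRAM banks, so threads on different cores cannot
# cause row-buffer conflicts in each other's banks.
NUM_BANKS = 16
NUM_CORES = 4
BANKS_PER_CORE = NUM_BANKS // NUM_CORES

def default_bank(phys_addr):
    """Unpartitioned mapping: bank index taken directly from address bits."""
    return (phys_addr >> 12) % NUM_BANKS          # 4 KiB frame granularity

def partitioned_bank(phys_addr, core_id):
    """Partitioned mapping: fold the bank index into the core's private banks."""
    local = (phys_addr >> 12) % BANKS_PER_CORE
    return core_id * BANKS_PER_CORE + local

addr = 0x12345678
for core in range(NUM_CORES):
    print(f"core {core}: addr {addr:#x} -> bank {partitioned_bank(addr, core)} "
          f"(unpartitioned would be bank {default_bank(addr)})")
```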
LORE: A loop repository for the evaluation of compilers
2017 IEEE International Symposium on Workload Characterization (IISWC) Pub Date: 2017-10-01 DOI: 10.1109/IISWC.2017.8167779
Zhi Chen, Zhangxiaowen Gong, J. Szaday, D. Wong, D. Padua, A. Nicolau, A. Veidenbaum, Neftali Watkinson, Zehra Sura, Saeed Maleki, J. Torrellas, G. DeJong
{"title":"LORE: A loop repository for the evaluation of compilers","authors":"Zhi Chen, Zhangxiaowen Gong, J. Szaday, D. Wong, D. Padua, A. Nicolau, A. Veidenbaum, Neftali Watkinson, Zehra Sura, Saeed Maleki, J. Torrellas, G. DeJong","doi":"10.1109/IISWC.2017.8167779","DOIUrl":"https://doi.org/10.1109/IISWC.2017.8167779","url":null,"abstract":"Although numerous loop optimization techniques have been designed and deployed in commercial compilers in the past, virtually no common experimental infrastructure nor repository exists to help the compiler community evaluate the effectiveness of these techniques. This paper describes a repository, LORE, that maintains a large number of C language for loop nests extracted from popular benchmarks, libraries, and real applications. It also describes the infrastructure that builds and maintains the repository. Each loop nest in the repository has been compiled, transformed, executed, and measured independently. These loops cover a variety of properties that can be used by the compiler community to evaluate loop optimizations using a broad and representative collection of loops. To illustrate the usefulness of the repository, we also present two example applications. One is assessing the capabilities of the auto-vectorization features of three widely used compilers. The other is measuring the performance difference of a compiler across different versions. These applications prove that the repository is valuable for identifying the strengths and weaknesses of a compiler and for quantitatively measuring the evolution of a compiler.","PeriodicalId":110094,"journal":{"name":"2017 IEEE International Symposium on Workload Characterization (IISWC)","volume":"99 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129933756","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 10
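One of the example uses named in the LORE abstract, assessing auto-vectorization, can be sketched as a speedup comparison between a loop compiled with and without vectorization. The harness below is a generic illustration, not part of the LORE infrastructure; it assumes gcc is on PATH, and the kernel and flags are placeholders.

```python
# Rough harness for one LORE-style use case: measure the auto-vectorization
# speedup of a single loop nest by compiling it with and without
# vectorization. Assumes gcc is available; the kernel is illustrative.
import os
import subprocess
import tempfile
import time

KERNEL = r"""
#include <stdio.h>
#define N 4096
float a[N], b[N], c[N];
int main(void) {
    for (int r = 0; r < 20000; r++)
        for (int i = 0; i < N; i++)
            c[i] = a[i] * b[i] + c[i];
    printf("%f\n", c[N - 1]);
    return 0;
}
"""

def build_and_time(extra_flags):
    with tempfile.TemporaryDirectory() as d:
        src, exe = os.path.join(d, "loop.c"), os.path.join(d, "loop")
        with open(src, "w") as f:
            f.write(KERNEL)
        subprocess.run(["gcc", "-O2", *extra_flags, src, "-o", exe], check=True)
        start = time.perf_counter()
        subprocess.run([exe], check=True, stdout=subprocess.DEVNULL)
        return time.perf_counter() - start

scalar = build_and_time(["-fno-tree-vectorize"])
vector = build_and_time(["-ftree-vectorize"])
print(f"scalar {scalar:.3f}s, vectorized {vector:.3f}s, "
      f"speedup {scalar / vector:.2f}x")
```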
Exploring the impact of memory block permutation on performance of a crossbar ReRAM main memory
2017 IEEE International Symposium on Workload Characterization (IISWC) Pub Date: 2017-10-01 DOI: 10.1109/IISWC.2017.8167774
M. Ramezani, Nima Elyasi, M. Arjomand, M. Kandemir, A. Sivasubramaniam
{"title":"Exploring the impact of memory block permutation on performance of a crossbar ReRAM main memory","authors":"M. Ramezani, Nima Elyasi, M. Arjomand, M. Kandemir, A. Sivasubramaniam","doi":"10.1109/IISWC.2017.8167774","DOIUrl":"https://doi.org/10.1109/IISWC.2017.8167774","url":null,"abstract":"Owing to the advantages of low standby power and high scalability, ReRAM technology is considered as a promising replacement for conventional DRAM in future manycore systems. In order to make ReRAM highly scalable, the memory array has to have a crossbar array structure, which needs a specific access mechanism for activating a row of memory when reading/writing a data block from/to it. This type of memory access would cause Sneak Current that would lead to voltage drop on the memory cells of the activated row, i.e., the cells which are far from the write drivers experience more voltage drop compared to those close to them. This results in a nonuniform access latency for the cells of the same row. To address this problem, we propose and evaluate a scheme that exploits the non-uniformity of write access pattern of the workloads. More specifically, based on our extensive characterization of write patterns to the cache lines and memory pages of 20 CPU workloads, we recognized that (i) on each main memory access, just a few cache lines of the activated row need to be updated on a write-back, and more importantly, there is a temporal and spatial locality of the writes to the cache lines; and (ii) all pages of the memory footprint of an application do not see the same write counts during the execution of the workload. Motivated by these characteristics, we then evaluate different intra-page memory block permutations in order to improve the performance of a crossbar ReRAM-based main memory. Our results collectively show that, by applying some types of intra-page memory block permutation, the access latency to a ReRAM-based main memory can be reduced up to 50% when running the SPEC CPU2006 workloads.","PeriodicalId":110094,"journal":{"name":"2017 IEEE International Symposium on Workload Characterization (IISWC)","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121504935","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 4
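The intra-page permutation idea above can be illustrated with a toy remapping that places the most frequently written blocks at the positions closest to the write drivers, which in a crossbar suffer the least voltage drop and therefore the lowest latency. The write counts and the linear latency model below are assumptions for illustration, not the paper's characterization data.

```python
# Toy intra-page block permutation: hot (frequently written) blocks are
# remapped to crossbar positions near the write drivers, where the assumed
# per-position write latency is lowest. Numbers are illustrative.
BLOCKS_PER_PAGE = 8
# Assumed write latency (ns) by physical position: grows with distance
# from the write drivers due to sneak-current voltage drop.
POSITION_LATENCY_NS = [50 + 10 * i for i in range(BLOCKS_PER_PAGE)]

# Hypothetical per-block write counts observed for one page.
write_counts = [120, 3, 45, 900, 7, 310, 0, 60]

def build_permutation(counts):
    """Map the hottest logical block to the fastest physical position."""
    hot_order = sorted(range(len(counts)), key=lambda b: counts[b], reverse=True)
    perm = [0] * len(counts)
    for position, logical_block in enumerate(hot_order):
        perm[logical_block] = position
    return perm

def avg_write_latency(counts, perm):
    total_writes = sum(counts) or 1
    return sum(c * POSITION_LATENCY_NS[perm[b]]
               for b, c in enumerate(counts)) / total_writes

identity = list(range(BLOCKS_PER_PAGE))
perm = build_permutation(write_counts)
print("permutation:", perm)
print(f"avg write latency: identity {avg_write_latency(write_counts, identity):.1f} ns, "
      f"permuted {avg_write_latency(write_counts, perm):.1f} ns")
```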
The Microsoft Catapult project
2017 IEEE International Symposium on Workload Characterization (IISWC) Pub Date: 2017-10-01 DOI: 10.1109/IISWC.2017.8167769
Derek Chiou
{"title":"The microsoft catapult project","authors":"Derek Chiou","doi":"10.1109/IISWC.2017.8167769","DOIUrl":"https://doi.org/10.1109/IISWC.2017.8167769","url":null,"abstract":"All new Microsoft Azure and Bing servers are being deployed with an FPGA that sits both between the server and the data center network and on the PCIe bus. The FPGA is currently being used to accelerate networking on Azure machines and search on Bing machines, but could very quickly and easily be retargeted to other uses as needed. In this talk, I will describe how we decided on this architecture, the new data center model it introduces, and the benefits it provides.","PeriodicalId":110094,"journal":{"name":"2017 IEEE International Symposium on Workload Characterization (IISWC)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122851765","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 19
Approximeter: Automatically finding and quantifying code sections for approximation
2017 IEEE International Symposium on Workload Characterization (IISWC) Pub Date: 2017-10-01 DOI: 10.1109/IISWC.2017.8167765
Riad Akram, A. Muzahid
{"title":"Approximeter: Automatically finding and quantifying code sections for approximation","authors":"Riad Akram, A. Muzahid","doi":"10.1109/IISWC.2017.8167765","DOIUrl":"https://doi.org/10.1109/IISWC.2017.8167765","url":null,"abstract":"Approximate computing is getting a lot of traction especially for its potential in improving power, performance, and scalability of a computing system. However, prior work heavily relies upon a programmer to identify code sections where various approximation techniques can be applied. Such an approach is error prone and cannot scale well beyond small applications. In this paper, we contribute with a tool, called Approximeter, to automatically identify and quantify code sections where approximation can be used and to what extant. The tool works by first identifying potential approximable functions and then, injecting errors at appropriate locations. The tool runs Monte Carlo experiments to quantify statistical relation between injected error and corresponding output accuracy. The tool also provides a rough estimate of potential performance gain from approximating a certain function. Finally, it ranks the approximable functions based on their error tolerance and performance gain.","PeriodicalId":110094,"journal":{"name":"2017 IEEE International Symposium on Workload Characterization (IISWC)","volume":"76 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124410563","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 2
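A bare-bones version of the Monte Carlo error-injection experiment described above might look like the sketch below: perturb a candidate function's result with controlled relative error and record how the end-to-end output quality degrades. The target function, error magnitudes, and quality metric are placeholder assumptions, not Approximeter's actual instrumentation.

```python
# Minimal Monte Carlo error-injection experiment: perturb one candidate
# function's output with controlled relative error and measure the impact
# on final output quality. Everything here is an illustrative stand-in.
import random
import statistics

def candidate(x):
    """The function whose approximability is being probed."""
    return x * x + 1.0

def pipeline(xs, inject_rel_err=0.0):
    """End-to-end computation that calls the candidate function."""
    total = 0.0
    for x in xs:
        y = candidate(x)
        y *= 1.0 + random.uniform(-inject_rel_err, inject_rel_err)  # injected error
        total += y
    return total

random.seed(0)
inputs = [random.uniform(0, 10) for _ in range(1000)]
exact = pipeline(inputs)

for rel_err in (0.001, 0.01, 0.05, 0.1):
    trials = [abs(pipeline(inputs, rel_err) - exact) / abs(exact) for _ in range(30)]
    print(f"injected ±{rel_err:<5}: mean output error {statistics.mean(trials):.4%}")
```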
Understanding the performance-accuracy tradeoffs of floating-point arithmetic on GPUs
2017 IEEE International Symposium on Workload Characterization (IISWC) Pub Date: 2017-10-01 DOI: 10.1109/IISWC.2017.8167778
Sruthikesh Surineni, Ruidong Gu, Huyen Nguyen, M. Becchi
{"title":"Understanding the performance-accuracy tradeoffs of floating-point arithmetic on GPUs","authors":"Sruthikesh Surineni, Ruidong Gu, Huyen Nguyen, M. Becchi","doi":"10.1109/IISWC.2017.8167778","DOIUrl":"https://doi.org/10.1109/IISWC.2017.8167778","url":null,"abstract":"Floating-point computations produce approximate results, possibly leading to inaccuracy and reproducibility problems. Existing work addresses two issues: first, the design of high precision floating-point representations; second, the study of methods to trade off accuracy and performance of CPU applications. However, a comprehensive study of the tradeoffs between accuracy and performance on modern GPUs is missing. This study covers the use of different floating-point precisions (i.e., single and double floating-point precision in IEEE 754 standard, GNU Multiple Precision, and composite floating-point precision) on GPU using a variety of synthetic and real-world benchmark applications. First, we analyze the support for single and double precision floating-point arithmetic on different GPU architectures, and we characterize the latencies of all floating-point instructions on GPU. Second, we study the performance/accuracy tradeoffs related to the use of different arithmetic precisions on addition, multiplication, division, and natural exponential function. Third, we analyze the combined use of different arithmetic operations on three benchmark applications characterized by different instruction mixes and arithmetic intensities. As a result of this analysis, we provide insights to guide users to the selection of the arithmetic precision leading to a good performance/accuracy tradeoff depending on the arithmetic operations and mathematical functions used in their program and the degree of multithreading of the code.","PeriodicalId":110094,"journal":{"name":"2017 IEEE International Symposium on Workload Characterization (IISWC)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126783781","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 3
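The single- versus double-precision tradeoff studied in the entry above can be previewed with a quick NumPy experiment: sum a large array in float32 and float64 and compare both the error against a higher-precision reference and the runtime. This is a generic CPU-side illustration, not the paper's GPU methodology.

```python
# Quick illustration of the precision/performance tradeoff: accumulate a
# large sum in float32 vs. float64 and compare error and runtime. This runs
# on the CPU with NumPy; the paper's measurements target GPU arithmetic.
import time
import numpy as np

n = 10_000_000
rng = np.random.default_rng(0)
data64 = rng.random(n)                  # float64 source data
data32 = data64.astype(np.float32)

reference = float(np.sum(data64.astype(np.longdouble)))   # higher-precision reference

for name, arr in (("float32", data32), ("float64", data64)):
    start = time.perf_counter()
    total = float(np.sum(arr))
    elapsed = time.perf_counter() - start
    rel_err = abs(total - reference) / abs(reference)
    print(f"{name}: sum={total:.6f}  rel. error={rel_err:.2e}  time={elapsed * 1000:.1f} ms")
```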
Fine-grained energy profiling for deep convolutional neural networks on the Jetson TX1
2017 IEEE International Symposium on Workload Characterization (IISWC) Pub Date: 2017-10-01 DOI: 10.1109/IISWC.2017.8167764
Crefeda Faviola Rodrigues, G. Riley, M. Luján
{"title":"Fine-grained energy profiling for deep convolutional neural networks on the Jetson TX1","authors":"Crefeda Faviola Rodrigues, G. Riley, M. Luján","doi":"10.1109/IISWC.2017.8167764","DOIUrl":"https://doi.org/10.1109/IISWC.2017.8167764","url":null,"abstract":"Energy-use is a key concern when migrating current deep learning applications onto low power heterogeneous devices such as a mobile device. This is because deep neural networks are typically designed and trained on high-end GPUs or servers and require additional processing steps to deploy them on low power devices. Such steps include the use of compression techniques to scale down the network size or the provision of efficient device-specific software implementations. Migration is further aggravated by the lack of tools and the inability to measure power and performance accurately and consistently across devices. We present a novel evaluation framework for measuring energy and performance for deep neural networks using ARMs Streamline Performance Analyser integrated with standard deep learning frameworks such as Caffe and CuDNNv5. We apply the framework to study the execution behaviour of SqueezeNet on the Maxwell GPU of the NVidia Jetson TX1, on an image classification task (also known as inference) and demonstrate the ability to measure energy of specific layers of the neural network.","PeriodicalId":110094,"journal":{"name":"2017 IEEE International Symposium on Workload Characterization (IISWC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125783017","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 20
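Per-layer energy numbers like those targeted in the entry above are typically obtained by integrating a sampled power trace over each layer's execution interval. The sketch below shows that integration with the trapezoidal rule; the power samples and layer timestamps are made-up values, and the paper's framework relies on ARM Streamline rather than this stub.

```python
# Compute per-layer energy by integrating a sampled power trace over each
# layer's start/end timestamps (trapezoidal rule). Samples and layer
# boundaries below are fabricated for illustration.
def energy_joules(timestamps_s, power_w, t_start, t_end):
    """Integrate power over [t_start, t_end] using the trapezoidal rule."""
    energy = 0.0
    for (t0, p0), (t1, p1) in zip(zip(timestamps_s, power_w),
                                  zip(timestamps_s[1:], power_w[1:])):
        lo, hi = max(t0, t_start), min(t1, t_end)
        if hi <= lo:
            continue
        # Linearly interpolate power at the clipped interval endpoints.
        interp = lambda t: p0 + (p1 - p0) * (t - t0) / (t1 - t0)
        energy += 0.5 * (interp(lo) + interp(hi)) * (hi - lo)
    return energy

# Fabricated 1 kHz power trace (W) and layer boundaries (s).
ts = [i / 1000 for i in range(0, 101)]
power = [3.0 + 0.5 * (i % 10) / 10 for i in range(0, 101)]
layers = {"conv1": (0.000, 0.030), "fire2": (0.030, 0.075), "pool3": (0.075, 0.100)}

for name, (start, end) in layers.items():
    print(f"{name}: {energy_joules(ts, power, start, end) * 1000:.2f} mJ")
```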