{"title":"Characterizing multi-threaded applications for designing sharing-aware last-level cache replacement policies","authors":"R. Natarajan, Mainak Chaudhuri","doi":"10.1109/IISWC.2013.6704665","DOIUrl":"https://doi.org/10.1109/IISWC.2013.6704665","url":null,"abstract":"Recent years have seen a large volume of proposals on managing the shared last-level cache (LLC) of chip-multiprocessors (CMPs). However, most of these proposals primarily focus on reducing the amount of destructive interference between competing independent threads of multi-programmed workloads. While very few of these studies evaluate the proposed policies on shared memory multi-threaded applications, they do not improve constructive cross-thread sharing of data in the LLC. In this paper, we characterize a set of multi-threaded applications drawn from the PARSEC, SPEC OMP, and SPLASH-2 suites with the goal of introducing sharing-awareness in LLC replacement policies. We motivate our characterization study by quantifying the potential contributions of the shared and the private blocks toward the overall volume of the LLC hits in these applications and show that the shared blocks are more important than the private blocks. Next, we characterize the amount of sharing-awareness enjoyed by recent proposals compared to the optimal policy. We design and evaluate a generic oracle that can be used in conjunction with any existing policy to quantify the potential improvement that can come from introducing sharing-awareness. The oracle analysis shows that introducing sharing-awareness reduces the number of LLC misses incurred by the least-recently-used (LRU) policy by 6% and 10% on average for a 4MB and 8MB LLC, respectively. A realistic implementation of this oracle requires the LLC controller to have the capability to accurately predict, at the time a block is filled into the LLC, whether the block will be shared during its residency in the LLC. 
We explore the feasibility of designing such a predictor based on the address of the fill and the program counter of the instruction that triggers the fill. Our sharing behavior predictability study of two history-based fill-time predictors that use block addresses and program counters concludes that achieving acceptable levels of accuracy with such predictors will require other architectural and/or high-level program semantic features that have strong correlations with active sharing phases of the LLC blocks.","PeriodicalId":365868,"journal":{"name":"2013 IEEE International Symposium on Workload Characterization (IISWC)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126824996","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ACE: Abstracting, characterizing and exploiting datacenter power demands","authors":"Di Wang, Chuangang Ren, Sriram Govindan, A. Sivasubramaniam, B. Urgaonkar, A. Kansal, Kushagra Vaid","doi":"10.1109/IISWC.2013.6704669","DOIUrl":"https://doi.org/10.1109/IISWC.2013.6704669","url":null,"abstract":"Peak power management of datacenters has tremendous cost implications. While numerous mechanisms have been proposed to cap power consumption, real datacenter power consumption data is scarce. Prior studies have either used a small set of applications and/or servers, or presented data that is at an aggregate scale from which it is difficult to design and evaluate new and existing optimizations. To address this gap, we collect power measurement data at multiple spatial and fine-grained temporal resolutions from several geo-distributed datacenters of Microsoft corporation over 6 months. We conduct aggregate analysis of this data to study its statistical properties. We find evidence of self-similarity in power demands, statistical multiplexing effects, and correlations with the cooling power that caters to the IT equipment. With workload characterization a key ingredient for systems design and evaluation, we note the importance of better abstractions for capturing power demands, in the form of peaks and valleys. We identify attributes for peaks and valleys, and important correlations across these attributes that can influence the choice and effectiveness of different power capping techniques. We characterize these attributes and their correlations, showing the burstiness of small duration peaks, and the importance of not ignoring the rare but more stringent or long peaks. The correlations between peaks and valleys suggest the need for techniques to aggregate and collectively handle them. With the wide scope of exploitability of such characteristics for power provisioning and optimizations, we illustrate its benefits with two specific case studies. 
The first shows how peaks can be differentially handled based on our peak and valley characterization using existing approaches, rather than a one-size-fits-all solution. The second illustrates a simple capacity provisioning strategy for energy storage using the peak and valley characteristics.","PeriodicalId":365868,"journal":{"name":"2013 IEEE International Symposium on Workload Characterization (IISWC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131032263","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Platform-independent analysis of function-level communication in workloads","authors":"Siddharth Nilakantan, Mark Hempstead","doi":"10.1109/IISWC.2013.6704685","DOIUrl":"https://doi.org/10.1109/IISWC.2013.6704685","url":null,"abstract":"The emergence of many-core and heterogeneous multicore processors has meant that data communication patterns increasingly determine application performance. Microprocessor designers need tools that can extract and represent these producer-consumer relationships for a workload to aid them in a wide range of tasks including hardware-software co-design, software partitioning, and application performance optimization. This paper presents Sigil, a profiling tool that can extract communication patterns within a workload independent of hardware characteristics. We show how our methodology can extract the true costs of communication within a workload by distinguishing between unique, local, and total communication. We describe the implementation and performance of Sigil as well as the results of several case studies.","PeriodicalId":365868,"journal":{"name":"2013 IEEE International Symposium on Workload Characterization (IISWC)","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131881555","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"iBench: Quantifying interference for datacenter applications","authors":"Christina Delimitrou, C. Kozyrakis","doi":"10.1109/IISWC.2013.6704667","DOIUrl":"https://doi.org/10.1109/IISWC.2013.6704667","url":null,"abstract":"Interference between co-scheduled applications is one of the major reasons that modern datacenters (DCs) operate at low utilization. DC operators traditionally side-step interference either by disallowing colocation altogether and providing isolated server instances, or by requiring the users to express resource reservations, which are often exaggerated to counter-balance the unpredictability in the quality of allocated resources. Understanding, reducing and managing interference can significantly impact the manner in which these large-scale systems operate. We present iBench, a novel workload suite that helps quantify the pressure different applications put on various shared resources, and similarly the pressure they can tolerate in these resources. iBench consists of a set of carefully-crafted benchmarks that induce interference of increasing intensity in resources that span the CPU, cache hierarchy, memory, storage and networking subsystems. We first validate the effect that iBench workloads have on performance against a wide spectrum of DC applications. Then, we use iBench to demonstrate the importance of considering interference in a set of challenging problems that range from DC scheduling and server provisioning, to resource-efficient application development and scheduling for heterogeneous CMPs. In all cases quantifying interference with iBench results in significant performance and/or efficiency improvements. 
We plan to release iBench under a free software license.","PeriodicalId":365868,"journal":{"name":"2013 IEEE International Symposium on Workload Characterization (IISWC)","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128595921","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Quantifying the energy cost of data movement in scientific applications","authors":"Gokcen Kestor, R. Gioiosa, D. Kerbyson, A. Hoisie","doi":"10.1109/IISWC.2013.6704670","DOIUrl":"https://doi.org/10.1109/IISWC.2013.6704670","url":null,"abstract":"In the exascale era, the energy cost of moving data across the memory hierarchy is expected to be two orders of magnitude higher than the cost of performing a double-precision floating point operation. Despite its importance, the energy cost of data movement in scientific applications has not been quantitatively evaluated even for current systems.","PeriodicalId":365868,"journal":{"name":"2013 IEEE International Symposium on Workload Characterization (IISWC)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130010369","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On the performance and energy-efficiency of multi-core SIMD CPUs and CUDA-enabled GPUs","authors":"Ronald Duarte, Resit Sendag, F. J. Vetter","doi":"10.1109/IISWC.2013.6704683","DOIUrl":"https://doi.org/10.1109/IISWC.2013.6704683","url":null,"abstract":"This paper explores the performance and energy efficiency of CUDA-enabled GPUs and multi-core SIMD CPUs using a set of kernels and full applications. Our implementations efficiently exploit both SIMD and thread-level parallelism on multi-core CPUs and the computational capabilities of CUDA-enabled GPUs. We discuss general optimization techniques for our CPU-only and CPU-GPU platforms. To fairly study performance and energy-efficiency, we also used two applications which utilize several kernels. Finally, we present an evaluation of the implementation effort required to efficiently utilize multi-core SIMD CPUs and CUDA-enabled GPUs for the benchmarks studied. Our results show that kernel-only performance and energy-efficiency could be misleading when evaluating parallel hardware; therefore, true results must be obtained using full applications. We show that, after all respective optimizations have been made, the best performing and energy-efficient platform varies for different benchmarks. 
Finally, our results show that PPEH (Performance gain Per Effort Hours), our newly introduced metric, can effectively be used to quantify efficiency of implementation effort across different benchmarks and platforms.","PeriodicalId":365868,"journal":{"name":"2013 IEEE International Symposium on Workload Characterization (IISWC)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126282703","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"WiBench: An open source kernel suite for benchmarking wireless systems","authors":"Qi Zheng, Yajing Chen, R. Dreslinski, C. Chakrabarti, A. Anastasopoulos, S. Mahlke, T. Mudge","doi":"10.1109/IISWC.2013.6704678","DOIUrl":"https://doi.org/10.1109/IISWC.2013.6704678","url":null,"abstract":"The rapid growth in the number of mobile devices and the higher data rate requirements of mobile subscribers have made wireless signal processing a key driving application of mobile computing technology. To design better mobile platforms and the supporting wireless infrastructure, it is very important for computer architects and system designers to understand and characterize the performance of existing and upcoming wireless protocols. In this paper, we present a newly developed open-source benchmark suite called WiBench. It consists of a wide range of signal processing kernels used in many mainstream standards such as 802.11, WCDMA and LTE. The kernels include FFT/IFFT, MIMO, channel estimation, channel coding, constellation mapping, etc. Each kernel is a self-contained configurable block which can be tuned to meet the different system requirements. Several standard channel models have also been included to study system performance, such as the bit error rate. The suite also contains an LTE uplink system as a representative example of a wireless system that can be built using these kernels. WiBench is provided in C++ to make it easier for computer architects to profile and analyze the system. We characterize the performance of WiBench to illustrate how it can be used to guide hardware system design. Architectural analyses on each individual kernel and on the entire LTE uplink are performed, indicating the hotspots, available parallelism, and runtime performance. 
Finally, a MATLAB version is also included for debugging purposes.","PeriodicalId":365868,"journal":{"name":"2013 IEEE International Symposium on Workload Characterization (IISWC)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123422532","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance implications of System Management Mode","authors":"Brian Delgado, K. Karavanic","doi":"10.1109/IISWC.2013.6704682","DOIUrl":"https://doi.org/10.1109/IISWC.2013.6704682","url":null,"abstract":"System Management Mode (SMM) is a special x86 processor mode that privileged software such as kernels or hypervisors cannot access or interrupt. Previously, it has been assumed that time spent in SMM would be relatively small and therefore its side effects on privileged software were unimportant; recently, researchers have proposed uses, such as security-related checks, that would greatly increase the amount of runtime spent in this mode. We present the results of a detailed performance study to characterize the performance impacts of SMM, using measurement infrastructure we have developed. Our study includes impact to application, system, and hypervisor. We show there can be clear negative effects from prolonged preemptions. However, if SMM duration is kept within certain ranges, perturbation caused by SMIs may be kept to a minimum.","PeriodicalId":365868,"journal":{"name":"2013 IEEE International Symposium on Workload Characterization (IISWC)","volume":"86 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122562258","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Power and performance of GPU-accelerated systems: A closer look","authors":"Yukitaka Abe, Hiroshi Sasaki, S. Kato, Koji Inoue, M. Edahiro, M. Peres","doi":"10.1109/IISWC.2013.6704675","DOIUrl":"https://doi.org/10.1109/IISWC.2013.6704675","url":null,"abstract":"In this paper, we have presented a characterization of power and performance for GPU-accelerated systems. We selected four different NVIDIA GPUs from three generations of the GPU architecture in order to demonstrate generality of our contribution. One of our findings is that the power efficiency characteristics differ such that the best configuration is not identical between the GPUs. This evidence encourages future work on the management of power and performance for GPU-accelerated systems to benefit from dynamic voltage and frequency scaling. In future work, we plan to develop a dynamic voltage and frequency scaling algorithm for GPU-accelerated systems.","PeriodicalId":365868,"journal":{"name":"2013 IEEE International Symposium on Workload Characterization (IISWC)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116735396","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Characterizing data analysis workloads in data centers","authors":"Zhen Jia, Lei Wang, Jianfeng Zhan, Lixin Zhang, Chunjie Luo","doi":"10.1109/IISWC.2013.6704671","DOIUrl":"https://doi.org/10.1109/IISWC.2013.6704671","url":null,"abstract":"As the amount of data explodes rapidly, more and more corporations are using data centers to make effective decisions and gain a competitive edge. Data analysis applications play a significant role in data centers, and hence it has become increasingly important to understand their behaviors in order to further improve the performance of data center computer systems. In this paper, after investigating the three most important application domains in terms of page views and daily visitors, we choose eleven representative data analysis workloads and characterize their micro-architectural characteristics by using hardware performance counters, in order to understand the impacts and implications of data analysis workloads on the systems equipped with modern superscalar out-of-order processors. Our study on the workloads reveals that data analysis applications share many inherent characteristics, which place them in a different class from desktop (SPEC CPU2006), HPC (HPCC), and service workloads, including traditional server workloads (SPECweb2005) and scale-out service workloads (four among six benchmarks in CloudSuite), and accordingly we give several recommendations for architecture and system optimizations. On the basis of our workload characterization work, we released a benchmark suite named DCBench for typical datacenter workloads, including data analysis and service workloads, with an open-source license on our project home page at http://prof.ict.ac.cn/DCBench. 
We hope that DCBench is helpful for performing architecture and small-to-medium scale system research for datacenter computing.","PeriodicalId":365868,"journal":{"name":"2013 IEEE International Symposium on Workload Characterization (IISWC)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124677160","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}