{"title":"Revisiting the management control plane in virtualized cloud computing infrastructure","authors":"V. Soundararajan, Lawrence Spracklen","doi":"10.1109/IISWC.2013.6704680","DOIUrl":"https://doi.org/10.1109/IISWC.2013.6704680","url":null,"abstract":"Previous research in virtualization has demonstrated that the management of a virtualized datacenter is a workload by itself, over and above the applications running in the virtualized datacenter. Virtualization has become more prevalent as the backbone of various cloud computing environments, and the workflows used in cloud computing are slightly different from typical datacenter workflows. In this paper, we profile the management workload induced by cloud-computing environments. Specifically, we analyze results from two real-world self-service cloud computing setups. Our results show that using the most recent virtualization techniques for conserving data bandwidth requirements in clouds, the management control plane now becomes a significant limiting factor in deploying cloud resources. We demonstrate that the rate of VM provisioning in clouds demands more aggressive means of performing previously infrequent operations like cloud reconfiguration, and these demands may influence virtualized datacenter design.","PeriodicalId":365868,"journal":{"name":"2013 IEEE International Symposium on Workload Characterization (IISWC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125930863","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dam Sunwoo, William Wang, Mrinmoy Ghosh, Chander Sudanthi, G. Blake, C. D. Emmons, N. Paver
{"title":"A structured approach to the simulation, analysis and characterization of smartphone applications","authors":"Dam Sunwoo, William Wang, Mrinmoy Ghosh, Chander Sudanthi, G. Blake, C. D. Emmons, N. Paver","doi":"10.1109/IISWC.2013.6704677","DOIUrl":"https://doi.org/10.1109/IISWC.2013.6704677","url":null,"abstract":"Full-system simulators are invaluable tools for designing new architectures due to their ability to simulate full applications as well as capture operating system behavior, virtual machine or hypervisor behavior, and interference between concurrently-running applications. However, the systems under investigation and applications under test have become increasingly complicated leading to prohibitively long simulation times for a single experiment. This problem is compounded when many permutations of system design parameters and workloads are tested to investigate system sensitivities and full-system effects with confidence. In this paper, we propose a methodology to tractably explore the processor design space and to characterize applications in a full-system simulation environment. We combine SimPoint, Principal Component Analysis and Fractional Factorial experimental designs to substantially reduce the simulation effort needed to characterize and analyze workloads. We also present a non-invasive user-interface automation tool to allow us to study all types of workloads in a simulation environment. While our methodology is generally applicable to many simulators and workloads, we demonstrate the application of our proposed flow on smartphone applications running on the Android operating system within the gem5 simulation environment.","PeriodicalId":365868,"journal":{"name":"2013 IEEE International Symposium on Workload Characterization (IISWC)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125560488","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Characterizing the efficiency of data deduplication for big data storage management","authors":"Ruijin Zhou, Ming Liu, Tao Li","doi":"10.1109/IISWC.2013.6704674","DOIUrl":"https://doi.org/10.1109/IISWC.2013.6704674","url":null,"abstract":"The demand for data storage and processing is increasing at a rapid speed in the big data era. Such a tremendous amount of data pushes the limit on storage capacity and on the storage network. A significant portion of the dataset in big data workloads is redundant. As a result, deduplication technology, which removes replicas, becomes an attractive solution to save disk space and traffic in a big data environment. However, the overhead of extra CPU computation (hash indexing) and IO latency introduced by deduplication should be considered. Therefore, the net effect of using deduplication for big data workloads needs to be examined. To this end, we characterize the redundancy of typical big data workloads to justify the need for deduplication. We analyze and characterize the performance and energy impact brought by deduplication under various big data environments. In our experiments, we identify three sources of redundancy in big data workloads: 1) deploying more nodes, 2) expanding the dataset, and 3) using replication mechanisms. We elaborate on the advantages and disadvantages of different deduplication layers, locations, and granularities. In addition, we uncover the relation between energy overhead and the degree of redundancy. Furthermore, we investigate the deduplication efficiency in an SSD environment for big data workloads.","PeriodicalId":365868,"journal":{"name":"2013 IEEE International Symposium on Workload Characterization (IISWC)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129763560","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"(Mis)understanding the NUMA memory system performance of multithreaded workloads","authors":"Z. Majó, T. Gross","doi":"10.1109/IISWC.2013.6704666","DOIUrl":"https://doi.org/10.1109/IISWC.2013.6704666","url":null,"abstract":"An important aspect of workload characterization is understanding memory system performance (i.e., understanding a workload's interaction with the memory system). On systems with a non-uniform memory architecture (NUMA) the performance critically depends on the distribution of data and computations. The actual memory access patterns have a large influence on performance on systems with aggressive prefetcher units. This paper describes an analysis of the memory system performance of multithreaded programs and shows that some programs are (unintentionally) structured so that they use the memory system of today's NUMA-multicores inefficiently: Programs exhibit program-level data sharing, a performance-limiting factor that makes data and computation distribution in NUMA systems difficult. Moreover, many programs have irregular memory access patterns that are hard to predict by processor prefetcher units. The memory system performance as observed for a given program on a specific platform depends also on many algorithm and implementation decisions. The paper shows that a set of simple algorithmic changes coupled with commonly available OS functionality suffice to eliminate data sharing and to regularize the memory access patterns for a subset of the PARSEC parallel benchmarks. These simple source-level changes result in performance improvements of up to 3.1X, but more importantly, they lead to a fairer and more accurate performance evaluation on NUMA-multicore systems. They also illustrate the importance of carefully considering all details of algorithms and architectures to avoid drawing incorrect conclusions.","PeriodicalId":365868,"journal":{"name":"2013 IEEE International Symposium on Workload Characterization (IISWC)","volume":"78 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127666871","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Do C and Java programs scale differently on Hardware Transactional Memory?","authors":"Rei Odaira, J. Castaños, T. Nakaike","doi":"10.1109/IISWC.2013.6704668","DOIUrl":"https://doi.org/10.1109/IISWC.2013.6704668","url":null,"abstract":"People program in many different programming languages in the multi-core era, but how does each programming language affect application scalability with transactional memory? As commercial implementations of Hardware Transactional Memory (HTM) enter the market, the HTM support in two major programming languages, C and Java, is of critical importance to the industry. We studied the scalability of the same transactional memory applications written in C and Java, using the STAMP benchmarks. We performed our HTM experiments on an IBM mainframe zEnterprise EC12. We found that in 4 of the 10 STAMP benchmarks Java was more scalable than C. The biggest factor in this higher scalability was the efficient thread-local memory allocator in our Java VM. In two of the STAMP benchmarks C was more scalable because in C padding can be inserted efficiently among frequently updated fields to avoid false sharing. We also found Java VM services could cause severe aborts. By fixing or avoiding these problems, we confirmed that C and Java had similar HTM scalability for the STAMP benchmarks.","PeriodicalId":365868,"journal":{"name":"2013 IEEE International Symposium on Workload Characterization (IISWC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128545954","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Semantic characterization of MapReduce workloads","authors":"Zhihong Xu, Martin Hirzel, G. Rothermel","doi":"10.1109/IISWC.2013.6704673","DOIUrl":"https://doi.org/10.1109/IISWC.2013.6704673","url":null,"abstract":"MapReduce is a platform for analyzing large amounts of data on clusters of commodity machines. MapReduce is popular, in part thanks to its apparent simplicity. However, there are unstated requirements for the semantics of MapReduce applications that can affect their correctness and performance. MapReduce implementations do not check whether user code satisfies these requirements, leading to time-consuming debugging sessions, performance problems, and, worst of all, silently corrupt results. This paper makes these requirements explicit, framing them as semantic properties and assumed outcomes. It describes a black-box approach for testing for these properties, and uses the approach to characterize the semantics of 23 non-trivial MapReduce workloads. Surprisingly, we found that for most requirements, there is at least one workload that violates it. This means that MapReduce may be simple to use, but it is not as simple to use correctly. Based on our results, we provide insights to users on how to write higher-quality MapReduce code, and insights to system and language designers on ways to make their platforms more robust.","PeriodicalId":365868,"journal":{"name":"2013 IEEE International Symposium on Workload Characterization (IISWC)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128840182","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance, energy characterizations and architectural implications of an emerging mobile platform benchmark suite - MobileBench","authors":"D. Pandiyan, Shin-Ying Lee, Carole-Jean Wu","doi":"10.1109/IISWC.2013.6704679","DOIUrl":"https://doi.org/10.1109/IISWC.2013.6704679","url":null,"abstract":"In this paper, we explore key microarchitectural features of mobile computing platforms that are crucial to the performance of smart phone applications. We create and use a selection of representative smart phone applications, which we call MobileBench that aid in this analysis. We also evaluate the effectiveness of current memory subsystem on the mobile platforms. Furthermore, by instrumenting the Android framework, we perform energy characterization for MobileBench on an existing Samsung Galaxy S III smart phone. Based on our energy analysis, we find that application cores on modern smart phones consume significant amount of energy. This motivates our detailed performance analysis centered at the application cores. Based on our detailed performance studies, we reach several key findings. (i) Using a more sophisticated tournament branch predictor can improve the branch prediction accuracy but this does not translate to observable performance gain. (ii) Smart phone applications show distinct TLB capacity needs. Larger TLBs can improve performance by an avg. of 14%. (iii) The current L2 cache on most smart phone platform experiences poor utilization because of the fast-changing memory requirements of smart phone applications. Using a more effective cache management scheme improves the L2 cache utilization by as much as 29.3% and by an avg. of 12%. (iv) Smart phone applications are prefetching-friendly. Using a simple stride prefetcher can improve performance across MobileBench applications by an avg. of 14%. (v) Lastly, the memory bandwidth requirements of MobileBench applications are moderate and well under current smart phone memory bandwidth capacity of 8.3 GB/s. With these insights into the smart phone application characteristics, we hope to guide the design of future smart phone platforms for lower power consumptions through simpler architecture while achieving high performance.","PeriodicalId":365868,"journal":{"name":"2013 IEEE International Symposium on Workload Characterization (IISWC)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115786326","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hardware-independent application characterization","authors":"S. Pakin, P. McCormick","doi":"10.2172/1214640","DOIUrl":"https://doi.org/10.2172/1214640","url":null,"abstract":"The trend in high-performance computing is to include computational accelerators such as GPUs or Xeon Phis in each node of a large-scale system. Qualitatively, such accelerators tend to favor codes that perform large numbers of floating-point and integer operations per branch; that exhibit high degrees of memory locality; and that are highly data-parallel. The question we address in this work is how to quantify those characteristics. To that end we developed an application-characterization tool called Byfl that provides a set of “software performance counters”. These are analogous to the hardware performance counters provided by most modern processors but are implemented via code instrumentation-the equivalent of adding flops = flops + 1 after every floating-point operation but in fact implemented by modifying the compiler's internal representation of the code.","PeriodicalId":365868,"journal":{"name":"2013 IEEE International Symposium on Workload Characterization (IISWC)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121594150","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
D. C. S. Lucas, R. Auler, Rafael Dalibera, S. Rigo, E. Borin, G. Araújo
{"title":"Modeling virtual machines misprediction overhead","authors":"D. C. S. Lucas, R. Auler, Rafael Dalibera, S. Rigo, E. Borin, G. Araújo","doi":"10.1109/IISWC.2013.6704681","DOIUrl":"https://doi.org/10.1109/IISWC.2013.6704681","url":null,"abstract":"Virtual machines are versatile systems that can support innovative solutions to many problems. These systems usually rely on emulation techniques, such as interpretation and dynamic binary translation, to execute guest application code. Usually, in order to select the best emulation technique for each code segment, the system must predict whether the code is worth compiling (frequently executed) or not, known as hotness prediction. In this paper we show that the threshold-based hot code predictor, frequently mispredicts the code hotness and as a result the VM emulation performance become dominated by miscompilations. To do so, we developed a mathematical model to simulate the behavior of such predictor and using it we quantify and characterize the impact of mispredictions in several benchmarks. We also show how the threshold choice can affect the predictor, what are the major overhead components and how using SPEC to analyze a VM performance can lead to misleading results.","PeriodicalId":365868,"journal":{"name":"2013 IEEE International Symposium on Workload Characterization (IISWC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132541941","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shuai Che, Bradford M. Beckmann, S. Reinhardt, K. Skadron
{"title":"Pannotia: Understanding irregular GPGPU graph applications","authors":"Shuai Che, Bradford M. Beckmann, S. Reinhardt, K. Skadron","doi":"10.1109/IISWC.2013.6704684","DOIUrl":"https://doi.org/10.1109/IISWC.2013.6704684","url":null,"abstract":"GPUs have become popular recently to accelerate general-purpose data-parallel applications. However, most existing work has focused on GPU-friendly applications with regular data structures and access patterns. While a few prior studies have shown that some irregular workloads can also achieve speedups on GPUs, this domain has not been investigated thoroughly. Graph applications are one such set of irregular workloads, used in many commercial and scientific domains. In particular, graph mining -as well as web and social network analysis- are promising applications that GPUs could accelerate. However, implementing and optimizing these graph algorithms on SIMD architectures is challenging because their data-dependent behavior results in significant branch and memory divergence. To address these concerns and facilitate research in this area, this paper presents and characterizes a suite of GPGPU graph applications, Pannotia, which is implemented in OpenCL and contains problems from diverse and important graph application domains. We perform a first-step characterization and analysis of these benchmarks and study their behavior on real hardware. We also use clustering analysis to illustrate the similarities and differences of the applications in the suite. Finally, we make architectural and scheduling suggestions that will improve their execution efficiency on GPUs.","PeriodicalId":365868,"journal":{"name":"2013 IEEE International Symposium on Workload Characterization (IISWC)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114753023","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}