Zhibin Yu, Hai Jin, Nilanjan Goswami, Tao Li, L. John
{"title":"Hierarchically characterizing CUDA program behavior","authors":"Zhibin Yu, Hai Jin, Nilanjan Goswami, Tao Li, L. John","doi":"10.1109/IISWC.2011.6114201","DOIUrl":"https://doi.org/10.1109/IISWC.2011.6114201","url":null,"abstract":"CUDA has become a very popular programming paradigm in the parallel computing area. However, very little work has been done on characterizing CUDA kernels. In this work, we measure the thread level performance, collect the basic block level characteristics, and glean the instruction level properties for about 35 programs from CUDA SDK, Parboil, and Rodinia benchmark suites. In addition, we define basic block vectors, synchronization vectors and thread similarity matrix to capture the characteristics of CUDA programs efficiently. We find that CUDA programs have some unique characteristics at each level compared to sequential programs.","PeriodicalId":367515,"journal":{"name":"2011 IEEE International Symposium on Workload Characterization (IISWC)","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121095500","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Anthony Gutierrez, R. Dreslinski, T. Wenisch, T. Mudge, A. Saidi, C. D. Emmons, N. Paver
{"title":"Full-system analysis and characterization of interactive smartphone applications","authors":"Anthony Gutierrez, R. Dreslinski, T. Wenisch, T. Mudge, A. Saidi, C. D. Emmons, N. Paver","doi":"10.1109/IISWC.2011.6114205","DOIUrl":"https://doi.org/10.1109/IISWC.2011.6114205","url":null,"abstract":"Smartphones have recently overtaken PCs as the primary consumer computing device in terms of annual unit shipments. Given this rapid market growth, it is important that mobile system designers and computer architects analyze the characteristics of the interactive applications users have come to expect on these platforms. With the introduction of high-performance, low-power, general purpose CPUs in the latest smartphone models, users now expect PC-like performance and a rich user experience, including high-definition audio and video, high-quality multimedia, dynamic web content, responsive user interfaces, and 3D graphics. In this paper, we characterize the microarchitectural behavior of representative smartphone applications on a current-generation mobile platform to identify trends that might impact future designs. To this end, we measure a suite of widely available mobile applications for audio, video, and interactive gaming. To complete this suite we developed BBench, a new fully-automated benchmark to assess a web-browser's performance when rendering some of the most popular and complex sites on the web. We contrast these applications' characteristics with those of the SPEC CPU2006 benchmark suite. We demonstrate that real-world interactive smartphone applications differ markedly from the SPEC suite. Specifically the instruction cache, instruction TLB, and branch predictor suffer from poor performance. We conjecture that this is due to the applications' reliance on numerous high level software abstractions (shared libraries and OS services). Similar trends have been observed for UI-intensive interactive applications on the desktop.","PeriodicalId":367515,"journal":{"name":"2011 IEEE International Symposium on Workload Characterization (IISWC)","volume":"86 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126446551","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallelization and characterization of pattern matching using GPUs","authors":"G. Vasiliadis, M. Polychronakis, S. Ioannidis","doi":"10.1109/IISWC.2011.6114181","DOIUrl":"https://doi.org/10.1109/IISWC.2011.6114181","url":null,"abstract":"Pattern matching is a highly computationally intensive operation used in a plethora of applications. Unfortunately, due to the ever increasing storage capacity and link speeds, the amount of data that needs to be matched against a given set of patterns is growing rapidly. In this paper, we explore how the highly parallel computational capabilities of commodity graphics processing units (GPUs) can be exploited for high-speed pattern matching. We present the design, implementation, and evaluation of a pattern matching library running on the GPU, which can be used transparently by a wide range of applications to increase their overall performance. The library supports both string searching and regular expression matching on the NVIDIA CUDA architecture. We have also explored the performance impact of different types of memory hierarchies, and present solutions to alleviate memory congestion problems. The results of our performance evaluation using off-the-shelf graphics processors demonstrate that GPU-based pattern matching can reach tens of gigabits per second on different workloads.","PeriodicalId":367515,"journal":{"name":"2011 IEEE International Symposium on Workload Characterization (IISWC)","volume":"74 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127004314","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Multi-Program Performance Model: Debunking current practice in multi-core simulation","authors":"K. V. Craeynest, L. Eeckhout","doi":"10.1109/IISWC.2011.6114194","DOIUrl":"https://doi.org/10.1109/IISWC.2011.6114194","url":null,"abstract":"Composing a representative multi-program multi-core workload is non-trivial. A multi-core processor can execute multiple independent programs concurrently, and hence, any program mix can form a potential multi-program workload. Given the very large number of possible multi-program workloads and the limited speed of current simulation methods, it is impossible to evaluate all possible multi-program workloads.","PeriodicalId":367515,"journal":{"name":"2011 IEEE International Symposium on Workload Characterization (IISWC)","volume":"120 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130842690","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Autocorrelation analysis: A new and improved method for branch predictability characterization","authors":"Jing Chen, L. John","doi":"10.1109/IISWC.2011.6114179","DOIUrl":"https://doi.org/10.1109/IISWC.2011.6114179","url":null,"abstract":"Branch predictability characterization not only helps to improve branch prediction but also helps to optimize predicated execution. Branch taken rate and branch transition rate have been proposed to characterize the branch predictability. However, these two metrics may misclassify branches with regular history patterns as hard-to-predict branches, causing an inaccurate and ambiguous view of branch predictability. In this paper, we utilize autocorrelation based analysis of branch history patterns and present two orthogonal metrics: Degree of Pattern Irregularity (DPI) and Effective Pattern Length (EPL). Unlike the existing taken rate or transition rate, DPI directly measures the regularity of the patterns in per-address branch history, and hence is more accurate in branch classification. On the other hand, EPL reveals the optimum branch history length for the easy-to-predict branches. The proposed metrics are evaluated with PAs, GAs, and Perceptron branch predictors, and the results show that on average, DPI improves the accuracy of hard-to-predict branch classification by up to 17.7% over taken rate and 15.0% over transition rate for the workloads in this study. It is also able to identify 18.9% more easy-to-predict branches compared with taken rate and 12.8% more compared with transition rate. The proposed metrics are a valuable extension to the existing metrics for accurately characterizing branch predictability.","PeriodicalId":367515,"journal":{"name":"2011 IEEE International Symposium on Workload Characterization (IISWC)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131389112","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Leonardo Piga, R. Bergamaschi, F. Klein, R. Azevedo, S. Rigo
{"title":"Empirical Web server power modeling and characterization","authors":"Leonardo Piga, R. Bergamaschi, F. Klein, R. Azevedo, S. Rigo","doi":"10.1109/IISWC.2011.6114200","DOIUrl":"https://doi.org/10.1109/IISWC.2011.6114200","url":null,"abstract":"Commodity processors, which are prevalent in Internet-based data centers, do not have internal sensors for monitoring energy consumption. Such processors usually feature performance counters which can be used to indirectly estimate power consumption [1]. The usual approach in those studies is to derive linear power models based on the usage numbers collected for the processor sub-components such as caches and branch predictor. These models are usually targeted to CPU-bound applications which need more CPU performance counter parameters and display high CPU usage most of the time. In a Web server environment, the applications are mostly I/O-bound, which creates non-linear effects among server statistics of performance and power, making these models less suitable for Web servers. This paper presents a new approach for power models for Web servers, based on ranges of CPU usage values and performance server statistics. This new method softens the non-linear relationship between server statistics and power consumption on linear power models, improving their accuracy.","PeriodicalId":367515,"journal":{"name":"2011 IEEE International Symposium on Workload Characterization (IISWC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130415970","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On the memory system requirements of future scientific applications: Four case-studies","authors":"Milan Pavlović, Yoav Etsion, Alex Ramírez","doi":"10.1109/IISWC.2011.6114176","DOIUrl":"https://doi.org/10.1109/IISWC.2011.6114176","url":null,"abstract":"In this paper, we observe and characterize the memory behaviour, and specifically memory footprint, memory bandwidth and cache effectiveness, of several well-known parallel scientific applications running on a large processor cluster. Based on the analysis of their instrumented execution, we project some performance requirements from future memory systems serving large-scale chip multiprocessors (CMPs). In addition, we estimate the impact of memory system performance on the amount of instruction stalls, as well as on the real computational performance, using the number of floating point operations per second the applications perform. Our projections show that the limitations of present memory technologies, either by means of capacity or bandwidth, will have a strong negative impact on scalability of memory systems for large CMPs. We conclude that future supercomputer systems require research on new alternative memory architectures, capable of offering both capacity and bandwidth beyond what current solutions provide.","PeriodicalId":367515,"journal":{"name":"2011 IEEE International Symposium on Workload Characterization (IISWC)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125277987","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
R. Jongerius, Phillip Stanley-Marbell, H. Corporaal
{"title":"Quantifying the common computational problems in contemporary applications","authors":"R. Jongerius, Phillip Stanley-Marbell, H. Corporaal","doi":"10.1109/IISWC.2011.6114199","DOIUrl":"https://doi.org/10.1109/IISWC.2011.6114199","url":null,"abstract":"Selecting, for each application, the top five functions for manual code inspection resulted in analyzing a portion of the source code accounting for 77% of the total run time. Figure 2 shows the fraction of the analyzed run time covered by the 16 identified CPs (some are condensed in one slice for clarity.)","PeriodicalId":367515,"journal":{"name":"2011 IEEE International Symposium on Workload Characterization (IISWC)","volume":"458 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125848996","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Christina Delimitrou, S. Sankar, Kushagra Vaid, C. Kozyrakis
{"title":"Decoupling datacenter studies from access to large-scale applications: A modeling approach for storage workloads","authors":"Christina Delimitrou, S. Sankar, Kushagra Vaid, C. Kozyrakis","doi":"10.1109/IISWC.2011.6114196","DOIUrl":"https://doi.org/10.1109/IISWC.2011.6114196","url":null,"abstract":"The cost and power impact of suboptimal storage configurations is significant in datacenters (DCs) as inefficiencies are aggregated over several thousand servers and represent considerable losses in capital and operating costs. Designing performance, power and cost-optimized systems requires a deep understanding of target workloads, and mechanisms to effectively model different storage design choices. Traditional benchmarking is invalid in cloud data-stores, representative storage profiles are hard to obtain, while replaying the entire application in all storage configurations is impractical both from a cost and time perspective. Despite these issues, current workload generators are not able to accurately reproduce key aspects of real application patterns. Some of these features include spatial and temporal locality, as well as tuning the intensity of the workload to emulate different storage system configurations. To address these limitations, we propose a modeling and characterization framework for large-scale storage applications. As part of this framework we use a state diagram-based storage model, extend it to a hierarchical representation and implement a tool that consistently recreates I/O loads of DC applications. We present the principal features of the framework that allow accurate modeling and generation of storage workloads and the validation process performed against ten original DC applications traces. Furthermore, using our framework, we perform an in-depth, per-thread characterization of these applications and provide insights on their behavior. Finally, we explore two practical applications of this methodology: SSD caching and defragmentation benefits on enterprise storage. In both cases we observe significant speedup for most of the examined applications. Since knowledge of the workload's spatial and temporal locality is necessary to model these use cases, our framework was instrumental in quantifying their performance benefits. The proposed methodology provides a detailed understanding on the storage activity of large-scale applications and enables a wide spectrum of storage studies without the requirement for access to real applications and full application deployment.","PeriodicalId":367515,"journal":{"name":"2011 IEEE International Symposium on Workload Characterization (IISWC)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129993579","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A quantitative analysis of cooling power in container-based data centers","authors":"Amer Qouneh, Chao Li, Tao Li","doi":"10.1109/IISWC.2011.6114197","DOIUrl":"https://doi.org/10.1109/IISWC.2011.6114197","url":null,"abstract":"Cooling power is often represented as a single taxed cost on the total energy consumption of the data center. Some estimates go as far as 50% of the total energy demand. However, this view is rather simplistic in the presence of a multitude of cooling options and optimizations. In response to the rising cost of energy, the industry introduced modular design in the form of containers to serve as the new building block for data centers. However, it is still unclear how efficient they are compared to raised-floor data centers and under what conditions they are preferred. In this paper, we provide comparative and quantitative analysis of cooling power in both container-based and raised-floor data centers. Our results show that a container achieves 80% and 42% savings in cooling and facility powers respectively compared to a raised-floor data center and that savings of 41% in cooling power are possible when workloads are consolidated onto the least number of containers. We also show that cooling optimizations are not very effective at high utilizations; and that a raised-floor data center can approach the efficiency of a container at low utilizations when employing a simple cooling optimization.","PeriodicalId":367515,"journal":{"name":"2011 IEEE International Symposium on Workload Characterization (IISWC)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127665189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}