{"title":"Pinpointing data locality bottlenecks with low overhead","authors":"Xu Liu, J. Mellor-Crummey","doi":"10.1109/ISPASS.2013.6557169","DOIUrl":"https://doi.org/10.1109/ISPASS.2013.6557169","url":null,"abstract":"A wide gap exists between the speed of modern processors and memory subsystems. As a result, long latencies associated with fetching data from memory often significantly degrade execution performance. To aid with program tuning, application developers need tools that analyze memory access patterns and guide them in reusing data in the fastest levels of a system's memory hierarchy. In this paper, we describe a novel, efficient, and effective tool for data locality measurement and analysis. Unlike other tools, our tool uses both statistical PMU sampling to quantify the cost of data locality bottlenecks and cache simulation to compute reuse distance to diagnose the causes of locality problems. This approach enables us to collect rich information that provides insight into a program's data locality problems. Our tool attributes quantitative measurements of observed memory latency to program variables and dynamically allocated data, as well as code. Our tool identifies data touched by reuse pairs and the accesses involved, identified with their full calling context. Finally, our tool employs both sampling and parallelization to accelerate the computation of representative reuse distance information. Experiments show that with an overhead of only about 13%, our tool provides detailed insights that enabled us to make non-trivial improvements to memory-bound HPC benchmarks.","PeriodicalId":299172,"journal":{"name":"2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114290616","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A statistical machine learning based modeling and exploration framework for run-time cross-stack energy optimization","authors":"Changshu Zhang, A. Ravindran","doi":"10.1109/ISPASS.2013.6557161","DOIUrl":"https://doi.org/10.1109/ISPASS.2013.6557161","url":null,"abstract":"As the complexity of many-core processors grows, meeting performance, energy, temperature, reliability, and noise requirements under dynamically changing operating conditions requires run-time optimization of all parts of the computing stack - architecture, system software, and applications. Unfortunately, the combination of design parameters for the entire computing stack results in an operating space of millions of points that must be explored and evaluated at run-time. In this paper, we present a statistical machine learning (SML) based modeling framework that can be used to rapidly explore such vast operating spaces. We construct a multivariate adaptive regression spline (MARS) based model that uses a number of architecture and application parameters as predictor variables to predict performance and power. We then use a Pareto-front exploring evolutionary algorithm to determine operating points for optimal power and performance. The operating points constituting the Pareto front are stored in look-up tables for runtime use. The proposed framework is applied to an x264 video encoding application executing on a quad core processor. The microarchitectural predictor variables include core and cache parameters. The application predictor variables include the video resolution and the visual quality determined by the choice of the motion estimation algorithm. The model outputs the average frames per second (FPS) and the average power consumption. The MARS model has R2 values of 0.9657 and 0.9467 for FPS and power, respectively. For a video frame resolution of 480x320 and an FPS of 20, a power saving of 55% can be obtained by jointly tuning the microarchitectural parameters and the visual quality.","PeriodicalId":299172,"journal":{"name":"2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126496847","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Characterizing the microarchitectural side effects of operating system calls","authors":"A. Mayberry, Matthew Laquidara, C. Weems","doi":"10.1109/ISPASS.2013.6557158","DOIUrl":"https://doi.org/10.1109/ISPASS.2013.6557158","url":null,"abstract":"We measure the collateral effect of system calls on microarchitectural state using a validated, cycle-accurate simulator. Our results demonstrate that, in some cases, the disruption of user-mode performance is significant. This disruption varies by the operating system and even the kernel version in use.","PeriodicalId":299172,"journal":{"name":"2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"26 11","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120812748","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Synergistic coupling of SSD and hard disk for QoS-aware virtual memory","authors":"Ke Liu, Xuechen Zhang, K. Davis, Song Jiang","doi":"10.1109/ISPASS.2013.6557143","DOIUrl":"https://doi.org/10.1109/ISPASS.2013.6557143","url":null,"abstract":"With significant advantages in capacity, power consumption, and price, the solid state disk (SSD) has good potential to be employed as an extension of DRAM (memory), such that applications with large working sets could run efficiently on a modestly configured system. While initial results reported in recent works show promising prospects for this use of SSD by incorporating it into the management of virtual memory, frequent writes from write-intensive programs could quickly wear out the SSD, making the idea less practical. We propose a scheme, HybridSwap, that integrates a hard disk with an SSD for virtual memory management, synergistically achieving the advantages of both. In addition, HybridSwap can constrain performance loss caused by swapping according to user-specified QoS requirements. To minimize writes to the SSD without undue performance loss, HybridSwap sequentially swaps a set of pages of virtual memory to the hard disk if they are expected to be read together. Using a history of page access patterns, HybridSwap dynamically creates an out-of-memory virtual memory page layout on the swap space spanning the SSD and hard disk such that random reads are served by the SSD and sequential reads are asynchronously served by the hard disk with high efficiency. In practice HybridSwap can effectively exploit the aggregate bandwidth of the two devices to accelerate page swapping. We have implemented HybridSwap in a recent Linux kernel, version 2.6.35.7. Our evaluation with representative benchmarks, such as Memcached for key-value storage, and scientific programs from the ALGLIB cross-platform numerical analysis and data processing library, shows that the number of writes to the SSD can be reduced by 40% with the system's performance comparable to that with pure SSD swapping, and can satisfy a swapping-related QoS requirement as long as","PeriodicalId":299172,"journal":{"name":"2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128220275","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Trace filtering of multithreaded applications for CMP memory simulation","authors":"Alejandro Rico, Alex Ramírez, M. Valero","doi":"10.1109/ISPASS.2013.6557160","DOIUrl":"https://doi.org/10.1109/ISPASS.2013.6557160","url":null,"abstract":"Recent works have shown that modelling the performance of out-of-order superscalar cores is doable using filtered memory traces for single-thread simulations. However, those techniques do not account for cache coherence actions, so they cannot be used reliably in multithreaded scenarios. In this paper, we leverage the structure of parallel applications to propose a simulation methodology that enables the use of filtered memory traces for the simulation of multithreaded applications on multicore architectures. In our experiments our proposal reduced the simulation error of state-of-the-art techniques by 39% on average, while only losing 9.5% of simulation speedup.","PeriodicalId":299172,"journal":{"name":"2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131045504","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"McSimA+: A manycore simulator with application-level+ simulation and detailed microarchitecture modeling","authors":"Jung Ho Ahn, Sheng Li, O. Seongil, N. Jouppi","doi":"10.1109/ISPASS.2013.6557148","DOIUrl":"https://doi.org/10.1109/ISPASS.2013.6557148","url":null,"abstract":"With their significant performance and energy advantages, emerging manycore processors have also brought new challenges to the architecture research community. Manycore processors are highly integrated complex system-on-chips with complicated core and uncore subsystems. The core subsystems can consist of a large number of traditional and asymmetric cores. The uncore subsystems have also become unprecedentedly powerful and complex with deeper cache hierarchies, advanced on-chip interconnects, and high-performance memory controllers. In order to conduct research for emerging manycore processor systems, a microarchitecture-level and cycle-level manycore simulation infrastructure is needed. This paper introduces McSimA+, a new timing simulation infrastructure, to meet these needs. McSimA+ models x86-based asymmetric manycore microarchitectures in detail for both core and uncore subsystems, including a full spectrum of asymmetric cores from single-threaded to multithreaded and from in-order to out-of-order, sophisticated cache hierarchies, coherence hardware, on-chip interconnects, memory controllers, and main memory. McSimA+ is an application-level+ simulator, offering a middle ground between a full-system simulator and an application-level simulator. Therefore, it enjoys the light weight of an application-level simulator and the full control of threads and processes as in a full-system simulator. This paper also explores an asymmetric clustered manycore architecture that can reduce the thread migration cost to achieve a noticeable performance improvement compared to a state-of-the-art asymmetric manycore architecture.","PeriodicalId":299172,"journal":{"name":"2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115062865","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Non-determinism and overcount on modern hardware performance counter implementations","authors":"Vincent M. Weaver, D. Terpstra, S. Moore","doi":"10.1109/ISPASS.2013.6557172","DOIUrl":"https://doi.org/10.1109/ISPASS.2013.6557172","url":null,"abstract":"Ideal hardware performance counters provide exact deterministic results. Real-world performance monitoring unit (PMU) implementations do not always live up to this ideal. Events that should be exact and deterministic (such as retired instructions) show run-to-run variation and overcount on x86_64 machines, even when run in strictly controlled environments. These effects are non-intuitive to casual users and cause difficulties when strict determinism is desirable, such as when implementing deterministic replay or deterministic threading libraries. We investigate eleven different x86_64 CPU implementations and discover the sources of divergence from expected count totals. Of all the counter events investigated, we find only a few that exhibit enough determinism to be used without adjustment in deterministic execution environments. We also briefly investigate ARM, IA64, POWER, and SPARC systems and find that on these platforms the counter events have more determinism. We explore various methods of working around the limitations of the x86_64 events, but in many cases this is not possible and would require architectural redesign of the underlying PMU.","PeriodicalId":299172,"journal":{"name":"2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128261109","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Selecting benchmark combinations for the evaluation of multicore throughput","authors":"Ricardo A. Velásquez, P. Michaud, André Seznec","doi":"10.1109/ISPASS.2013.6557168","DOIUrl":"https://doi.org/10.1109/ISPASS.2013.6557168","url":null,"abstract":"Most high-performance processors today are able to execute multiple threads of execution simultaneously. Threads share processor resources, like the last-level cache, which may decrease throughput in a non-obvious way, depending on the threads' characteristics. Computer architects usually study multiprogrammed workloads by considering a set of benchmarks and some combinations of these benchmarks. Because detailed microarchitecture simulators are slow, we want a subset of combinations that is as small as possible, yet representative. However, there is no standard method for selecting such a sample, and different authors have used different methods. It is not clear how the choice of a particular sample impacts the conclusions of a study. We propose and compare different sampling methods for defining multiprogrammed workloads for computer architecture studies. We evaluate their effectiveness with a case study, the comparison of several multicore last-level cache replacement policies. We show that random sampling, the simplest method, is a possible way to define a representative workload sample, provided the sample is large enough. We propose a method for estimating the required sample size based on fast approximate simulation. We also propose a new method, workload stratification, which is very effective at reducing the sample size in situations where random sampling would require large samples.","PeriodicalId":299172,"journal":{"name":"2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132823582","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Increasing the Transparent Page Sharing in Java","authors":"Kazunori Ogata, Tamiya Onodera","doi":"10.1109/ISPASS.2013.6557144","DOIUrl":"https://doi.org/10.1109/ISPASS.2013.6557144","url":null,"abstract":"Improving memory utilization is important for increasing the efficiency of a cloud datacenter, since it increases the number of usable VMs. Memory over-commitment is a common technique for this purpose. Transparent Page Sharing (TPS) is a technique that improves utilization by sharing identical memory pages to reduce the total memory consumption. For a cloud datacenter, we might expect that TPS will reduce memory usage because VMs often execute the same OS and middleware and thus may have many identical pages. However, TPS is less effective for Java-based middleware because the Java VM finds it difficult to manage the layouts of internal data structures that depend on the execution of Java programs. This paper presents detailed breakdowns of the memory usage of KVM guest VMs executing a Java-based Web application server. We then propose increasing the amount of page sharing by utilizing a class sharing mechanism in the Java VM. Our approach reduced the measured physical memory for class metadata by up to 89.6% when using the Apache DayTrader benchmark running on four guest VMs in a KVM host machine.","PeriodicalId":299172,"journal":{"name":"2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129020833","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ISA-independent workload characterization and its implications for specialized architectures","authors":"Y. Shao, D. Brooks","doi":"10.1109/ISPASS.2013.6557175","DOIUrl":"https://doi.org/10.1109/ISPASS.2013.6557175","url":null,"abstract":"Specialized architectures will become increasingly important as the computing industry demands more energy-efficient designs. The application-centric design style for these architectures is heavily dependent on workload characterization of intrinsic program characteristics, but at the same time these architectures are likely to be decoupled from legacy ISAs. In this work, we perform ISA-independent workload characterization for a variety of important intrinsic program characteristics relating to computation, memory, and control flow. The analysis is performed using a JIT compiler that emits ISA-independent instructions. We compare this analysis with an x86 trace and find that several of the analyses are highly sensitive to the ISA. We conclude that designers of specialized architectures must adopt ISA-independent workload characterization approaches.","PeriodicalId":299172,"journal":{"name":"2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130406663","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}