J. Arnau, Joan-Manuel Parcerisa, Polychronis Xekalakis
{"title":"Boosting mobile GPU performance with a decoupled access/execute fragment processor","authors":"J. Arnau, Joan-Manuel Parcerisa, Polychronis Xekalakis","doi":"10.1145/2366231.2337169","DOIUrl":"https://doi.org/10.1145/2366231.2337169","url":null,"abstract":"Smartphones represent one of the fastest growing markets, providing significant hardware/software improvements every few months. However, supporting these capabilities reduces the operating time per battery charge. The CPU/GPU component is only left with a shrinking fraction of the power budget, since most of the energy is consumed by the screen and the antenna. In this paper, we focus on improving the energy efficiency of the GPU since graphical applications consist an important part of the existing market. Moreover, the trend towards better screens will inevitably lead to a higher demand for improved graphics rendering. We show that the main bottleneck for these applications is the texture cache and that traditional techniques for hiding memory latency (prefetching, multithreading) do not work well or come at a high energy cost. We thus propose the migration of GPU designs towards the decoupled access-execute concept. Furthermore, we significantly reduce bandwidth usage in the decoupled architecture by exploiting inter-core data sharing. Using commercial Android applications, we show that the end design can achieve 93% of the performance of a heavily multithreaded GPU while providing energy savings of 34%.","PeriodicalId":193578,"journal":{"name":"2012 39th Annual International Symposium on Computer Architecture (ISCA)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127647870","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Side-channel vulnerability factor: A metric for measuring information leakage","authors":"J. Demme, R. Martin, A. Waksman, S. Sethumadhavan","doi":"10.1145/2366231.2337172","DOIUrl":"https://doi.org/10.1145/2366231.2337172","url":null,"abstract":"There have been many attacks that exploit side-effects of program execution to expose secret information and many proposed countermeasures to protect against these attacks. However there is currently no systematic, holistic methodology for understanding information leakage. As a result, it is not well known how design decisions affect information leakage or the vulnerability of systems to side-channel attacks. In this paper, we propose a metric for measuring information leakage called the Side-channel Vulnerability Factor (SVF). SVF is based on our observation that all side-channel attacks ranging from physical to microarchitectural to software rely on recognizing leaked execution patterns. SVF quantifies patterns in attackers' observations and measures their correlation to the victim's actual execution patterns and in doing so captures systems' vulnerability to side-channel attacks. In a detailed case study of on-chip memory systems, SVF measurements help expose unexpected vulnerabilities in whole-system designs and shows how designers can make performance-security trade-offs. Thus, SVF provides a quantitative approach to secure computer architecture.","PeriodicalId":193578,"journal":{"name":"2012 39th Annual International Symposium on Computer Architecture (ISCA)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133934875","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A first-order mechanistic model for architectural vulnerability factor","authors":"Arun A. Nair, Stijn Eyerman, L. Eeckhout, L. John","doi":"10.1145/2366231.2337191","DOIUrl":"https://doi.org/10.1145/2366231.2337191","url":null,"abstract":"Soft error reliability has become a first-order design criterion for modern microprocessors. Architectural Vulnerability Factor (AVF) modeling is often used to capture the probability that a radiation-induced fault in a hardware structure will manifest as an error at the program output. AVF estimation requires detailed microarchitectural simulations which are time-consuming and typically present aggregate metrics. Moreover, it requires a large number of simulations to derive insight into the impact of microarchitectural events on AVF. In this work we present a first-order mechanistic analytical model for computing AVF by estimating the occupancy of correct-path state in important microarchitecture structures through inexpensive profiling. We show that the model estimates the AVF for the reorder buffer, issue queue, load and store queue, and functional units in a 4-wide issue machine with a mean absolute error of less than 0.07. The model is constructed from the first principles of out-of-order processor execution in order to provide novel insight into the interaction of the workload with the microarchitecture to determine AVF. We demonstrate that the model can be used to perform design space explorations to understand trade-offs between soft error rate and performance, to study the impact of scaling of microarchitectural structures on AVF and performance, and to characterize workloads for AVF.","PeriodicalId":193578,"journal":{"name":"2012 39th Annual International Symposium on Computer Architecture (ISCA)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133238315","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Vasileios Kontorinis, L. Zhang, Baris Aksanli, J. Sampson, H. Homayoun, Eddie Pettis, D. Tullsen, T. Simunic
{"title":"Managing distributed UPS energy for effective power capping in data centers","authors":"Vasileios Kontorinis, L. Zhang, Baris Aksanli, J. Sampson, H. Homayoun, Eddie Pettis, D. Tullsen, T. Simunic","doi":"10.1145/2366231.2337216","DOIUrl":"https://doi.org/10.1145/2366231.2337216","url":null,"abstract":"Power over-subscription can reduce costs for modern data centers. However, designing the power infrastructure for a lower operating power point than the aggregated peak power of all servers requires dynamic techniques to avoid high peak power costs and, even worse, tripping circuit breakers. This work presents an architecture for distributed per-server UPSs that stores energy during low activity periods and uses this energy during power spikes. This work leverages the distributed nature of the UPS batteries and develops policies that prolong the duration of their usage. The specific approach shaves 19.4% of the peak power for modern servers, at no cost in performance, allowing the installation of 24% more servers within the same power budget. More servers amortize infrastructure costs better and, hence, reduce total cost of ownership per server by 6.3%.","PeriodicalId":193578,"journal":{"name":"2012 39th Annual International Symposium on Computer Architecture (ISCA)","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125029559","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
D. Yoon, Minseong Jeong, Michael B. Sullivan, M. Erez
{"title":"The dynamic granularity memory system","authors":"D. Yoon, Minseong Jeong, Michael B. Sullivan, M. Erez","doi":"10.1145/2366231.2337222","DOIUrl":"https://doi.org/10.1145/2366231.2337222","url":null,"abstract":"Chip multiprocessors enable continued performance scaling with increasingly many cores per chip. As the throughput of computation outpaces available memory bandwidth, however, the system bottleneck will shift to main memory. We present a memory system, the dynamic granularity memory system (DGMS), which avoids unnecessary data transfers, saves power, and improves system performance by dynamically changing between fine and coarse-grained memory accesses. DGMS predicts memory access granularities dynamically in hardware, and does not require software or OS support. The dynamic operation of DGMS gives it superior ease of implementation and power efficiency relative to prior multi-granularity memory systems, while maintaining comparable levels of system performance.","PeriodicalId":193578,"journal":{"name":"2012 39th Annual International Symposium on Computer Architecture (ISCA)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126359266","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rachata Ausavarungnirun, K. Chang, Lavanya Subramanian, G. Loh, O. Mutlu
{"title":"Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems","authors":"Rachata Ausavarungnirun, K. Chang, Lavanya Subramanian, G. Loh, O. Mutlu","doi":"10.1145/2366231.2337207","DOIUrl":"https://doi.org/10.1145/2366231.2337207","url":null,"abstract":"When multiple processor (CPU) cores and a GPU integrated together on the same chip share the off-chip main memory, requests from the GPU can heavily interfere with requests from the CPU cores, leading to low system performance and starvation of CPU cores. Unfortunately, state-of-the-art application-aware memory scheduling algorithms are ineffective at solving this problem at low complexity due to the large amount of GPU traffic. A large and costly request buffer is needed to provide these algorithms with enough visibility across the global request stream, requiring relatively complex hardware implementations. This paper proposes a fundamentally new approach that decouples the memory controller's three primary tasks into three significantly simpler structures that together improve system performance and fairness, especially in integrated CPU-GPU systems. Our three-stage memory controller first groups requests based on row-buffer locality. This grouping allows the second stage to focus only on inter-application request scheduling. These two stages enforce high-level policies regarding performance and fairness, and therefore the last stage consists of simple per-bank FIFO queues (no further command reordering within each bank) and straightforward logic that deals only with low-level DRAM commands and timing. We evaluate the design trade-offs involved in our Staged Memory Scheduler (SMS) and compare it against three state-of-the-art memory controller designs. Our evaluations show that SMS improves CPU performance without degrading GPU frame rate beyond a generally acceptable level, while being significantly less complex to implement than previous application-aware schedulers. Furthermore, SMS can be configured by the system software to prioritize the CPU or the GPU at varying levels to address different performance needs.","PeriodicalId":193578,"journal":{"name":"2012 39th Annual International Symposium on Computer Architecture (ISCA)","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114270863","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. Koibuchi, Hiroki Matsutani, H. Amano, D. Hsu, H. Casanova
{"title":"A case for random shortcut topologies for HPC interconnects","authors":"M. Koibuchi, Hiroki Matsutani, H. Amano, D. Hsu, H. Casanova","doi":"10.1145/2366231.2337179","DOIUrl":"https://doi.org/10.1145/2366231.2337179","url":null,"abstract":"As the scales of parallel applications and platforms increase the negative impact of communication latencies on performance becomes large. Fortunately, modern High Performance Computing (HPC) systems can exploit low-latency topologies of high-radix switches. In this context, we propose the use of random shortcut topologies, which are generated by augmenting classical topologies with random links. Using graph analysis we find that these topologies, when compared to non-random topologies of the same degree, lead to drastically reduced diameter and average shortest path length. The best results are obtained when adding random links to a ring topology, meaning that good random shortcut topologies can easily be generated for arbitrary numbers of switches. Using flit-level discrete event simulation we find that random shortcut topologies achieve throughput comparable to and latency lower than that of existing non-random topologies such as hypercubes and tori. Finally, we discuss and quantify practical challenges for random shortcut topologies, including routing scalability and larger physical cable lengths.","PeriodicalId":193578,"journal":{"name":"2012 39th Annual International Symposium on Computer Architecture (ISCA)","volume":"98 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117265287","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Krishna T. Malladi, Frank A. Nothaft, Karthika Periyathambi, Benjamin C. Lee, C. Kozyrakis, M. Horowitz
{"title":"Towards energy-proportional datacenter memory with mobile DRAM","authors":"Krishna T. Malladi, Frank A. Nothaft, Karthika Periyathambi, Benjamin C. Lee, C. Kozyrakis, M. Horowitz","doi":"10.1145/2366231.2337164","DOIUrl":"https://doi.org/10.1145/2366231.2337164","url":null,"abstract":"To increase datacenter energy efficiency, we need memory systems that keep pace with processor efficiency gains. Currently, servers use DDR3 memory, which is designed for high bandwidth but not for energy proportionality. A system using 20% of the peak DDR3 bandwidth consumes 2.3× the energy per bit compared to the energy consumed by a system with fully utilized memory bandwidth. Nevertheless, many datacenter applications stress memory capacity and latency but not memory bandwidth. In response, we architect server memory systems using mobile DRAM devices, trading peak bandwidth for lower energy consumption per bit and more efficient idle modes. We demonstrate 3-5× lower memory power, better proportionality, and negligible performance penalties for data-center workloads.","PeriodicalId":193578,"journal":{"name":"2012 39th Annual International Symposium on Computer Architecture (ISCA)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114136384","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Probabilistic Shared Cache Management (PriSM)","authors":"R. Manikantan, K. Rajan, Ramaswamy Govindarajan","doi":"10.1145/2366231.2337208","DOIUrl":"https://doi.org/10.1145/2366231.2337208","url":null,"abstract":"Effective sharing of the last level cache has a significant influence on the overall performance of a multicore system. We observe that existing solutions control cache occupancy at a coarser granularity, do not scale well to large core counts and in some cases lack the flexibility to support a variety of performance goals. In this paper, we propose Probabilistic Shared Cache Management (PriSM), a framework to manage the cache occupancy of different cores at cache block granularity by controlling their eviction probabilities. The proposed framework requires only simple hardware changes to implement, can scale to larger core count and is flexible enough to support a variety of performance goals. We demonstrate the flexibility of PriSM, by computing the eviction probabilities needed to achieve goals like hit-maximization, fairness and QOS. PriSM-HitMax improves performance by 18.7% over LRU and 11.8% over previously proposed schemes in a sixteen core machine. PriSM-Fairness improves fairness over existing solutions by 23.3% along with a performance improvement of 19.0%. PriSM-QOS successfully achieves the desired QOS targets.","PeriodicalId":193578,"journal":{"name":"2012 39th Annual International Symposium on Computer Architecture (ISCA)","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114809999","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving writeback efficiency with decoupled last-write prediction","authors":"Zhe Wang, S. Khan, Daniel A. Jiménez","doi":"10.1145/2366231.2337195","DOIUrl":"https://doi.org/10.1145/2366231.2337195","url":null,"abstract":"In modern DDRx memory systems, memory write requests compete with read requests for available memory resources, significantly increasing the average read request service time. Caches are used to mitigate long memory read latency that limits system performance. Dirty blocks in the last-level cache (LLC) that will not be written again before they are evicted will eventually be written back to memory. We refer to these blocks as last-write blocks. In this paper, we propose an LLC writeback technique that improves DRAM efficiency by scheduling predicted last-write blocks early. We propose a low overhead last-write predictor for the LLC. The predicted last-write blocks are made available to the memory controller for scheduling. This technique effectively re-distributes the memory requests and expands writes scheduling opportunities, allowing writes to be serviced efficiently by DRAM. The technique is flexible enough to be applied to any LLC replacement policy. Our evaluation with multi-programmed workloads shows that the technique significantly improves performance by 6.5%-11.4% on average over the traditional writeback technique in an eight-core processor with various DRAM configurations running memory intensive benchmarks.","PeriodicalId":193578,"journal":{"name":"2012 39th Annual International Symposium on Computer Architecture (ISCA)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124497832","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}