{"title":"Spare register aware prefetching for graph algorithms on GPUs","authors":"Nagesh B. Lakshminarayana, Hyesoon Kim","doi":"10.1109/HPCA.2014.6835970","DOIUrl":"https://doi.org/10.1109/HPCA.2014.6835970","url":null,"abstract":"More and more graph algorithms are being GPU enabled. Graph algorithm implementations on GPUs have irregular control flow and are memory-intensive with many irregular/data-dependent memory accesses. Due to these factors graph algorithms on GPUs have low execution efficiency. In this work we propose a mechanism to improve the execution efficiency of graph algorithms by improving their memory access latency tolerance. We propose a mechanism for prefetching data for load pairs that have one load dependent on the other - such pairs are common in graph algorithms. Our mechanism detects the target loads in hardware and injects instructions into the pipeline to prefetch data into spare registers that are not being used by any active threads. By prefetching data into registers, early eviction of prefetched data can be eliminated. We also propose a mechanism that uses the compiler to identify the target loads. Our mechanism improves performance over no prefetching by 10% on average and upto 51% for nine memory intensive graph algorithm kernels.","PeriodicalId":164587,"journal":{"name":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133436668","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Undersubscribed threading on clustered cache architectures","authors":"W. Heirman, Trevor E. Carlson, K. V. Craeynest, I. Hur, A. Jaleel, L. Eeckhout","doi":"10.1109/HPCA.2014.6835975","DOIUrl":"https://doi.org/10.1109/HPCA.2014.6835975","url":null,"abstract":"Recent many-core processors such as Intel's Xeon Phi and GPGPUs specialize in running highly scalable parallel applications at high performance while simultaneously embracing energy efficiency as a first-order design constraint. The traditional belief is that full utilization of all available cores also translates into the highest possible performance. In this paper, we study the effects of cache capacity conflicts and competition for shared off-chip bandwidth; and show that undersubscription, or not utilizing all cores, often yields significant increases in both performance and energy efficiency. Based on a detailed shared working set analysis we make the case for clustered cache architectures as an efficient design point for exploiting both data sharing and undersubscription, while providing low-latency and ease of implementation in many-core processors. We then propose ClusteR-aware Undersubscribed Scheduling of Threads (CRUST) which dynamically matches an application's working set size and off-chip bandwidth demands with the available on-chip cache capacity and off-chip bandwidth. CRUST improves application performance and energy efficiency by 15% on average, and up to 50%, for the NPB and SPEC OMP benchmarks. In addition, we make recommendations for the design of future many-core architectures, and show that taking the undersubscription usage model into account moves the optimum performance under the cores-versus-cache area tradeoff towards design points with more cores and less cache.","PeriodicalId":164587,"journal":{"name":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127399484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalably verifiable dynamic power management","authors":"Opeoluwa Matthews, Meng Zhang, Daniel J. Sorin","doi":"10.1109/HPCA.2014.6835967","DOIUrl":"https://doi.org/10.1109/HPCA.2014.6835967","url":null,"abstract":"Dynamic power management (DPM) is critical to maximizing the performance of systems ranging from multicore processors to datacenters. However, one formidable challenge with DPM schemes is verifying that the DPM schemes are correct as the number of computational resources scales up. In this paper, we develop a DPM scheme such that it is scalably verifiable with fully automated formal tools. The key to the design is that the DPM scheme has fractal behavior; that is, it behaves the same at every scale. We show that the fractal design enables scalable formal verification and simulation shows that our scheme does not sacrifice much performance compared to an oracle DPM scheme that optimally allocates power to computational resources. We implement our scheme in a 2-socket 16-core x86 system and experimentally evaluate it.","PeriodicalId":164587,"journal":{"name":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115278189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Suppressing the Oblivious RAM timing channel while making information leakage and program efficiency trade-offs","authors":"Christopher W. Fletcher, Ling Ren, Xiangyao Yu, Marten van Dijk, O. Khan, S. Devadas","doi":"10.1109/HPCA.2014.6835932","DOIUrl":"https://doi.org/10.1109/HPCA.2014.6835932","url":null,"abstract":"Oblivious RAM (ORAM) is an established cryptographic technique to hide a program's address pattern to an untrusted storage system. More recently, ORAM schemes have been proposed to replace conventional memory controllers in secure processor settings to protect against information leakage in external memory and the processor I/O bus. A serious problem in current secure processor ORAM proposals is that they don't obfuscate when ORAM accesses are made, or do so in a very conservative manner. Since secure processors make ORAM accesses on last-level cache misses, ORAM access timing strongly correlates to program access pattern (e.g., locality). This brings ORAM's purpose in secure processors into question. This paper makes two contributions. First, we show how a secure processor can bound ORAM timing channel leakage to a user-controllable leakage limit. The secure processor is allowed to dynamically optimize ORAM access rate for power/performance, subject to the constraint that the leakage limit is not violated. Second, we show how changing the leakage limit impacts program efficiency. We present a dynamic scheme that leaks at most 32 bits through the ORAM timing channel and introduces only 20% performance overhead and 12% power overhead relative to a baseline ORAM that has no timing channel protection. By reducing leakage to 16 bits, our scheme degrades in performance by 5% but gains in power efficiency by 3%. We show that a static (zero leakage) scheme imposes a 34% power overhead for equivalent performance (or a 30% performance overhead for equivalent power) relative to our dynamic scheme.","PeriodicalId":164587,"journal":{"name":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126541571","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BigDataBench: A big data benchmark suite from internet services","authors":"Lei Wang, Jianfeng Zhan, Chunjie Luo, Yuqing Zhu, Qiang Yang, Yongqiang He, Wanling Gao, Zhen Jia, Yingjie Shi, Shujie Zhang, Chen Zheng, Gang Lu, Kent Zhan, Xiaona Li, Bizhu Qiu","doi":"10.1109/HPCA.2014.6835958","DOIUrl":"https://doi.org/10.1109/HPCA.2014.6835958","url":null,"abstract":"As architecture, systems, and data management communities pay greater attention to innovative big data systems and architecture, the pressure of benchmarking and evaluating these systems rises. However, the complexity, diversity, frequently changed workloads, and rapid evolution of big data systems raise great challenges in big data benchmarking. Considering the broad use of big data systems, for the sake of fairness, big data benchmarks must include diversity of data and workloads, which is the prerequisite for evaluating big data systems and architecture. Most of the state-of-the-art big data benchmarking efforts target evaluating specific types of applications or system software stacks, and hence they are not qualified for serving the purposes mentioned above. This paper presents our joint research efforts on this issue with several industrial partners. Our big data benchmark suite-BigDataBench not only covers broad application scenarios, but also includes diverse and representative data sets. Currently, we choose 19 big data benchmarks from dimensions of application scenarios, operations/ algorithms, data types, data sources, software stacks, and application types, and they are comprehensive for fairly measuring and evaluating big data systems and architecture. BigDataBench is publicly available from the project home page http://prof.ict.ac.cn/BigDataBench. Also, we comprehensively characterize 19 big data workloads included in BigDataBench with varying data inputs. On a typical state-of-practice processor, Intel Xeon E5645, we have the following observations: First, in comparison with the traditional benchmarks: including PARSEC, HPCC, and SPECCPU, big data applications have very low operation intensity, which measures the ratio of the total number of instructions divided by the total byte number of memory accesses; Second, the volume of data input has non-negligible impact on micro-architecture characteristics, which may impose challenges for simulation-based big data architecture research; Last but not least, corroborating the observations in CloudSuite and DCBench (which use smaller data inputs), we find that the numbers of L1 instruction cache (L1I) misses per 1000 instructions (in short, MPKI) of the big data applications are higher than in the traditional benchmarks; also, we find that L3 caches are effective for the big data applications, corroborating the observation in DCBench.","PeriodicalId":164587,"journal":{"name":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","volume":"22 4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115565101","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Tangle: Route-oriented dynamic voltage minimization for variation-afflicted, energy-efficient on-chip networks","authors":"Amin Ansari, Asit K. Mishra, Jianping Xu, J. Torrellas","doi":"10.1109/HPCA.2014.6835953","DOIUrl":"https://doi.org/10.1109/HPCA.2014.6835953","url":null,"abstract":"On-chip networks are especially vulnerable to within-die parameter variations. Since they connect distant parts of the chip, they need to be designed to work under the most unfavorable parameter values in the chip. This results in energy-inefficient designs. To improve the energy efficiency of on-chip networks, this paper presents a novel approach that relies on monitoring the errors of messages as they traverse the network. Based on the observed errors of messages, the system dynamically decreases or increases the voltage (Vdd) of groups of network routers. With this approach, called Tangle, the different Vdd values applied to different groups of network routers progressively converge to their lowest, variation-aware, error-free values - always keeping the network frequency unchanged. This saves substantial network energy. In a simulated 64-router network with 4 Vdd domains, Tangle reduces the network energy consumption by an average of 22% with negligible performance impact. In a future network design with one Vdd domain per router, Tangle lowers the network Vdd by an average of 21%, reducing the network energy consumption by an average of 28% with negligible performance impact.","PeriodicalId":164587,"journal":{"name":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","volume":"81 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133898463","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}