{"title":"Spare register aware prefetching for graph algorithms on GPUs","authors":"Nagesh B. Lakshminarayana, Hyesoon Kim","doi":"10.1109/HPCA.2014.6835970","DOIUrl":"https://doi.org/10.1109/HPCA.2014.6835970","url":null,"abstract":"More and more graph algorithms are being GPU enabled. Graph algorithm implementations on GPUs have irregular control flow and are memory-intensive with many irregular/data-dependent memory accesses. Due to these factors graph algorithms on GPUs have low execution efficiency. In this work we propose a mechanism to improve the execution efficiency of graph algorithms by improving their memory access latency tolerance. We propose a mechanism for prefetching data for load pairs that have one load dependent on the other - such pairs are common in graph algorithms. Our mechanism detects the target loads in hardware and injects instructions into the pipeline to prefetch data into spare registers that are not being used by any active threads. By prefetching data into registers, early eviction of prefetched data can be eliminated. We also propose a mechanism that uses the compiler to identify the target loads. Our mechanism improves performance over no prefetching by 10% on average and upto 51% for nine memory intensive graph algorithm kernels.","PeriodicalId":164587,"journal":{"name":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133436668","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Undersubscribed threading on clustered cache architectures","authors":"W. Heirman, Trevor E. Carlson, K. V. Craeynest, I. Hur, A. Jaleel, L. Eeckhout","doi":"10.1109/HPCA.2014.6835975","DOIUrl":"https://doi.org/10.1109/HPCA.2014.6835975","url":null,"abstract":"Recent many-core processors such as Intel's Xeon Phi and GPGPUs specialize in running highly scalable parallel applications at high performance while simultaneously embracing energy efficiency as a first-order design constraint. The traditional belief is that full utilization of all available cores also translates into the highest possible performance. In this paper, we study the effects of cache capacity conflicts and competition for shared off-chip bandwidth; and show that undersubscription, or not utilizing all cores, often yields significant increases in both performance and energy efficiency. Based on a detailed shared working set analysis we make the case for clustered cache architectures as an efficient design point for exploiting both data sharing and undersubscription, while providing low-latency and ease of implementation in many-core processors. We then propose ClusteR-aware Undersubscribed Scheduling of Threads (CRUST) which dynamically matches an application's working set size and off-chip bandwidth demands with the available on-chip cache capacity and off-chip bandwidth. CRUST improves application performance and energy efficiency by 15% on average, and up to 50%, for the NPB and SPEC OMP benchmarks. In addition, we make recommendations for the design of future many-core architectures, and show that taking the undersubscription usage model into account moves the optimum performance under the cores-versus-cache area tradeoff towards design points with more cores and less cache.","PeriodicalId":164587,"journal":{"name":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127399484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalably verifiable dynamic power management","authors":"Opeoluwa Matthews, Meng Zhang, Daniel J. Sorin","doi":"10.1109/HPCA.2014.6835967","DOIUrl":"https://doi.org/10.1109/HPCA.2014.6835967","url":null,"abstract":"Dynamic power management (DPM) is critical to maximizing the performance of systems ranging from multicore processors to datacenters. However, one formidable challenge with DPM schemes is verifying that the DPM schemes are correct as the number of computational resources scales up. In this paper, we develop a DPM scheme such that it is scalably verifiable with fully automated formal tools. The key to the design is that the DPM scheme has fractal behavior; that is, it behaves the same at every scale. We show that the fractal design enables scalable formal verification and simulation shows that our scheme does not sacrifice much performance compared to an oracle DPM scheme that optimally allocates power to computational resources. We implement our scheme in a 2-socket 16-core x86 system and experimentally evaluate it.","PeriodicalId":164587,"journal":{"name":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115278189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Suppressing the Oblivious RAM timing channel while making information leakage and program efficiency trade-offs","authors":"Christopher W. Fletcher, Ling Ren, Xiangyao Yu, Marten van Dijk, O. Khan, S. Devadas","doi":"10.1109/HPCA.2014.6835932","DOIUrl":"https://doi.org/10.1109/HPCA.2014.6835932","url":null,"abstract":"Oblivious RAM (ORAM) is an established cryptographic technique to hide a program's address pattern to an untrusted storage system. More recently, ORAM schemes have been proposed to replace conventional memory controllers in secure processor settings to protect against information leakage in external memory and the processor I/O bus. A serious problem in current secure processor ORAM proposals is that they don't obfuscate when ORAM accesses are made, or do so in a very conservative manner. Since secure processors make ORAM accesses on last-level cache misses, ORAM access timing strongly correlates to program access pattern (e.g., locality). This brings ORAM's purpose in secure processors into question. This paper makes two contributions. First, we show how a secure processor can bound ORAM timing channel leakage to a user-controllable leakage limit. The secure processor is allowed to dynamically optimize ORAM access rate for power/performance, subject to the constraint that the leakage limit is not violated. Second, we show how changing the leakage limit impacts program efficiency. We present a dynamic scheme that leaks at most 32 bits through the ORAM timing channel and introduces only 20% performance overhead and 12% power overhead relative to a baseline ORAM that has no timing channel protection. By reducing leakage to 16 bits, our scheme degrades in performance by 5% but gains in power efficiency by 3%. We show that a static (zero leakage) scheme imposes a 34% power overhead for equivalent performance (or a 30% performance overhead for equivalent power) relative to our dynamic scheme.","PeriodicalId":164587,"journal":{"name":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126541571","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BigDataBench: A big data benchmark suite from internet services","authors":"Lei Wang, Jianfeng Zhan, Chunjie Luo, Yuqing Zhu, Qiang Yang, Yongqiang He, Wanling Gao, Zhen Jia, Yingjie Shi, Shujie Zhang, Chen Zheng, Gang Lu, Kent Zhan, Xiaona Li, Bizhu Qiu","doi":"10.1109/HPCA.2014.6835958","DOIUrl":"https://doi.org/10.1109/HPCA.2014.6835958","url":null,"abstract":"As architecture, systems, and data management communities pay greater attention to innovative big data systems and architecture, the pressure of benchmarking and evaluating these systems rises. However, the complexity, diversity, frequently changed workloads, and rapid evolution of big data systems raise great challenges in big data benchmarking. Considering the broad use of big data systems, for the sake of fairness, big data benchmarks must include diversity of data and workloads, which is the prerequisite for evaluating big data systems and architecture. Most of the state-of-the-art big data benchmarking efforts target evaluating specific types of applications or system software stacks, and hence they are not qualified for serving the purposes mentioned above. This paper presents our joint research efforts on this issue with several industrial partners. Our big data benchmark suite-BigDataBench not only covers broad application scenarios, but also includes diverse and representative data sets. Currently, we choose 19 big data benchmarks from dimensions of application scenarios, operations/ algorithms, data types, data sources, software stacks, and application types, and they are comprehensive for fairly measuring and evaluating big data systems and architecture. BigDataBench is publicly available from the project home page http://prof.ict.ac.cn/BigDataBench. Also, we comprehensively characterize 19 big data workloads included in BigDataBench with varying data inputs. On a typical state-of-practice processor, Intel Xeon E5645, we have the following observations: First, in comparison with the traditional benchmarks: including PARSEC, HPCC, and SPECCPU, big data applications have very low operation intensity, which measures the ratio of the total number of instructions divided by the total byte number of memory accesses; Second, the volume of data input has non-negligible impact on micro-architecture characteristics, which may impose challenges for simulation-based big data architecture research; Last but not least, corroborating the observations in CloudSuite and DCBench (which use smaller data inputs), we find that the numbers of L1 instruction cache (L1I) misses per 1000 instructions (in short, MPKI) of the big data applications are higher than in the traditional benchmarks; also, we find that L3 caches are effective for the big data applications, corroborating the observation in DCBench.","PeriodicalId":164587,"journal":{"name":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","volume":"22 4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115565101","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Tangle: Route-oriented dynamic voltage minimization for variation-afflicted, energy-efficient on-chip networks","authors":"Amin Ansari, Asit K. Mishra, Jianping Xu, J. Torrellas","doi":"10.1109/HPCA.2014.6835953","DOIUrl":"https://doi.org/10.1109/HPCA.2014.6835953","url":null,"abstract":"On-chip networks are especially vulnerable to within-die parameter variations. Since they connect distant parts of the chip, they need to be designed to work under the most unfavorable parameter values in the chip. This results in energy-inefficient designs. To improve the energy efficiency of on-chip networks, this paper presents a novel approach that relies on monitoring the errors of messages as they traverse the network. Based on the observed errors of messages, the system dynamically decreases or increases the voltage (Vdd) of groups of network routers. With this approach, called Tangle, the different Vdd values applied to different groups of network routers progressively converge to their lowest, variation-aware, error-free values - always keeping the network frequency unchanged. This saves substantial network energy. In a simulated 64-router network with 4 Vdd domains, Tangle reduces the network energy consumption by an average of 22% with negligible performance impact. In a future network design with one Vdd domain per router, Tangle lowers the network Vdd by an average of 21%, reducing the network energy consumption by an average of 28% with negligible performance impact.","PeriodicalId":164587,"journal":{"name":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","volume":"81 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133898463","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}