2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)最新文献

筛选
英文 中文
FUSE: Fusing STT-MRAM into GPUs to Alleviate Off-Chip Memory Access Overheads FUSE:将STT-MRAM融合到gpu中,以减轻片外内存访问开销
Jie Zhang, Myoungsoo Jung, M. Kandemir
{"title":"FUSE: Fusing STT-MRAM into GPUs to Alleviate Off-Chip Memory Access Overheads","authors":"Jie Zhang, Myoungsoo Jung, M. Kandemir","doi":"10.1109/HPCA.2019.00055","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00055","url":null,"abstract":"In this work, we propose FUSE, a novel GPU cache system that integrates spin-transfer torque magnetic random-access memory (STT-MRAM) into the on-chip L1D cache. FUSE can minimize the number of outgoing memory accesses over the interconnection network of GPU's multiprocessors, which in turn can considerably improve the level of massive computing parallelism in GPUs. Specifically, FUSE predicts a read-level of GPU memory accesses by extracting GPU runtime information and places write-once-read-multiple (WORM) data blocks into the STT-MRAM, while accommodating write-multiple data blocks over a small portion of SRAM in the L1D cache. To further reduce the off-chip memory accesses, FUSE also allows WORM data blocks to be allocated anywhere in the STT-MRAM by approximating the associativity with the limited number of tag comparators and I/O peripherals. Our evaluation results show that, in comparison to a traditional GPU cache, our proposed heterogeneous cache reduces the number of outgoing memory references by 32% across the interconnection network, thereby improving the overall performance by 217% and reducing energy cost by 53%.","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123541233","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 8
Enabling Transparent Memory-Compression for Commodity Memory Systems 为商用内存系统启用透明内存压缩
Vinson Young, S. Kariyappa, Moinuddin K. Qureshi
{"title":"Enabling Transparent Memory-Compression for Commodity Memory Systems","authors":"Vinson Young, S. Kariyappa, Moinuddin K. Qureshi","doi":"10.1109/HPCA.2019.00010","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00010","url":null,"abstract":"","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123630728","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 19
Freeway: Maximizing MLP for Slice-Out-of-Order Execution 高速公路:最大化MLP的切片乱序执行
Rakesh Kumar, M. Alipour, D. Black-Schaffer
{"title":"Freeway: Maximizing MLP for Slice-Out-of-Order Execution","authors":"Rakesh Kumar, M. Alipour, D. Black-Schaffer","doi":"10.1109/HPCA.2019.00009","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00009","url":null,"abstract":"Exploiting memory level parallelism (MLP) is crucial to hide long memory and last level cache access latencies. While out-of-order (OoO) cores, and techniques building on them, are effective at exp ...","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"106 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120862627","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 19
Conditional Speculation: An Effective Approach to Safeguard Out-of-Order Execution Against Spectre Attacks 条件推测:防止幽灵攻击的一种有效方法
Peinan Li, Lutan Zhao, Rui Hou, Lixin Zhang, Dan Meng
{"title":"Conditional Speculation: An Effective Approach to Safeguard Out-of-Order Execution Against Spectre Attacks","authors":"Peinan Li, Lutan Zhao, Rui Hou, Lixin Zhang, Dan Meng","doi":"10.1109/HPCA.2019.00043","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00043","url":null,"abstract":"Speculative execution side-channel vulnerabilities such as Spectre reveal that conventional architecture designs lack security consideration. This paper proposes a software transparent defense mechanism, named as Conditional Speculation, against Spectre vulnerabilities found on traditional out-of-order microprocessors. It introduces the concept of security dependence to mark speculative memory instructions which could leak information with potential security risk. More specifically, security-dependent instructions are detected and marked with suspect speculation flags in the Issue Queue. All the instructions can be speculatively issued for execution in accordance with the classic out-of-order pipeline. For those instructions with suspect speculation flags, they are considered as safe instructions if their speculative execution will not refill new cache lines with unauthorized privilege data. Otherwise, they are considered as unsafe instructions and thus not allowed to execute speculatively. To reduce the performance impact from not executing unsafe instructions speculatively, we investigate two filtering mechanisms, Cachehit based Hazard Filter and Trusted Page Buffer based Hazard Filter to filter out false security hazards. Our design philosophy is to speculatively execute safe instructions to maintain the performance benefits of out-of-order execution while blocking the speculative execution of unsafe instructions for security consideration. We evaluate Conditional Speculation in terms of performance, security and area. The experimental results show that the hardware overhead is marginal and the performance overhead is minimal. Keywords-Spectre vulnerabilities defense; Security dependence; Speculative execution side-channel vulnerabilities;","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123077421","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 73
String Figure: A Scalable and Elastic Memory Network Architecture 字符串图:一个可伸缩和弹性的内存网络架构
Matheus A. Ogleari, Ye Yu, Chen Qian, E. Miller, Jishen Zhao
{"title":"String Figure: A Scalable and Elastic Memory Network Architecture","authors":"Matheus A. Ogleari, Ye Yu, Chen Qian, E. Miller, Jishen Zhao","doi":"10.1109/HPCA.2019.00016","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00016","url":null,"abstract":"Author(s): Ogleari, Matheus Almeida; Yu, Ye; Qian, Chen; Miller, Ethan L; Zhao, Jishen; IEEE","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"121 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131550941","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 10
A Hybrid Framework for Fast and Accurate GPU Performance Estimation through Source-Level Analysis and Trace-Based Simulation 通过源级分析和基于跟踪的仿真实现快速准确GPU性能估计的混合框架
Xiebing Wang, Kai Huang, A. Knoll, Xuehai Qian
{"title":"A Hybrid Framework for Fast and Accurate GPU Performance Estimation through Source-Level Analysis and Trace-Based Simulation","authors":"Xiebing Wang, Kai Huang, A. Knoll, Xuehai Qian","doi":"10.1109/HPCA.2019.00062","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00062","url":null,"abstract":"This paper proposes a hybrid framework for fast and accurate performance estimation of OpenCL kernels running on GPUs. The kernel execution flow is statically analyzed and thereupon the execution trace is generated via a loop-based bidirectional branch search. Then the trace is dynamically simulated to perform a dummy execution of the kernel to obtain the estimated time. The framework does not rely on profiling or measurement results which are used in conventional performance estimation techniques. Moreover, the lightweight trace-based simulation consumes much less time than a fine-grained GPU simulator. Our framework can accurately grasp the variation trend of the execution time in the design space and robustly predict the performance of the kernels across two generations of recent Nvidia GPU architectures. Experiments on four Commercial Off-The-Shelf (COTS) GPUs show that our framework can predict the runtime performance with average Mean Absolute Percentage Error (MAPE) of 17.04% and time consumption of a few seconds. We also demonstrate the practicability of our framework with a realworld application.","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128159014","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 18
Poly: Efficient Heterogeneous System and Application Management for Interactive Applications 高效异构系统和交互式应用程序管理
Shuo Wang, Yun Liang, Wei Zhang
{"title":"Poly: Efficient Heterogeneous System and Application Management for Interactive Applications","authors":"Shuo Wang, Yun Liang, Wei Zhang","doi":"10.1109/HPCA.2019.00038","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00038","url":null,"abstract":"QoS-sensitive workloads, common in warehousescale datacenters, require a guaranteed stable tail latency percentile response latency) of the service. Unfortunately, the system load (e.g., RPS) fluctuates drastically during daily datacenter operations. In order to meet the maximum system RPS requirement, datacenter tends to overprovision the hardware accelerators, which makes the datacenter underutilized.Therefore, the throughput and energy efficiency scaling of the current accelerator-outfitted datacenter are very expensive for QoS-sensitive workloads. To overcome this challenge, this work introduces Poly, an OpenCL based heterogeneous system optimization framework that targets to improve the overall throughput scalability and energy proportionality while guaranteeing the QoS by efficiently utilizing GPUs and FPGAs based accelerators within datacenter. Poly is mainly composed of two phases. At compile-time, Poly automatically captures the parallel patterns in the applications and explores a comprehensive design space within and across parallel patterns. At runtime, Poly relies on a runtime kernel scheduler to judiciously make the scheduling decisions to accommodate the dynamic latency and throughput requirements. Experiments using a variety of cloud QoS-sensitive applications show that Poly improves the energy proportionality by 23%(17%) without sacrificing the QoS compared to the state-of-the-art GPU (FPGA) solution, respectively. Keywords-Heterogeneous; GPU; FPGA; Performance Optimization;","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127512748","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 5
Understanding the Impact of Socket Density in Density Optimized Servers 了解Socket密度对密度优化服务器的影响
Manish Arora, Matt Skach, Wei Huang, Xudong An, Jason Mars, Lingjia Tang, D. Tullsen
{"title":"Understanding the Impact of Socket Density in Density Optimized Servers","authors":"Manish Arora, Matt Skach, Wei Huang, Xudong An, Jason Mars, Lingjia Tang, D. Tullsen","doi":"10.1109/HPCA.2019.00066","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00066","url":null,"abstract":"The increasing demand for computational power has led to the creation and deployment of large-scale data centers. During the last few years, data centers have seen improvements aimed at increasing computational density – the amount of throughput that can be achieved within the allocated physical footprint. This need to pack more compute in the same physical space has led to density optimized server designs. Density optimized servers push compute density significantly beyond what can be achieved by blade servers by using innovative modular chassis based designs. This paper presents a comprehensive analysis of the impact of socket density on intra-server thermals and demonstrates that increased socket density inside the server leads to large temperature variations among sockets due to inter-socket thermal coupling. The paper shows that traditional chip-level and data center-level temperature-aware scheduling techniques do not work well for thermally-coupled sockets. The paper proposes new scheduling techniques that account for the thermals of the socket a task is scheduled on, as well as thermally coupled nearby sockets. The proposed mechanisms provide 2.5% to 6.5% performance improvements across various workloads and as much as 17% over traditional temperature-aware schedulers for computation-heavy workloads. Keywords-Server; Data center; Density Optimized Server; Scheduling","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"129 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132293164","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 2
Kelp: QoS for Accelerated Machine Learning Systems 加速机器学习系统的QoS
Haishan Zhu, David Lo, Liqun Cheng, R. Govindaraju, Parthasarathy Ranganathan, M. Erez
{"title":"Kelp: QoS for Accelerated Machine Learning Systems","authors":"Haishan Zhu, David Lo, Liqun Cheng, R. Govindaraju, Parthasarathy Ranganathan, M. Erez","doi":"10.1109/HPCA.2019.00036","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00036","url":null,"abstract":"Development and deployment of machine learning (ML) accelerators in Warehouse Scale Computers (WSCs) demand significant capital investments and engineering efforts. However, even though heavy computation can be offloaded to the accelerators, applications often depend on the host system for various supporting tasks. As a result, contention on host resources, such as memory bandwidth, can significantly discount the performance and efficiency gains of accelerators. The impact of performance interference is further amplified in distributed learning, which has become increasingly common as model sizes continue to grow. In this work, we study the performance of four production machine learning workloads on three accelerator platforms. Our experiments show that these workloads are highly sensitive to host memory bandwidth contention, which can cause 40% average performance degradation when left unmanaged. To tackle this problem, we design and implement Kelp, a software runtime that isolates high priority accelerated ML tasks from memory resource interference. We evaluate Kelp with both production and artificial aggressor workloads, and compare its effectiveness with previously proposed solutions. Our evaluation shows that Kelp is effective in mitigating performance degradation of the accelerated tasks, and improves performance by 24% on average. Compared to previous work, Kelp reduces performance degradation of ML tasks by 7% and improves system efficiency by 17%. Our results further expose opportunities in future architecture designs.","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128606134","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 15
Reliability Evaluation of Mixed-Precision Architectures 混合精度体系结构的可靠性评估
F. Santos, Caio B. Lunardi, Daniel Oliveira, F. Libano, P. Rech
{"title":"Reliability Evaluation of Mixed-Precision Architectures","authors":"F. Santos, Caio B. Lunardi, Daniel Oliveira, F. Libano, P. Rech","doi":"10.1109/HPCA.2019.00041","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00041","url":null,"abstract":"","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"179 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133871939","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 23
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
相关产品
×
本文献相关产品
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信