Mingxing Zhang, Youwei Zhuo, Chao Wang, Mingyu Gao, Yongwei Wu, Kang Chen, C. Kozyrakis, Xuehai Qian
{"title":"GraphP: Reducing Communication for PIM-Based Graph Processing with Efficient Data Partition","authors":"Mingxing Zhang, Youwei Zhuo, Chao Wang, Mingyu Gao, Yongwei Wu, Kang Chen, C. Kozyrakis, Xuehai Qian","doi":"10.1109/HPCA.2018.00053","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00053","url":null,"abstract":"Processing-In-Memory (PIM) is an effective technique that reduces data movements by integrating processing units within memory. The recent advance of “big data” and 3D stacking technology make PIM a practical and viable solution for the modern data processing workloads. It is exemplified by the recent research interests on PIM-based acceleration. Among them, TESSERACT is a PIM-enabled parallel graph processing architecture based on Micron’s Hybrid Memory Cube (HMC), one of the most prominent 3D-stacked memory technologies. It implements a Pregel-like vertex-centric programming model, so that users could develop programs in the familiar interface while taking advantage of PIM. Despite the orders of magnitude speedup compared to DRAM-based systems, TESSERACT generates excessive cross-cube communications through SerDes links, whose bandwidth is much less than the aggregated local bandwidth of HMCs. Our investigation indicates that this is because of the restricted data organization required by the vertex programming model. In this paper, we argue that a PIM-based graph processing system should take data organization as a first-order design consideration. Following this principle, we propose GraphP, a novel HMC-based software/hardware co-designed graph processing system that drastically reduces communication and energy consumption compared to TESSERACT. GraphP features three key techniques. 1) “Source-cut” partitioning, which fundamentally changes the cross-cube communication from one remote put per cross-cube edge to one update per replica. 
2) “Two-phase Vertex Program”, a programming model designed for the “source-cut” partitioning with two operations: GenUpdate and ApplyUpdate. 3) Hierarchical communication and overlapping, which further improves performance with unique opportunities offered by the proposed partitioning and programming model. We evaluate GraphP using a cycle-accurate simulator with 5 real-world graphs and 4 algorithms. The results show that it provides on average 1.7x speedup and 89% energy saving compared to TESSERACT.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123717161","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hamid Tabani, J. Arnau, Jordi Tubella, Antonio González
{"title":"A Novel Register Renaming Technique for Out-of-Order Processors","authors":"Hamid Tabani, J. Arnau, Jordi Tubella, Antonio González","doi":"10.1109/HPCA.2018.00031","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00031","url":null,"abstract":"Modern superscalar processors support a large number of in-flight instructions, which requires sizeable register files. Conventional register renaming techniques allocate a new storage location, i.e. physical register, for every instruction whose destination is a logical register in order to remove false dependences. Physical registers are released in a conservative manner when the same logical register is redefined. For this reason, many cycles may happen between the last read and the release of a physical register, leading to suboptimal utilization of the register file. We have observed that for more than 50% of the instructions in SPECfp and more than 30% of the instructions in SPECint that have a destination register, the produced value has only a single consumer. In this case, the RAW dependence guarantees that the producer-consumer instructions pair will be executed in program order and, hence, the same physical register can be used to store the value produced by both instructions. In this paper, we propose a renaming technique that exploits this property to reduce the pressure on the register file. Our technique leverages physical register sharing by introducing minor changes in the register map table and the issue queue. We also describe how our renaming scheme supports precise exceptions. We evaluated our renaming technique on top of a modern out-of-order processor. Our experimental results show that it provides 6% speedup on average for the SPEC2006 benchmarks. 
Alternatively, our renaming scheme achieves the same performance while reducing the number of physical registers by 10.5%.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121067534","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Scott Van Winkle, Avinash Karanth Kodi, Razvan C. Bunescu, A. Louri
{"title":"Extending the Power-Efficiency and Performance of Photonic Interconnects for Heterogeneous Multicores with Machine Learning","authors":"Scott Van Winkle, Avinash Karanth Kodi, Razvan C. Bunescu, A. Louri","doi":"10.1109/HPCA.2018.00048","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00048","url":null,"abstract":"As communication energy exceeds computation energy in future technologies, traditional on-chip electrical interconnects face fundamental challenges in the many-core era. Photonic interconnects have been proposed as a disruptive technology solution due to superior performance per Watt, distance independent energy consumption and CMOS compatibility for on-chip interconnects. Static power due to the laser being always switched on, varying link utilization due to spatial and temporal traffic fluctuations and thermal sensitivity are some of the critical challenges facing photonics interconnects. In this paper, we propose photonic interconnects for heterogeneous multicores using a checkerboard pattern that clusters CPU-GPU cores together and implements bandwidth reconfiguration using local router information without global coordination. To reduce the static power, we also propose a dynamic laser scaling technique that predicts the power level for the next epoch using the buffer occupancy of previous epoch. To further improve power-performance trade-offs, we also propose a regression-based machine learning technique for scaling the power of the photonic link. Our simulation results demonstrate a 34% performance improvement over a baseline electrical CMESH while consuming 25% less energy per bit when dynamically reallocating bandwidth. 
When dynamically scaling laser power, our buffer-based reactive and ML-based proactive prediction techniques show 40 - 65% in power savings with 0 - 14% in throughput loss depending on the reservation window size.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116262108","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"G-TSC: Timestamp Based Coherence for GPUs","authors":"Abdulaziz Tabbakh, Xuehai Qian, M. Annavaram","doi":"10.1109/HPCA.2018.00042","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00042","url":null,"abstract":"Cache coherence has been studied extensively in the context of chip multiprocessors (CMP). It is well known that conventional directory-based and snooping coherence protocols generate considerable coherence traffic as the number of hardware thread contexts increases. Since GPUs support hundreds or even thousands of threads, conventional coherence mechanisms when applied to GPUs will exacerbate the bandwidth constraints that GPUs already face. Recognizing this constraint, prior work has proposed time-based coherence protocols. The main idea is to assign a lease period to the accessed cache block, and after the lease expires the cache block is self-invalidated. However, time-based coherence protocols require global synchronized clocks. Furthermore, this approach may increase execution stalls since threads have to wait to access data with an unexpired lease. Tardis is a timestamp-based coherence protocol that has been proposed recently to alleviate the need for global clocks in CPUs. This paper builds on this prior work and proposes G-TSC, a novel cache coherence protocol for GPUs that is based on timestamp ordering. G-TSC conducts its coherence transactions in logical time. This work demonstrates the challenges in adopting timestamp coherence for GPUs which support massive thread parallelism and have unique microarchitecture features. This work then presents a number of solutions that tackle GPU-centric challenges. 
Evaluation of G-TSC implemented in the GPGPU-Sim simulation framework shows that G-TSC outperforms time-based coherence by 38% with release consistency.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123085926","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Amdahl's Law in the Datacenter Era: A Market for Fair Processor Allocation","authors":"S. Zahedi, Qiuyun Llull, Benjamin C. Lee","doi":"10.1109/HPCA.2018.00011","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00011","url":null,"abstract":"We present a processor allocation framework that uses Amdahl's Law to model parallel performance and a market mechanism to allocate cores. First, we propose the Amdahl utility function and demonstrate its accuracy when modeling performance from processor core allocations. Second, we design a market based on Amdahl utility that optimizes users' bids for processors based on workload parallelizability. The framework uses entitlements to guarantee fairness yet outperforms existing proportional share algorithms.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121727489","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fawaz Alazemi, Arash AziziMazreah, B. Bose, Lizhong Chen
{"title":"Routerless Network-on-Chip","authors":"Fawaz Alazemi, Arash AziziMazreah, B. Bose, Lizhong Chen","doi":"10.1109/HPCA.2018.00049","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00049","url":null,"abstract":"Traditional bus-based interconnects are simple and easy to implement, but the scalability is greatly limited. While router-based networks-on-chip (NoCs) offer superior scalability, they also incur significant power and area overhead due to complex router structures. In this paper, we explore a new class of on-chip networks, referred to as Routerless NoCs, where routers are completely eliminated. We propose a novel design that utilizes on-chip wiring resources smartly to achieve comparable hop count and scalability as router-based NoCs. Several effective techniques are also proposed that significantly reduce the resource requirement to avoid new network abnormalities in routerless NoC designs. Evaluation results show that, compared with a conventional mesh, the proposed routerless NoC achieves 9.5X reduction in power, 7.2X reduction in area, 2.5X reduction in zero-load packet latency, and 1.7X increase in throughput. 
Compared with a state-of-the-art low-cost NoC design, the proposed approach achieves 7.7X reduction in power, 3.3X reduction in area, 1.3X reduction in zero-load packet latency, and 1.6X increase in throughput.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124018034","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Peng Wang, Shuo Li, Guangyu Sun, Xiaoyang Wang, Yiran Chen, Hai Helen Li, J. Cong, Nong Xiao, Zhang Tao
{"title":"RC-NVM: Enabling Symmetric Row and Column Memory Accesses for In-memory Databases","authors":"Peng Wang, Shuo Li, Guangyu Sun, Xiaoyang Wang, Yiran Chen, Hai Helen Li, J. Cong, Nong Xiao, Zhang Tao","doi":"10.1109/HPCA.2018.00051","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00051","url":null,"abstract":"Ever increasing DRAM capacity has fostered the development of in-memory databases (IMDB). The massive performance improvements provided by IMDBs have enabled transactions and analytics on the same database. In other words, the integration of OLTP (on-line transactional processing) and OLAP (on-line analytical processing) systems is becoming a general trend. However, conventional DRAM-based main memory is optimized for row-oriented accesses generated by OLTP workloads in row-based databases. OLAP queries scanning on specified columns cause so-called strided accesses and result in poor memory performance. Since memory access latency dominates in IMDB processing time, it can degrade overall performance significantly. To overcome this problem, we propose a dual-addressable memory architecture based on non-volatile memory, called RC-NVM, to support both row-oriented and column-oriented accesses. We first present circuit-level analysis to prove that such a dual-addressable architecture is only practical with RC-NVM rather than DRAM technology. Then, we rethink the addressing schemes, data layouts, cache synonym, and coherence issues of RC-NVM at the architectural level to make it applicable for IMDBs. Finally, we propose a group caching technique that combines the IMDB knowledge with the memory architecture to further optimize the system. 
Experimental results show that the memory access performance can be improved up to 14.5X with only 15% area overhead.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125551899","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Wait of a Decade: Did SPEC CPU 2017 Broaden the Performance Horizon?","authors":"Reena Panda, Shuang Song, Joseph Dean, L. John","doi":"10.1109/HPCA.2018.00032","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00032","url":null,"abstract":"The recently released SPEC CPU2017 benchmark suite has already started receiving a lot of attention from both industry and academic communities. However, due to the significantly high size and complexity of the benchmarks, simulating all the CPU2017 benchmarks for design trade-off evaluation is likely to become extremely difficult. Simulating a randomly selected subset, or a random input set, may result in misleading conclusions. This paper analyzes the SPEC CPU2017 benchmarks using performance counter based experimentation from seven commercial systems, and uses statistical techniques such as principal component analysis and clustering to identify similarities among benchmarks. Such analysis can reveal benchmark redundancies and identify subsets for researchers who cannot use all benchmarks in pre-silicon design trade-off evaluations. Many of the SPEC CPU2006 benchmarks have been replaced with larger and complex workloads in the SPEC CPU2017 suite. However, compared to CPU2006, it is unknown whether SPEC CPU2017 benchmarks have different performance demands or whether they stress machines differently. Additionally, to evaluate the balance of CPU2017 benchmarks, we analyze the performance characteristics of CPU2017 workloads and compare them with emerging database, graph analytics and electronic design automation (EDA) workloads. 
This paper provides the first detailed analysis of SPEC CPU2017 benchmark suite for the architecture community.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132400044","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Amdahl's Law in Big Data Analytics: Alive and Kicking in TPCx-BB (BigBench)","authors":"Daniel Richins, Tahrina Ahmed, R. Clapp, V. Reddi","doi":"10.1109/HPCA.2018.00060","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00060","url":null,"abstract":"Big data, specifically data analytics, is responsible for driving many of consumers' most common online activities, including shopping, web searches, and interactions on social media. In this paper, we present the first (micro)architectural investigation of a new industry-standard, open source benchmark suite directed at big data analytics applications—TPCx-BB (BigBench). Where previous work has usually studied benchmarks which oversimplify big data analytics, our study of BigBench reveals that there is immense diversity among applications, owing to their varied data types, computational paradigms, and analyses. In our analysis, we also make an important discovery generally restricting processor performance in big data. Contrary to conventional wisdom that big data applications lend themselves naturally to parallelism, we discover that they lack sufficient thread-level parallelism (TLP) to fully utilize all cores. In other words, they are constrained by Amdahl's law. While TLP may be limited by various factors, ultimately we find that single-thread performance is as relevant in scale-out workloads as it is in more classical applications. 
To this end we present core packing: a software and hardware solution that could provide as much as 20% execution speedup for some big data analytics applications.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133437964","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enabling Efficient Network Service Function Chain Deployment on Heterogeneous Server Platform","authors":"Yang Hu, Tao Li","doi":"10.1109/HPCA.2018.00013","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00013","url":null,"abstract":"Network Function Virtualization (NFV) aims to run software-implemented network functions on general hardware such as Commodity Off-the-Shelf (COTS) servers to trade the application-specific performance with generality and re-configurability. Nevertheless, with the wide adoption of general accelerators such as GPU, the researchers seek to boost the performance of software-based network functions while trying to maintain the reusability and programmability in the meantime. The Service Function Chain (SFC) is a key enabler of service flexibility of NFV. The network functions stitch into a chain to provide differentiated services to multi-tenants. However, our characterization results show that existing heterogeneous packet processing frameworks do not handle NFV SFC well since two new overheads, the aggregated processing overheads and co-existence interference overheads, are introduced by SFC. Motivated by our characterization, we propose NFCompass, a runtime framework that employs SFC re-organization technique and graph-partition based task scheduling technique to conquer the two challenges brought by SFC. By re-organizing the SFC components, the length and complexity of processing paths are reduced and the aggregated overheads are mitigated. 
By applying the graph-partition based task allocation, better load balance is achieved and the data transfer overheads are considerably reduced.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125414598","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}