{"title":"Symbiotic job scheduling on the IBM POWER8","authors":"Josué Feliu, Stijn Eyerman, J. Sahuquillo, S. Petit","doi":"10.1109/HPCA.2016.7446103","DOIUrl":"https://doi.org/10.1109/HPCA.2016.7446103","url":null,"abstract":"Simultaneous multithreading (SMT) processors share most of the microarchitectural core components among the co-running applications. The competition for shared resources causes performance interference between applications. Therefore, the performance benefits of SMT processors heavily depend on the complementarity of the co-running applications. Symbiotic job scheduling, i.e., scheduling applications that co-run well together on a core, can have a considerable impact on the performance of a processor with SMT cores. Prior work uses sampling or novel hardware support to perform symbiotic job scheduling, which has either a non-negligible overhead or is impossible to use on existing hardware. This paper proposes a symbiotic job scheduler for the IBM POWER8 processor. We leverage the existing cycle accounting mechanism to predict symbiosis between applications, and use that information at run-time to decide which applications should run on the same core or on separate cores. We implement the scheduler in the Linux operating system and evaluate it on an IBM POWER8 server running multiprogrammed workloads. The symbiotic job scheduler significantly improves performance compared to both an agnostic random scheduler and the default Linux scheduler. 
With respect to Linux, it achieves an average speedup of 8.8% for workloads comprising 12 applications, and of 4.7% on average across all evaluated workloads.","PeriodicalId":417994,"journal":{"name":"2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125738136","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient synthetic traffic models for large, complex SoCs","authors":"Jieming Yin, Onur Kayiran, Matthew Poremba, Natalie D. Enright Jerger, G. Loh","doi":"10.1109/HPCA.2016.7446073","DOIUrl":"https://doi.org/10.1109/HPCA.2016.7446073","url":null,"abstract":"The interconnect or network on chip (NoC) is an increasingly important component in processors. As systems scale up in size and functionality, the ability to efficiently model larger and more complex NoCs becomes increasingly important to the design and evaluation of such systems. Recent work proposed the \"SynFull\" methodology that performs statistical analysis of a workload's NoC traffic to create compact traffic generators based on Markov models. While the models generate synthetic traffic, the traffic is statistically similar to the original trace and can be used for fast NoC simulation. However, the original SynFull work only evaluated multi-core CPU scenarios with a very simple cache coherence protocol (MESI). We find the original SynFull methodology to be insufficient when modeling the NoC of a more complex system on a chip (SoC). We identify and analyze the shortcomings of SynFull in the context of a SoC consisting of a heterogeneous architecture (CPU and GPU), a more complex cache hierarchy including support for full coherence between CPU, GPU, and shared caches, and heterogeneous workloads. We introduce new techniques to address these shortcomings. Furthermore, the original SynFull methodology can only model a NoC with N nodes when the original application analysis is performed on an identically-sized N-node system, but one typically wants to model larger future systems. Therefore, we introduce new techniques to enable SynFull-like analysis to be extrapolated to model such larger systems. Finally, we present a novel synthetic memory reference model to replace SynFull's fixed latency model; this allows more realistic evaluation of the memory subsystem and its interaction with the NoC. 
The result is a robust NoC simulation methodology that works for large, heterogeneous SoC architectures.","PeriodicalId":417994,"journal":{"name":"2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134270911","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Design and implementation of a mobile storage leveraging the DRAM interface","authors":"Sungyong Seo, Youngjin Cho, Y. Yoo, Otae Bae, Jaegeun Park, Heehyun Nam, Sunmi Lee, Yongmyung Lee, Seungdo Chae, Moonsang Kwon, Jin-Hyeok Choi, Sangyeun Cho, Jaeheon Jeong, Duckhyun Chang","doi":"10.1109/HPCA.2016.7446092","DOIUrl":"https://doi.org/10.1109/HPCA.2016.7446092","url":null,"abstract":"Storage I/O performance remains a key factor that determines the overall user experience of a computer system. This is especially true for mobile systems as users commonly browse and navigate through many high-quality pictures and video clips stored in their device. The appetite for more appealing user interface has continuously pushed the mobile storage interface speed up; emerging UFS 2.0 standard provisions a maximum bandwidth of as high as 1,200 MB/s. In this work, we propose, design, and implement a mobile storage architecture that leverages the high-speed DRAM interface for communication, thus substantially expanding the storage performance headroom. In order to effectively turn the existing DRAM interface into a storage interface, we design a new storage protocol that runs on top of the DRAM interface. Our protocol builds on a small host interface buffer structure mapped to the system's memory space. Based on this protocol, we develop and fabricate a storage controller chip that natively supports the LPDDR3 interface. We also develop a host software stack (Linux device driver and boot loader) and a host platform board. Finally we show the feasibility of our proposal by constructing a full Android system running on the developed storage device and platform. Our detailed evaluation shows that the proposed storage architecture has very low protocol handling overheads and compares favorably to a UFS 2.0 device. 
The proposed architecture obviates the need for implementing a separate host-side storage controller on a mobile CPU chip.","PeriodicalId":417994,"journal":{"name":"2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131003864","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Memristive Boltzmann machine: A hardware accelerator for combinatorial optimization and deep learning","authors":"M. N. Bojnordi, Engin Ipek","doi":"10.1109/HPCA.2016.7446049","DOIUrl":"https://doi.org/10.1109/HPCA.2016.7446049","url":null,"abstract":"The Boltzmann machine is a massively parallel computational model capable of solving a broad class of combinatorial optimization problems. In recent years, it has been successfully applied to training deep machine learning models on massive datasets. High performance implementations of the Boltzmann machine using GPUs, MPI-based HPC clusters, and FPGAs have been proposed in the literature. Regrettably, the required all-to-all communication among the processing units limits the performance of these efforts. This paper examines a new class of hardware accelerators for large-scale combinatorial optimization and deep learning based on memristive Boltzmann machines. A massively parallel, memory-centric hardware accelerator is proposed based on recently developed resistive RAM (RRAM) technology. The proposed accelerator exploits the electrical properties of RRAM to realize in situ, fine-grained parallel computation within memory arrays, thereby eliminating the need for exchanging data between the memory cells and the computational units. Two classical optimization problems, graph partitioning and Boolean satisfiability, and a deep belief network application are mapped onto the proposed hardware. As compared to a multicore system, the proposed accelerator achieves 57x higher performance and 25x lower energy with virtually no loss in the quality of the solution to the optimization problems. 
The memristive accelerator is also compared against an RRAM based processing-in-memory (PIM) system, with respective performance and energy improvements of 6.89x and 5.2x.","PeriodicalId":417994,"journal":{"name":"2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"115 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124412651","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A low-power hybrid reconfigurable architecture for resistive random-access memories","authors":"M. Lastras-Montaño, A. Ghofrani, K. Cheng","doi":"10.1109/HPCA.2016.7446057","DOIUrl":"https://doi.org/10.1109/HPCA.2016.7446057","url":null,"abstract":"Access-transistor-free memristive crossbars have been shown to be excellent candidates for next generation non-volatile memories. While the elimination of the transistor per memory element enables higher memory densities, it also introduces parasitic currents during the normal operation of the memory that increase both the overall power consumption of the crossbar and the current requirements of the line drivers. In this work we present a hybrid reconfigurable memory architecture that takes advantage of the fact that a complementary resistive switch (CRS) can behave both as a memristor and as a CRS. By dynamically keeping frequently accessed regions of the memory in the memristive mode and others in the CRS mode, our hybrid memory offers all the benefits that a memristor and a CRS offer individually, without any of their drawbacks. We validate our architecture using the SPEC CPU2006 benchmark and find that our hybrid memory offers average energy savings of 3.6x with respect to a memristive-only memory. 
In addition, we can offer a memory lifetime that is, on average, 6.4x longer than that of a CRS-only memory.","PeriodicalId":417994,"journal":{"name":"2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115998385","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LiveSim: Going live with microarchitecture simulation","authors":"Sina Hassani, G. Southern, Jose Renau","doi":"10.1109/HPCA.2016.7446098","DOIUrl":"https://doi.org/10.1109/HPCA.2016.7446098","url":null,"abstract":"Computer architects rely heavily on software-based microarchitecture simulators, which typically take hours or days to produce results. We have developed LiveSim, a novel microarchitectural simulation methodology that provides simulation results within seconds, making it suitable for interactive use. LiveSim works by creating in-memory checkpoints of application state, and then executing randomly selected samples from these checkpoints in parallel to produce simulation results. The initial results, which we call LiveSample, are reported less than one second after starting the simulation. As more samples are simulated the results become more accurate and are updated in real-time. Once enough samples are gathered, LiveSim provides confidence intervals for the reported values and continues simulation until it reaches the target confidence level, which we call LiveCI. 
We evaluated LiveSim using SPEC CPU 2006 benchmarks and found that within 5 seconds after starting simulation, LiveSample results reached an average error of 3.51% compared to full simulation, and the LiveCI results were available within 41 seconds on average.","PeriodicalId":417994,"journal":{"name":"2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122397557","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Venice: Exploring server architectures for effective resource sharing","authors":"Jianbo Dong, Rui Hou, Michael C. Huang, Tao Jiang, Boyan Zhao, S. Mckee, Haibin Wang, Xiaosong Cui, Lixin Zhang","doi":"10.1109/HPCA.2016.7446090","DOIUrl":"https://doi.org/10.1109/HPCA.2016.7446090","url":null,"abstract":"Consolidated server racks are quickly becoming the backbone of IT infrastructure for science, engineering, and business, alike. These servers are still largely built and organized as when they were distributed, individual entities. Given that many fields increasingly rely on analytics of huge datasets, it makes sense to support flexible resource utilization across servers to improve cost-effectiveness and performance. We introduce Venice, a family of data-center server architectures that builds a strong communication substrate as a first-class resource for server chips. Venice provides a diverse set of resource-joining mechanisms that enables user programs to efficiently leverage non-local resources. To better understand the implications of design decisions about system support for resource sharing we have constructed a hardware prototype that allows us to more accurately measure end-to-end performance of at-scale applications and to explore tradeoffs among performance, power, and resource-sharing transparency. We present results from our initial studies analyzing these tradeoffs when sharing memory, accelerators, or NICs. 
We find that it is particularly important to reduce or hide latency, that data-sharing access patterns should match the features of the communication channels employed, and that inter-channel collaboration can be exploited for better performance.","PeriodicalId":417994,"journal":{"name":"2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123820948","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mobile CPU's rise to power: Quantifying the impact of generational mobile CPU design trends on performance, energy, and user satisfaction","authors":"Matthew Halpern, Yuhao Zhu, V. Reddi","doi":"10.1109/HPCA.2016.7446054","DOIUrl":"https://doi.org/10.1109/HPCA.2016.7446054","url":null,"abstract":"In this paper, we assess the past, present, and future of mobile CPU design. We study how mobile CPU design trends have impacted the end-user, hardware design, and the holistic mobile device. We analyze the evolution of ten cutting-edge mobile CPU designs released over the past seven years. Specifically, we report measured performance, power, energy and user satisfaction trends across mobile CPU generations. A key contribution of our work is that we contextualize the mobile CPU's evolution in terms of user satisfaction, which has largely been absent from prior mobile hardware studies. To bridge the gap between mobile CPU design and user satisfaction, we construct and conduct a novel crowdsourcing study that spans over 25,000 survey participants using the Amazon Mechanical Turk service. Our methodology allows us to identify what mobile CPU design techniques provide the most benefit to the end-user's quality of user experience. Our results quantitatively demonstrate that CPUs play a crucial role in modern mobile system-on-chips (SoCs). Over the last seven years, both single- and multicore performance improvements have contributed to end-user satisfaction by reducing user-critical application response latencies. Mobile CPUs aggressively adopted many power-hungry desktop-oriented design techniques to reach these performance levels. Unlike other smartphone components (e.g. display and radio) whose peak power consumption has decreased over time, the mobile CPU's peak power consumption has steadily increased. 
As the limits of technology scaling restrict the ability of desktop-like scaling to continue for mobile CPUs, specialized accelerators appear to be a promising alternative that can help sustain the power, performance, and energy improvements that mobile computing necessitates. Such a paradigm shift will redefine the role of the CPU within future SoCs, which merit several design considerations based on our findings.","PeriodicalId":417994,"journal":{"name":"2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127306140","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Predicting the memory bandwidth and optimal core allocations for multi-threaded applications on large-scale NUMA machines","authors":"Wei Wang, J. Davidson, M. Soffa","doi":"10.1109/HPCA.2016.7446083","DOIUrl":"https://doi.org/10.1109/HPCA.2016.7446083","url":null,"abstract":"Modern NUMA platforms offer large numbers of cores to boost performance through parallelism and multi-threading. However, because performance scalability is limited by available memory bandwidth, the strategy of allocating all cores can result in degraded performance. Consequently, accurately predicting optimal (best performing) core allocations and executing applications with these allocations are crucial for achieving the best performance. Previous research focused on the prediction of optimal numbers of cores. However, in this paper, we show that, because of the asymmetric NUMA memory configuration and the asymmetric application memory behavior, optimal core allocations are not merely optimal numbers of cores. Additionally, previous studies do not adequately consider NUMA memory resources, which further limits their ability to accurately predict optimal core allocations. In this paper, we present a model, NuCore, which predicts both memory bandwidth usage and optimal core allocations. NuCore considers various memory resources and NUMA asymmetry, and employs Integer Programming to achieve high accuracy and low overhead. Experimental results from real NUMA machines show that the core allocations predicted by NuCore provide 1.27x average speedup over using all cores with only 75.6% of cores allocated. NuCore also provides 1.18x and 1.21x average speedups over two state-of-the-art techniques. 
Our results also show that NuCore faithfully models NUMA memory systems and predicts memory bandwidth usages with only 10% average error.","PeriodicalId":417994,"journal":{"name":"2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122312566","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Amdahl's law for lifetime reliability scaling in heterogeneous multicore processors","authors":"William J. Song, S. Mukhopadhyay, S. Yalamanchili","doi":"10.1109/HPCA.2016.7446097","DOIUrl":"https://doi.org/10.1109/HPCA.2016.7446097","url":null,"abstract":"Heterogeneous multicore processors have been suggested as alternative microarchitectural designs to enhance performance and energy efficiency. Using Amdahl's Law, heterogeneous models were primarily analyzed in terms of performance and energy efficiency to demonstrate their advantage over conventional homogeneous systems. In this paper, we further extend the study to understand the lifetime reliability consequences of heterogeneous multicore processors, as reliability becomes an increasingly important constraint. We present the lifetime reliability models of multicore processors based on Amdahl's Law, including compact thermal estimation that has strong correlation with device aging. Lifetime reliability is analyzed by varying i) core utilization (Amdahl's scaling factor), ii) processor composition (number of big and small cores), and iii) thread scheduling method. The study shows that the heterogeneous processor may have a serious reliability challenge. If the processor is comprised of only one big core and many small cores, stresses can be biased to the big core especially when workloads spend more time on sequential operations. 
Our study reveals that incorporating multiple big cores can mitigate reliability bottleneck in big cores and enhance processor lifetime, but adding too many big cores will have an adverse impact on lifetime reliability as well as performance.","PeriodicalId":417994,"journal":{"name":"2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"111 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117187631","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}