{"title":"Operating systems for many-core systems","authors":"Hendrik Borghorst, O. Spinczyk","doi":"10.1049/pbpc022e_ch3","DOIUrl":"https://doi.org/10.1049/pbpc022e_ch3","url":null,"abstract":"The ongoing trend toward many-core computer systems and adequate new programming models has spawned numerous new activities in the domain of operating system (OS) research during recent years. This chapter will address the challenges and opportunities for OS developers in this new field and give an overview of state-of-the-art research.This section will introduce the reader to the spectrum of contemporary many-core CPU architectures, application programming models for many-core systems, give a brief overview of the resulting challenges for OS developers.","PeriodicalId":254920,"journal":{"name":"Many-Core Computing: Hardware and Software","volume":"183 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121694724","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Decoupling the programming model from resource management in throughput processors","authors":"Nandita Vijaykumar, Kevin Hsieh, Gennady Pekhimenko, S. Khan, Ashish Shrestha, Saugata Ghose, Adwait Jog, Phillip B. Gibbons, O. Mutlu","doi":"10.1049/pbpc022e_ch4","DOIUrl":"https://doi.org/10.1049/pbpc022e_ch4","url":null,"abstract":"This chapter introduces a new resource virtualization framework, Zorua, that decouples the graphics processing unit (GPU) programming model from the management of key on-chip resources in hardware to enhance programming ease, portability, and performance. The application resource specification-a static specification of several parameters such as the number of threads and the scratchpad memory usage per thread block-forms a critical component of the existing GPU programming models. This specification determines the parallelism, and, hence, performance of the application during execution because the corresponding on-chip hardware resources are allocated and managed purely based on this specification. This tight coupling between the software-provided resource specification and resource management in hardware leads to significant challenges in programming ease, portability, and performance, as we demonstrate in this chapter using real data obtained on state-of-the-art GPU systems. Our goal in this work is to reduce the dependence of performance on the software-provided static resource specification to simultaneously alleviate the above challenges. To this end, we introduce Zorua, a new resource virtualization framework, that decouples the programmer-specified resource usage of a GPU application from the actual allocation in the on-chip hardware resources. Zorua enables this decoupling by virtualizing each resource transparently to the programmer. The virtualization provided by Zorua builds on two key concepts-dynamic allocation of the on-chip resources and their oversubscription using a swap space in memory. Zorua provides a holistic GPU resource virtualization strategy designed to (i) adaptively control the extent of oversubscription and (ii) coordinate the dynamic management of multiple on-chip resources to maximize the effectiveness of virtualization.We demonstrate that by providing the illusion of more resources than physically available via controlled and coordinated virtualization, Zorua offers several important benefits: (i) Programming ease. It eases the burden on the programmer to provide code that is tuned to efficiently utilize the physically available on-chip resources. (ii) Portability. It alleviates the necessity of retuning an application's resource usage when porting the application across GPU generations. (iii) Performance. By dynamically allocating resources and carefully oversubscribing them when necessary, Zorua improves or retains the performance of applications that are already highly tuned to best utilize the resources. 
The holistic virtualization provided by Zorua has many other potential uses, e.g., fine-grained resource sharing among multiple kernels, low latency preemption of GPU programs, and support for dynamic parallelism, which we describe in this chapter.","PeriodicalId":254920,"journal":{"name":"Many-Core Computing: Hardware and Software","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127764605","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
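The "static resource specification" the abstract refers to is visible in any CUDA kernel launch: the threads per block and the per-block scratchpad (shared) memory are fixed at launch time, and the hardware derives how many thread blocks can be resident purely from those numbers. The sketch below illustrates that coupling; it is not code from the chapter, and the kernel, tile size and occupancy query are illustrative assumptions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int TILE = 256;   // threads per block, chosen statically by the programmer

// Illustrative kernel: each block stages TILE elements in scratchpad (shared)
// memory before scaling them.
__global__ void scale(const float* in, float* out, int n, float factor) {
    extern __shared__ float tile[];          // scratchpad sized at launch time
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = in[i];
    __syncthreads();
    if (i < n) out[i] = tile[threadIdx.x] * factor;
}

int main() {
    int n = 1 << 20;
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    // The launch configuration is the static resource specification: block size
    // and per-block scratchpad bytes are fixed here, and on-chip resources are
    // allocated purely from these values.
    scale<<<(n + TILE - 1) / TILE, TILE, TILE * sizeof(float)>>>(in, out, n, 2.0f);
    cudaDeviceSynchronize();

    // Occupancy achievable under this specification; a value tuned for one GPU
    // generation may underutilize the next.
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, scale, TILE,
                                                  TILE * sizeof(float));
    printf("resident blocks per SM for this specification: %d\n", blocksPerSM);

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Once these numbers are baked into the binary, the hardware cannot adapt them; Zorua's virtualization is aimed at removing exactly this constraint.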
{"title":"From irregular heterogeneous software to reconfigurable hardware","authors":"John Wickerson, G. Constantinides","doi":"10.1049/pbpc022e_ch2","DOIUrl":"https://doi.org/10.1049/pbpc022e_ch2","url":null,"abstract":"A heterogeneous system is the one that incorporates more than one kind of computing device. Such a system can offer better performance per Watt than a homogeneous one if the applications it runs are programmed to take advantage of the different strengths of the different devices in the system. A typical heterogeneous setup involves a master processor (the `host' CPU) offloading some easily parallelised computations to a graphics processing unit (GPU) or to a custom accelerator implemented on a field-programmable gate array (FPGA).This arrangement can benefit performance because it exploits the massively parallel natures of GPU and FPGA architectures.","PeriodicalId":254920,"journal":{"name":"Many-Core Computing: Hardware and Software","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132000602","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Developing portable embedded software for multicore systems through formal abstraction and refinement","authors":"Asieh Salehi Fathabadi, Mohammadsadegh Dalvandi, M. Butler","doi":"10.1049/PBPC022E_CH14","DOIUrl":"https://doi.org/10.1049/PBPC022E_CH14","url":null,"abstract":"Run-time management (RTM) systems are used in embedded systems to dynamically adapt hardware performance to minimise energy consumption. An RTM system implementation is coupled with the hardware platform specifications and is implemented individually for each specific platform. A significant challenge is that RTM software can require laborious manual adjustment across different hardware platforms due to the diversity of architecture characteristics. Hardware specifications vary from one platform to another and include a number of characteristic such as the number of supported voltage and frequency (VF) settings. Formal modelling offers the potential to simplify the management of platform diversity by shifting the focus away from handwritten platform-specific code to platform-independent models from which platform-specific implementations are automatically generated. The article presents an overview of the motivations for this work. It goes on to overview the RTM architecture and requirements and introduce the Event-B formal method and its tool support. The article then describes the Event-B model of two different RTMs and presents the portability support provided by formal modelling and code generation. Finalyy, it reviews the verification and experimental results.","PeriodicalId":254920,"journal":{"name":"Many-Core Computing: Hardware and Software","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117315290","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modelling many-core architectures","authors":"Guihai Yan, Jiajun Li, L. Xiaowei","doi":"10.1049/pbpc022e_ch12","DOIUrl":"https://doi.org/10.1049/pbpc022e_ch12","url":null,"abstract":"Architectural modelling has two primary objectives: (1) navigating the design space exploration, i.e. guiding the architects to arrival at better design choices, and (2) facilitating dynamic management, i.e. providing the functional relationships between workloads'characteristics and architectural configurations to enable appropriate runtime hardware/software adaptations. In the past years, many-core architectures, as a typical computing fabric evolving from the monolithic single-/multicore architectures, have been shown to be scalable to uphold the staggering the Moore's Law. The many-core architectures enable two orthogonal approaches, scale-up and scale-out, to utilize the growing budget of transistors. Understanding the rationale behind these approaches is critical to make more efficient use of the powerful computing fabric.","PeriodicalId":254920,"journal":{"name":"Many-Core Computing: Hardware and Software","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126393657","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Advances in power management of many-core processors","authors":"Andrea Bartolini, D. Rossi","doi":"10.1049/pbpc022e_ch8","DOIUrl":"https://doi.org/10.1049/pbpc022e_ch8","url":null,"abstract":"","PeriodicalId":254920,"journal":{"name":"Many-Core Computing: Hardware and Software","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129805102","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Biologically-inspired massively-parallel computing","authors":"S. Furber","doi":"10.1049/pbpc022e_ch22","DOIUrl":"https://doi.org/10.1049/pbpc022e_ch22","url":null,"abstract":"Half a century of progress in computer technology has delivered machines of formidable capability and an expectation that similar advances will continue into the foreseeable future. However, much of the past progress has been driven by developments in semiconductor technology following Moore's Law, and there are strong grounds for believing that these cannot continue at the same rate. This, and related issues, suggest that there are huge challenges ahead in meeting the expectations of future progress, such as understanding how to exploit massive parallelism and how to deliver improvements in energy efficiency and reliability in the face of diminishing component reliability. Alongside these issues, recent advances in machine learning have created a demand for machines with cognitive capabilities, for example, to control autonomous vehicles, that we will struggle to deliver. Biological systems have, through evolution, found solutions to many of these problems, but we lack a fundamental understanding of how these solutions function. If we could advance our understanding of biological systems, we would open a rich source of ideas for unblocking progress in our engineered systems. An overview is given of SpiNNaker - a spiking neural network architecture. The SpiNNaker machine puts these principles together in the form of a massively parallel computer architecture designed both to model the biological brain, in order to accelerate our understanding of its principles of operation, and also to explore engineering applications of such machines.","PeriodicalId":254920,"journal":{"name":"Many-Core Computing: Hardware and Software","volume":"108 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117192633","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Silicon photonics enabled rack-scale many-core systems","authors":"Peng Yang, Zhehui Wang, Zhifei Wang, Xuanqi Chen, Luan H. K. Duong, Jiang Xu","doi":"10.1049/pbpc022e_ch18","DOIUrl":"https://doi.org/10.1049/pbpc022e_ch18","url":null,"abstract":"The increasingly higher demands on computing power from scientific computations, big data processing and deep learning are pushing the emergence of exascale computing systems. Tens of thousands of or even more manycore nodes are connected to build such systems. It imposes huge performance and power challenges on different aspects of the systems. As a basic block in high-performance computing systems, modularized rack will play a significant role in addressing these challenges. In this chapter, we introduce rack-scale optical networks (RSON), a silicon photonics enabled inter/intra-chip network for rack-scale many-core systems. RSON leverages the fact that most traffic is within rack and the high bandwidth and low-latency rack-scale optical network can improve both performance and energy efficiency. We codesign the intra-chip and inter-chip optical networks together with optical internode interface to provide balanced data access to both local memory and remote note's memory, making the nodes within rack cooperate effectively. The evaluations show that RSON can improve the overall performance and energy efficiency dramatically. Specifically, RSON can deliver as much as 5.4x more performance under the same energy consumption compared to traditional InfiniBand connected rack.","PeriodicalId":254920,"journal":{"name":"Many-Core Computing: Hardware and Software","volume":"146 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123740869","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Tools and workloads for many-core computing","authors":"A. Singh, P. Dziurzański, G. Merrett, B. Al-Hashimi","doi":"10.1049/PBPC022E_CH5","DOIUrl":"https://doi.org/10.1049/PBPC022E_CH5","url":null,"abstract":"Proper tools and workloads are required to evaluate any computing systems. This enables designers to fulfill the desired properties expected by the end-users. It can be observed that multi/many-core chips are omnipresent from small-to-large-scale systems, such as mobile phones and data centers. The reliance on multi/many-core chips is increasing as they provide high-processing capability to meet the increasing performance requirements of complex applications in various application domains. The high-processing capability is achieved by employing parallel processing on the cores where the application needs to be partitioned into a number of tasks or threads and they need to be efficiently allocated onto different cores. The applications considered for evaluations represent workloads and toolchains required to facilitate the whole evaluation are referred to as tools. The tools facilitate realization of different actions (e.g., thread-to-core mapping and voltage/frequency control, which are governed by OS scheduler and power governor, respectively) and their effect on different performance monitoring counters leading to a change in the performance metrics (e.g., energy consumption and execution time) concerned by the end-users.","PeriodicalId":254920,"journal":{"name":"Many-Core Computing: Hardware and Software","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124874592","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}