{"title":"Toward a Self-Aware Codelet Execution Model","authors":"Stéphane Zuckerman, A. Landwehr, Kelly Livingston, G. Gao","doi":"10.1109/DFM.2014.12","DOIUrl":"https://doi.org/10.1109/DFM.2014.12","url":null,"abstract":"Future extreme-scale supercomputers will feature arrays of general-purpose and specialized many-core processors, totaling thousands of cores on a single chip. In general, many-core chips will most likely resemble a \"hierarchical and distributed system on chip.\" It is expected that such systems will be hard to exploit not only for performance, but will also need to deal with reliability issues, as well as power and energy issues. The Codelet Model is a fine-grain dataflow-inspired and event-driven program execution model which was designed to run parallel programs on a combination of such many-core chips into a supercomputer. Meanwhile, some on-going work is attempting to take into account user goals as well as resource usage and make the system \"self-aware:\" By using introspective means, this kind of research tries to have the system software modify the state of the overall system at run-time to satisfy the user goals. It is very likely that future extreme-scale systems will be in constant demand of different kinds of resources, may they be processing elements (general purpose or otherwise), bandwidth, power budget, etc. This paper takes the position that a potential solution to solve the resource management issue at this scale is a hierarchical and distributed self-aware system leveraging the fine-grain event-driven codelet threading model.","PeriodicalId":183526,"journal":{"name":"2014 Fourth Workshop on Data-Flow Execution Models for Extreme Scale Computing","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122264412","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Holistic Dataflow-Inspired System Design","authors":"Stéphane Zuckerman, Haitao Wei, G. Gao, H. Wong, J. Gaudiot, A. Louri","doi":"10.1109/DFM.2014.16","DOIUrl":"https://doi.org/10.1109/DFM.2014.16","url":null,"abstract":"Computer systems have undergone a fundamental transformation recently, from single-core processors to devices with increasingly higher core counts within a single chip. The semi-conductor industry now faces the infamous power and utilization walls. To meet these challenges, heterogeneity in design, both at the architecture and technology levels, will be the prevailing approach for energy efficient computing as specialized cores, accelerators, etc., can eliminate the energy overheads of general-purpose homogeneous cores. However, with future technological challenges pointing in the direction of on-chip heterogeneity, and because of the traditional difficulty of parallel programming, it becomes imperative to produce new system software stacks that can take advantage of the heterogeneous hardware. As a case in point, the core count per chip continues to increase dramatically while the available on-chip memory per core is only getting marginally bigger. Thus, data locality, already a must-have in high-performance computing, will become even more critical as memory technology progresses. In turn, this makes it crucial that new execution models be developed to better exploit the trends of future heterogeneous computing in many-core chips. To solve these issues, we propose a cross-cutting cross-layer approach to address the challenges posed by future heterogeneous many-core chips.","PeriodicalId":183526,"journal":{"name":"2014 Fourth Workshop on Data-Flow Execution Models for Extreme Scale Computing","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122309655","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Limits of Statically-Scheduled Token Dataflow Processing","authors":"Nachiket Kapre, Siddhartha","doi":"10.1109/DFM.2014.21","DOIUrl":"https://doi.org/10.1109/DFM.2014.21","url":null,"abstract":"FPGA-based token dataflow processing has been shown to accelerate hard-to-parallelize problems exhibiting irregular dataflow parallelism by as much as an order of magnitude when compared to conventional compute organizations. However, when the structure of the dataflow computation is known upfront, either at compile time or at the start of execution, we can employ static scheduling techniques to further improve performance and enhance compute density of the dataflow hardware. In this paper, we identify the costs and performance trends of both static and dynamic scheduling approaches when considering hardware acceleration of SPICE device equations and Sparse LU factorization in circuit graphs. While the experiments are limited to a case study, the hardware design and dataflow compiler are general and can be extended to other problems and instances where dataflow computing may be applicable. With this study, we hope to develop a quantitative basis for the design of a hybrid dataflow architecture that combines both static and dynamic scheduling techniques. We observe a performance benefit of 2 - 4× and a resource utilization saving of 2 - 3× in favor of statically scheduled hardware.","PeriodicalId":183526,"journal":{"name":"2014 Fourth Workshop on Data-Flow Execution Models for Extreme Scale Computing","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126320308","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Language Features for Scalable Distributed-Memory Dataflow Computing","authors":"J. Wozniak, M. Wilde, Ian T Foster","doi":"10.1109/DFM.2014.17","DOIUrl":"https://doi.org/10.1109/DFM.2014.17","url":null,"abstract":"Dataflow languages offer a natural means to express concurrency but are not a natural representation of the architectural features of high-performance, distributed-memory computers. When used as the outermost language in a hierarchical programming model, dataflow is very effective at expressing the overall flow of a computation. In this work, we present strategies and techniques used by the Swift dataflow language to obtain good performance on extremely large computing systems. We also present multiple unique language features that offer practical utility and performance enhancements.","PeriodicalId":183526,"journal":{"name":"2014 Fourth Workshop on Data-Flow Execution Models for Extreme Scale Computing","volume":"114 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116233811","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Comparing the StreamIt and SC Languages for Manycore Processors","authors":"XuanKhanh Do, Stéphane Louise, Albert Cohen","doi":"10.1109/DFM.2014.13","DOIUrl":"https://doi.org/10.1109/DFM.2014.13","url":null,"abstract":"Embedded many-core systems offering thousands of cores should be available in the near future. Stream programming is a particular instance of data-flow programming where computations are expressed as the data-driven execution of repetitive \"filters\" on data streams. Stream programming fits these manycore systems' requirements in terms of parallelism, functional determinism, and local data reuse. Statically or semi-dynamically scheduled stream languages like e.g. StreamIt and ?C can generate very efficient parallel code, but have strict limitations with respect to the expression of dynamic computational tasks, context-dependent modes of operation, and dynamic memory management. This paper compares two state-of-the-art stream languages, StreamIt and ?C, with the aim of better understanding their strengths and weaknesses, and finding a way to improve them. We also propose an automatic conversion method and tool to transform between these two languages. This tool allows to port and evaluate the classical StreamIt benchmarks on Kalray's MPPA, a real-world many-core processor representative of tomorrow's embedded many-core chips. We conclude with propositions for the evolution of stream-programming models.","PeriodicalId":183526,"journal":{"name":"2014 Fourth Workshop on Data-Flow Execution Models for Extreme Scale Computing","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133745242","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On the Feasibility of a Codelet Based Multi-core Operating System","authors":"J. Dennis, G. Gao","doi":"10.1109/DFM.2014.18","DOIUrl":"https://doi.org/10.1109/DFM.2014.18","url":null,"abstract":"We believe it is feasible to build a multi-core operating system that implements virtual memory, and honors the principles of modular software construction, using runtime software that executes a codelet program execution model. Performance and energy efficiency can be enhanced through co-design of new architecture features that replace resource management functions of runtime software with efficient hardware mechanisms. The resulting systems will offer benefits in programmability, application portability and reuse absent in current systems.","PeriodicalId":183526,"journal":{"name":"2014 Fourth Workshop on Data-Flow Execution Models for Extreme Scale Computing","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129526984","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DFGR an Intermediate Graph Representation for Macro-Dataflow Programs","authors":"A. Sbîrlea, L. Pouchet, Vivek Sarkar","doi":"10.1109/DFM.2014.9","DOIUrl":"https://doi.org/10.1109/DFM.2014.9","url":null,"abstract":"In this paper we propose a new intermediate graph representation for macro-dataflow programs, DFGR, which is capable of offering a high-level view of applications for easy programmability, while allowing the expression of complex applications using dataflow principles. DFGR makes it possible to write applications in a manner that is oblivious of the underlying parallel runtime, and can easily be targeted by both programming systems and domain experts. In addition, DFGR can use further optimizations in the form of graph transformations, enabling the coupling of static and dynamic scheduling and efficient task composition and assignment, for improved scalability and locality. We show preliminary performance results for an implementation of DFGR on a shared memory runtim system, offering speedups of up to 11× on 12 cores, for complex graphs.","PeriodicalId":183526,"journal":{"name":"2014 Fourth Workshop on Data-Flow Execution Models for Extreme Scale Computing","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126631224","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hierarchically Tiled Array as a High-Level Abstraction for Codelets","authors":"Chih-Chieh Yang, J. C. Pichel, Adam R. Smith, D. Padua","doi":"10.1109/DFM.2014.11","DOIUrl":"https://doi.org/10.1109/DFM.2014.11","url":null,"abstract":"The move from terascale to exascale systems is challenging in terms of energy and power consumption, resilience, storage, concurrency, and parallelism. These challenges require new fine-grain execution models to support the concurrent execution of millions or even billions of threads on the exascale machines. The most promising approaches are those based on the codelet execution model, which provide a flexible programming interface that allows the expression of all kinds of parallelism with fine-tuning opportunities. We propose using Hierarchically Tiled Array (HTA) as a high-level abstraction for codelets to improve the programmability and readability of programs while preserving the good performance and scalability provided by the codelet execution model.","PeriodicalId":183526,"journal":{"name":"2014 Fourth Workshop on Data-Flow Execution Models for Extreme Scale Computing","volume":"22 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116394596","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Asynchronous Task Scheduling of the Fast Multipole Method Using Various Runtime Systems","authors":"Bo Zhang","doi":"10.1109/DFM.2014.14","DOIUrl":"https://doi.org/10.1109/DFM.2014.14","url":null,"abstract":"In this paper, we explore data-driven execution of the adaptive fast multipole method by asynchronously scheduling available computational tasks using Cilk, C++11 standard thread and future libraries, the High Performance ParalleX (HPX-5) library, and OpenMP tasks. By comparing these implementations using various input data sets, this paper examines the runtime system's capability to spawn new task, the capacity of the tasks that can be managed, the performance impact between eager and lazy thread creation for new task, and the effectiveness of the task scheduler and its ability to recognize the critical path of the underlying algorithm.","PeriodicalId":183526,"journal":{"name":"2014 Fourth Workshop on Data-Flow Execution Models for Extreme Scale Computing","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114881531","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Clockless Computing System Based on the Static Dataflow Paradigm","authors":"L. Verdoscia, R. Vaccaro, R. Giorgi","doi":"10.1109/DFM.2014.10","DOIUrl":"https://doi.org/10.1109/DFM.2014.10","url":null,"abstract":"The ambitious challenges posed by next exascale computing systems may require a critical re-examination of both architecture design and consolidated wisdom in terms of programming style and execution model, because such systems are expected to be constituted by thousands of processors with thousands of cores per chip. But how to build exascale architectures remains an open question.This paper presents a novel computing system based on a configurable architecture and a static dataflow execution model. We assume that the basic computational unit is constituted by a dataflow graph. Each processing node is constituted by an ad hoc kernel processor - designed to manage and schedule dataflow graphs, and a manycore dataflow execution engine - designed to execute such dataflow graphs.The main components of the dataflow execution engine are the Dataflow Actor Cores (DACs), which are small, identical and configurable. The major contributions of this paper are: i) the introduction of a machine language (named D#) which represents the low-level static configuration information of the system; ii) the introduction of a self-scheduled clockless mechanism to start operations on the presence of validity tokens only; iii) a design that avoids the need of temporary storage for tokens on the links of the DACs.Our preliminary tests on FPGA-based hardware show the feasibility of this approach.","PeriodicalId":183526,"journal":{"name":"2014 Fourth Workshop on Data-Flow Execution Models for Extreme Scale Computing","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130611967","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}