{"title":"Hardware Threading Techniques for Multi-Threaded MPSoCs","authors":"D. Watson, A. Ahmadinia, G. Morison, T. Buggy","doi":"10.1145/2613908.2613917","DOIUrl":"https://doi.org/10.1145/2613908.2613917","url":null,"abstract":"Adapting software applications to embedded Multiprocessor System on Chips (MPSoCs) typically follows multithreaded design flows. To take advantage of the hardware customisations possible with MPSoCs, HardWare Threads (HWTs) can be used to increase application concurrency and throughput by forking between software and hardware execution. This paper describes how an application can be tailored to use HWTs. Using an application's Task Flow Graph and Kahn Process Networks to model software interactions with HWTs, two scheduling techniques for HWT interaction with software are presented and analysed. The scheduling techniques are evaluated based on system performance and resource consumption with a popular image processing algorithm, where performance increases of up to 3.6x were measured compared to standard implementations.","PeriodicalId":84860,"journal":{"name":"Histoire & mesure","volume":"13 1","pages":"56-59"},"PeriodicalIF":0.0,"publicationDate":"2014-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73934605","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring Spiking Neural Network on Coarse-Grain Reconfigurable Architectures","authors":"Hassan Anwar, Syed M. A. H. Jafri, Sergei Dytckov, M. Daneshtalab, M. Ebrahimi, A. Hemani, J. Plosila, G. Beltrame, H. Tenhunen","doi":"10.1145/2613908.2613916","DOIUrl":"https://doi.org/10.1145/2613908.2613916","url":null,"abstract":"Today, reconfigurable architectures are becoming increasingly popular as the candidate platforms for neural networks. Existing works, that map neural networks on reconfigurable architectures, only address either FPGAs or Networks-on-chip, without any reference to the Coarse-Grain Reconfigurable Architectures (CGRAs). In this paper we investigate the overheads imposed by implementing spiking neural networks on a Coarse Grained Reconfigurable Architecture (CGRAs). Experimental results (using point to point connectivity) reveal that up to 1000 neurons can be connected, with an average response time of 4.4 msec.","PeriodicalId":84860,"journal":{"name":"Histoire & mesure","volume":"10 1","pages":"64-67"},"PeriodicalIF":0.0,"publicationDate":"2014-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90279397","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Extending dataflow programs with throughput properties","authors":"Manuel Selva, L. Morel, K. Marquet, S. Frénot","doi":"10.1145/2489068.2489077","DOIUrl":"https://doi.org/10.1145/2489068.2489077","url":null,"abstract":"In the context of multi-core processors and the trend toward many-core, dataflow programming can be used as a solution to the parallelization problem. By decoupling computation from communication, this paradigm naturally exposes parallelism in several ways. In this work we propose language extensions for expressing throughput properties over dataflow programs together with a run-time mechanism for the observation of events meaningful to compute the effective throughput. We show the limited impact of such mechanisms on the application overall performances. We also review existing run-time adaptation mechanisms that may be used in a dataflow context to satisfy throughput requirements.","PeriodicalId":84860,"journal":{"name":"Histoire & mesure","volume":"144 1","pages":"54-57"},"PeriodicalIF":0.0,"publicationDate":"2013-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86400444","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Directory based cache coherence verification logic in CMPs cache system","authors":"M. Dalui, K. Gupta, B. Sikdar","doi":"10.1145/2489068.2489073","DOIUrl":"https://doi.org/10.1145/2489068.2489073","url":null,"abstract":"This work reports a high speed protocol verificaion logic for Chip Multiprocessors (CMPs) realizing directory based cache coherence system. A special class of cellular automata (CA) referred to as single length cycle 2-attractor CA (TACA), has been introduced to identify the inconsistencies in cache line states of processors private caches. The introduction of CA segmentation logic ensures a better efficiency in the design by reducing the number of computation steps of the verification logic by a factor of the number of segments. The cache coherence verification for a system with limited directory has also been addressed. The TACA keeps trace of the coherence status of the CMPs' cache system and memorizes any inconsistent recording done during the processors' reference. Theory has been developed to realize quick decision on the cache coherency.","PeriodicalId":84860,"journal":{"name":"Histoire & mesure","volume":"46 1","pages":"33-40"},"PeriodicalIF":0.0,"publicationDate":"2013-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88901450","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance analysis of multi-threaded multi-core CPUs","authors":"Vijayalakshmi Saravanan, Kaushik S, S. Krishna, P. Iit, Guwahati India, D. Kothari","doi":"10.1145/2489068.2489076","DOIUrl":"https://doi.org/10.1145/2489068.2489076","url":null,"abstract":"Processors are constantly changing and becoming more advanced. They incorporate new concepts and ideas into the architecture with each evolution. One such concept is multi-threading. It aims at increasing the processors performance by reducing its idle time. It is the ability of the processor to execute multiple threads simultaneously on different cores present inside. Multi-threading concepts have also been incorporated in embedded systems which employ either a single-core or multi-core architecture. The aim of this study is to evaluate how effectively multi-threading improves processor utilization on multiple cores by taking both single and dual core processors and evaluating the performance of each by comparing the number of instructions executed per second. The results of this study give an edge to multi-threading in a single-core processor when compared to a dual-core processor when performance aspects are considered. Our analysis helps us to design the processor architecture in such a way that we utilize both the concepts of multi-threading and multi-core architecture to achieve maximum performance. The results of Simultaneous Multi-threading (SMT) performance improvement is encouraging when compared with dual-core processors.","PeriodicalId":84860,"journal":{"name":"Histoire & mesure","volume":"72 1","pages":"49-53"},"PeriodicalIF":0.0,"publicationDate":"2013-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84024058","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Co-tuning of a hybrid electronic-optical network for reducing energy consumption in embedded CMPs","authors":"S. Bartolini, P. Grani","doi":"10.1145/2489068.2489070","DOIUrl":"https://doi.org/10.1145/2489068.2489070","url":null,"abstract":"Nanophotonic is a promising solution for on-chip interconnection due to its intrinsic low-latency and especially low-power features, desirable especially in future chip multiprocessors (CMPs) for rich client devices. In this paper we address the co-design of the parameters of a hybrid on-chip network featuring a traditional 2D mesh and a simple photonic helper ring aimed to improve performance and reduce energy consumption. As all the CMP traffic cannot be sustained in the considered simple optical interconnection without saturating the available bandwidth, and thus inducing performance and energy degradations, we identify the subset of coherency messages that are most worth to be accelerated through the low-energy optical path.\u0000 We investigate the management/arbitration strategies for the physically shared photonic path as they are crucial for reaching an effective exploitation of optical bandwidth according to their overhead and parallelism achieved in message transmission. Our results on multithreaded benchmarks, highlight that a careful selection of the most latency-critical messages to be routed on the photonic-path along with a Multiple-Writers-Single-Reader access scheme allows execution time and energy improvements up to 19% and 5%, respectively, for the 8-core setup and up to 16% and 13% for the 16-core configuration.\u0000 Furthermore, we show that the most aggressive ring access schemes allow the adoption of a four times slower electronic NoC that trades the achieved average speedup margin to obtain 70% overall energy savings, which is extremely important in energy constrained devices.","PeriodicalId":84860,"journal":{"name":"Histoire & mesure","volume":"10 1","pages":"9-16"},"PeriodicalIF":0.0,"publicationDate":"2013-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89316950","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proposing a new task model towards many-core architecture","authors":"A. Shimada, Balazs Gerofi, A. Hori, Y. Ishikawa","doi":"10.1145/2489068.2489075","DOIUrl":"https://doi.org/10.1145/2489068.2489075","url":null,"abstract":"Many-core processors are gathering attention in the areas of embedded systems due to their power-performance ratios. To utilize cores of a many-core processor in parallel, programmers build multi-task applications that use the task models provided by operating systems. However, the conventional task models cause some scalability problems when multi-task applications are executed on many-core processors. In this paper, a new task model named Partitioned Virtual Address Space (PVAS), which solves the problems, is proposed. PVAS enhances inter-task communications of multi-task applications and averts serialization of concurrent virtual memory operations. Preliminary evaluations by using micro benchmarks showed that PVAS has the potential to promote the performance of multi-task applications that run on many-core processors.","PeriodicalId":84860,"journal":{"name":"Histoire & mesure","volume":"99 1","pages":"45-48"},"PeriodicalIF":0.0,"publicationDate":"2013-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78003885","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Transparent and energy-efficient speculation on NUMA architectures for embedded MPSoCs","authors":"Dimitra Papagiannopoulou, R. I. Bahar, T. Moreshet, M. Herlihy, A. Marongiu, L. Benini","doi":"10.1145/2489068.2489078","DOIUrl":"https://doi.org/10.1145/2489068.2489078","url":null,"abstract":"High-end embedded systems such as smart phones, game consoles, GPS-enabled automotive systems, and home entertainment centers, are becoming ubiquitous. Like their general-purpose counterparts, and for many of the same energy-related reasons, embedded systems are turning to multicore architectures. Moreover, as the demand for more compute-intensive capabilities for embedded systems increases, these multicore architectures will evolve into many-core systems for improved performance or performance/area/Watt. These systems are often organized as cluster based Non-Uniform Memory Access (NUMA) architectures that provide the programmer with a shared-memory abstraction, with the cost of sharing memory (in terms of performance, energy, and complexity) varying substantially depending on the locations of the communicating processes. This paper investigates one of the principal challenges presented by these emerging NUMA architectures for embedded systems: providing efficient, energy-effective and convenient mechanisms for synchronization and communication. In this paper, we propose an initial solution based on hardware support for speculative synchronization.","PeriodicalId":84860,"journal":{"name":"Histoire & mesure","volume":"10 1","pages":"58-61"},"PeriodicalIF":0.0,"publicationDate":"2013-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90008809","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A code generation method for system-level synthesis on ASIC, FPGA and manycore CGRA","authors":"Shuo Li, Jamshaid Sarwar Malik, Shaoteng Liu, A. Hemani","doi":"10.1145/2489068.2489072","DOIUrl":"https://doi.org/10.1145/2489068.2489072","url":null,"abstract":"This paper presents a code generation method that translates an intermediate Register-Transfer Level (RTL) model of a system into its corresponding VHDL code for ASIC and FPGAs and MATLAB functions for manycores CGRAs. The intermediate representation consists of Function Implementation (FIMPs) and the glue logic. FIMPs are VHDL design units for the ASIC and FPGA implementation styles and MATLAB function templates for the CGRA implementation style, while the glue logic is a compact data structure storing Global Interconnect and Control (GLIC) information.\u0000 The automatically generated implementation codes increase the resource usage by 1.5% on the average while reducing total design effort by two orders of magnitudes.","PeriodicalId":84860,"journal":{"name":"Histoire & mesure","volume":"62 1","pages":"25-32"},"PeriodicalIF":0.0,"publicationDate":"2013-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80557884","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving the programmability of STHORM-based heterogeneous systems with offload-enabled OpenMP","authors":"A. Marongiu, Alessandro Capotondi, Giuseppe Tagliavini, L. Benini","doi":"10.1145/2489068.2489069","DOIUrl":"https://doi.org/10.1145/2489068.2489069","url":null,"abstract":"Heterogeneous architectures based on one fast-clocked, moderately multicore \"host\" processor plus a many-core accelerator represent one promising way to satisfy the ever-increasing GOps/W requirements of embedded systems-on-chip. However, heterogeneous computing comes at the cost of increased programming complexity, requiring major rewrite of the applications with low-level programming style (e.g, OpenCL). In this paper we present a programming model, compiler and runtime system for a prototype board from STMicroelectronics featuring a ARM9 host and a STHORM many-core accelerator. The programming model is based on OpenMP, with additional directives to efficiently program the accelerator from a single host program. The proposed multi-ISA compilation toolchain hides all the process of outlining an accelerator program, compiling and loading it to the STHORM platform and implementing data sharing between the host and the accelerator. Our experimental results show that we achieve very close performance to hand-optimized OpenCL codes, at a significantly lower programming complexity.","PeriodicalId":84860,"journal":{"name":"Histoire & mesure","volume":"175 1","pages":"1-8"},"PeriodicalIF":0.0,"publicationDate":"2013-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79703560","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}