S. Agathos, V. Dimakopoulos, Aggelos Mourelis, Alexandros Papadogiannakis
{"title":"Deploying OpenMP on an embedded multicore accelerator","authors":"S. Agathos, V. Dimakopoulos, Aggelos Mourelis, Alexandros Papadogiannakis","doi":"10.1109/SAMOS.2013.6621121","DOIUrl":"https://doi.org/10.1109/SAMOS.2013.6621121","url":null,"abstract":"Multiprocessor systems-on-chip (MPSoC) are now considered first-class citizens both in the embedded systems and in the high-performance computing arenas, in the form of specialized or general-purpose accelerators. Programming models for such systems is currently a hot research topic, and as a general rule require deep programmer knowledge of the underlying hardware architecture. In this paper we present the implementation of OpenMP, one of the most intuitive and productive programming models, on the STHORM accelerator. This particular platform provides a shared-memory substrate which OpenMP requires. An innovative feature of our design is the deployment of the OpenMP model both at the host and the fabric sides, in a seamless way, which provides the programmer with a simple but effective interface for offloading and executing OpenMP kernels on the MPSoC. The optimized runtime environment provides full OpenMP support despite its small footprint (less than 10KB for a 16-core cluster) and can sustain close-to-ideal speedups in computationally intensive applications. We detail on design issues we faced along with their solutions, given the limited available resources.","PeriodicalId":382307,"journal":{"name":"2013 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121666860","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deepak Revanna, Omer Anjum, Manuele Cucchi, Roberto Airoldi, J. Nurmi
{"title":"A scalable FFT processor architecture for OFDM based communication systems","authors":"Deepak Revanna, Omer Anjum, Manuele Cucchi, Roberto Airoldi, J. Nurmi","doi":"10.1109/SAMOS.2013.6621101","DOIUrl":"https://doi.org/10.1109/SAMOS.2013.6621101","url":null,"abstract":"The modern wireless standards predominantly are based on OFDM communication systems. Various mobile devices in recent times support multiple wireless standards and demand efficient transceiver. Hence, in a communication transceiver the baseband hardware needs to be scalable and efficient across multiple standards. In an OFDM based transceiver, FFT computation is one of the most computationally intensive and power hungry modules. Design of FFT hardware is a challenging task while balancing design parameters such as speed, power, area, flexibility and scalability. The research work in this paper proposes a scalable radix-2 N-point novel FFT processor architecture. The architecture design is based on an approach to balance various specified design parameters to meet the requirements of SDR platforms supporting multiple wireless standards. The FFT processor was designed and prototyped using VHDL on an Altera Stratix V FPGA device 5SGSMD5K2F40C2. The processor operates at a maximum frequency of 200MHz, uses less than 1% of FPGA device resources and meets the performance requirements of multiple wireless standards such as IEEE 802.11a/g, IEEE 802.16e, 3GPP-LTE, DAB and DVB. The proposed architecture outperforms the existing fixed and variable length FFT processors in terms of speed, flexibility and scalability.","PeriodicalId":382307,"journal":{"name":"2013 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS)","volume":"255 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130783397","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast transaction-level dynamic power consumption modelling in priority preemptive wormhole switching networks on chip","authors":"J. Harbin, L. Indrusiak","doi":"10.1109/SAMOS.2013.6621120","DOIUrl":"https://doi.org/10.1109/SAMOS.2013.6621120","url":null,"abstract":"This paper specifies an architecture for power consumption modelling integrated within cycle-approximate transaction level modelling for network-on-chip (NoC) simulation. NoC simulations during design validation have traditionally been limited to very short durations, due to the necessity to perform cycle-accurate simulation to represent fully the low level system simulated. Due to the high proportion of overall system power that may be consumed by a busy NoC, high-fidelity NoC power modelling is especially important to accurately assess the effectiveness of link coding and other strategies to reduce NoC power consumption. The paper describes the extension of a cycle-approximate TLM methodology to encompass power modelling in NoCs, considering its operation with real application traffic. The proposed scheme avoids modelling of flit-by-flit progress during non-preemptive periods of packet transmission. The simulation performance and accuracy are contrasted with theoretical models and a flit-by-flit scheme (in which each flow control digit passing along a bus wire is simulated). The power consumption reduction delivered by encoding schemes such as bus-invert coding are considered and compared with analytical models to verify the correct performance of the simulation models.","PeriodicalId":382307,"journal":{"name":"2013 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131574549","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
D. Theodoropoulos, Polyvios Pratikakis, D. Pnevmatikatos
{"title":"Efficient runtime support for embedded MPSoCs","authors":"D. Theodoropoulos, Polyvios Pratikakis, D. Pnevmatikatos","doi":"10.1109/SAMOS.2013.6621119","DOIUrl":"https://doi.org/10.1109/SAMOS.2013.6621119","url":null,"abstract":"Recently, many software runtime systems have been proposed that allow developers to efficiently map applications to contemporary consumer electronic devices and high-performance academic processing platforms. Most of these runtime systems employ advanced scheduling techniques for automatic task assignment to all available processing elements. However, they focus on a particular environment and architecture, and it is not easy to port them to reconfigurable embedded MPSoCs. As a consequence, in the embedded community, researchers implement hardwired application-specific task schedulers, which can not be used by other embedded MPSoCs. To address this problem, in this paper we propose a lightweight runtime software framework for reconfigurable shared-memory MPSoCs, that integrate a master embedded processor connected to slave cores. Similarly to many of the aforementioned advanced runtime systems, we adopt a task-based programming model that uses simple, pragma-based annotations of the application software, in order to dynamically resolve task dependencies. Our runtime system supports heterogeneity in the hardware resources, and is also low-overhead to account for possible limitations in their processing capabilities and available on-chip memory. To evaluate our proposal, we have prototyped an MPSoC with seven slaves to a Xilinx ML605 FPGA board. We run three micro-benchmarks that achieve a performance speedup of 3.8x, 7x and 5.8x, and energy consumption of 27%, 14% and 18% respectively, compared to a single-core baseline system with no runtime support.","PeriodicalId":382307,"journal":{"name":"2013 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS)","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130983053","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
R. Ferreira, Vinicius Duarte, Waldir Meireles, M. Pereira, L. Carro, Stephan Wong
{"title":"A just-in-time modulo scheduling for virtual coarse-grained reconfigurable architectures","authors":"R. Ferreira, Vinicius Duarte, Waldir Meireles, M. Pereira, L. Carro, Stephan Wong","doi":"10.1109/SAMOS.2013.6621122","DOIUrl":"https://doi.org/10.1109/SAMOS.2013.6621122","url":null,"abstract":"In the past decade, most solutions concerning the mapping of the compute-intensive loop kernels to accelerators have used heuristics and compiler-based strategies. These facts require that most of the decisions be taken at design time, thus precluding efficient solutions that can take run-time information into account. Any success in accelerating such applications greatly depends on two steps, extracting the loops and mapping them into the architecture. This last step is a challenge in itself since it is a NP-complete problem. In this paper, we propose a runtime solution that can provide speed ups of 3 to 6 orders of magnitude for the mapping step when compared to the state-of-the-art at minimal performance degradation, by the combined usage of 3 distinct mechanisms: 1) a simple and efficient modulo scheduling heuristic, 2) a crossbar network, which simplifies the placement and routing, 3) a virtual coarse-grained reconfigurable architecture (CGRA). Additionally, since the CGRA is a virtual layer on top of an FPGA, it is possible to use any off-the-shelf FPGA without the need of special tools or IP solutions. Although the mapping is NP-complete even for crossbar-based CGRAs, experimental results demonstrate a huge reduction in compilation time, as opposed to previous solutions that require seconds to map the applications, our solution requires only microseconds to find near optimal schedules. Besides the speed up, the proposed solution enables the use of just-in-time compilation, hence it is intrinsically adaptive to a changing scenario.","PeriodicalId":382307,"journal":{"name":"2013 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116096042","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"on-Demand system reliability: The DeSyRe project","authors":"I. Sourdis","doi":"10.1109/SAMOS.2013.6621130","DOIUrl":"https://doi.org/10.1109/SAMOS.2013.6621130","url":null,"abstract":"Summary form only given. The DeSyRe project builds on-demand adaptive, reliable Systems-on-Chips. In response to the current semiconductor technology trends that make chips becoming less reliable, DeSyRe describes a new generation of by design reliable systems, at a reduced power and performance cost. This is achieved through the following main contributions. DeSyRe defines a fault-tolerant system architecture built out of unreliable components, rather than aiming at totally fault-free, and hence more costly chips. In addition, DeSyRe systems are on-demand adaptive to various types and densities of faults, as well as to other system constraints and application requirements. For leveraging on-demand adaptation/customization and reliability at reduced cost, a new dynamically reconfigurable substrate is proposed and combined with runtime system software support. The above define a generic and repeatable design framework for a large variety of SoCs, which within the project - is applied to two medical SoCs with high reliability constraints and diverse performance and power requirements. In this talk, an overview of the DeSyRe and our current research findings are described.","PeriodicalId":382307,"journal":{"name":"2013 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125052840","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modeling pipelined application with Synchronous Data Flow graphs","authors":"M. Lattuada, Fabrizio Ferrandi","doi":"10.1109/SAMOS.2013.6621105","DOIUrl":"https://doi.org/10.1109/SAMOS.2013.6621105","url":null,"abstract":"Streaming applications can efficiently exploit multiprocessors architectures by means of pipelined parallelism, but designing this type of applications can be an hard task. Different subproblems have indeed to be solved: partitioning, mapping, scheduling and pipeline stage assignment. For this reason, high level abstraction models are adopted during design flow since they simplify this process by hiding most of the architectural details. Synchronous Data Flow (SDF) graphs, widely adopted to describe streaming applications, naturally model only their partitioning, so they usually have to be integrated with other types of representations. In this paper Pipelined Application Modeling (PAM), a methodology to create a Synchronous Data Flow graph describing all the aspects of a pipelined application, is presented. The methodology starts from the SDF graph describing the partitioning of the application and enriches it with new actors and channels detailing the mapping, the scheduling and the pipeline stage assignment of the considered solution. The obtained SDF graph, describing all the aspects of the solution in a formal and compact way, facilitates the evaluation of different solutions during design space exploration.","PeriodicalId":382307,"journal":{"name":"2013 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126138638","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Faster unicores are still needed","authors":"André Seznec","doi":"10.1109/SAMOS.2013.6621098","DOIUrl":"https://doi.org/10.1109/SAMOS.2013.6621098","url":null,"abstract":"Summary form only given. For the last decade, the advent of the multicore era has driven most of the research architecture effort from the high end processor industry as well as from the computer architecture research community. However, many of the workloads executed on our servers, PCs, smartphones, tablets are still inherently sequential or multiprogrammed; even parallel applications require high sequential performance. Therefore, there are new opportunities for research in uniprocessor microarchitecture. In this presentation, I will first present the motivations of the ERC grant DAL Defying Amdahl's Law. I will then present two research actions on processor core architecture that are on-going within the DAL framework in the INRIA/IRISA ALF group: revisiting value prediction and out-of-order execution of predicated instruction sets.","PeriodicalId":382307,"journal":{"name":"2013 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122281859","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SIMD made explicit","authors":"Luc Waeijen, Dongrui She, H. Corporaal, Yifan He","doi":"10.1109/SAMOS.2013.6621142","DOIUrl":"https://doi.org/10.1109/SAMOS.2013.6621142","url":null,"abstract":"Low energy consumption has become one of the most important topics in computing. With single CPUs consuming as much as 115 Watt, engineers have been looking for ways to reduce energy consumption while maintaining high computational performance. Often wide SIMD architectures are used to achieve this, exploiting data parallelism to keep the required clock frequency low for a given compute constraint. In this paper, we propose a wide SIMD architecture with explicit datapath to further optimize energy efficiency without sacrificing computation power. To have a detailed comparison, both the proposed wide SIMD architecture and its transparent bypassing counterpart are implemented in HDL and synthesized with a TSMC 40nm low power library. The power estimation is derived from actual toggle rates generated by post-synthesis simulation. Our experimental results show that with explicit bypassing the overall energy consumption can be reduced up to 44% compared to the corresponding SIMD architecture with transparent bypassing.","PeriodicalId":382307,"journal":{"name":"2013 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS)","volume":"208 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132391799","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Alireza Rohani, H. Kerkhoff, Enrico Costenaro, D. Alexandrescu
{"title":"Pulse-length determination techniques in the rectangular single event transient fault model","authors":"Alireza Rohani, H. Kerkhoff, Enrico Costenaro, D. Alexandrescu","doi":"10.1109/SAMOS.2013.6621125","DOIUrl":"https://doi.org/10.1109/SAMOS.2013.6621125","url":null,"abstract":"One of the well-known models to represent Single Event Transient phenomenon at the logic-level is the rectangular pulse model. However, the pulse-length in this model has a vital contribution to the accuracy and validity of the rectangular pulse model. The work presented in this paper develops two approaches for determination of the pulse-length of the rectangular pulse model used in Single Event Transient (SET) faults. The first determination approach has been extracted from radiation testing along with transistor-level SET analysis tools. The second determination approach has been elicited from asymptotic analytical behaviour of SETs in 45-nm CMOS process. The results show that applying these two pulse-length determination approaches to the rectangular pulse model will cause the fault injection results converge much faster (up to sixteen times), compared to other conventional approaches.","PeriodicalId":382307,"journal":{"name":"2013 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131088923","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}