{"title":"Maximum performance computing for exascale applications","authors":"O. Mencer","doi":"10.1109/SAMOS.2012.6404150","DOIUrl":"https://doi.org/10.1109/SAMOS.2012.6404150","url":null,"abstract":"Summary form only given. Ever since Fermi, Pasta and Ulam conducted the first fundamentally important numerical experiments in 1953, science has been driven by the progress of available computational capability. In particular, computational quantum chemistry and computational quantum physics depend on ever increasing amounts of computation. However, due to power density limitations at the chip we have seen the end of single CPU performance scaling. Now the challenge is to improve compute performance through some form of parallel processing without incurring power limits at the system level. One way to deal with the system “power wall” question is to ask “what is the maximum amount of computation that can be achieved within a certain power budget”. We argue that such Maximum Performance Computing needs to focus on end-to-end execution time of complete scientific applications and needs to include a multi-disciplinary approach, bringing together scientists and engineers to optimize the whole process from mathematics and algorithms all the way down to arithmetic and number representation. We have done a number of such multidisciplinary studies with our customers (Chevron, Schlumberger, and JP Morgan). Our current results with Maxeler Dataflow Engines for production PDE solver applications in Earth Sciences and Finance show an improvement of 20-40x in Speed and/or Watts per application run.","PeriodicalId":130275,"journal":{"name":"2012 International Conference on Embedded Computer Systems (SAMOS)","volume":"6 6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132779564","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An efficient asymmetric distributed lock for embedded multiprocessor systems","authors":"J. Rutgers, M. Bekooij, G. Smit","doi":"10.1109/SAMOS.2012.6404172","DOIUrl":"https://doi.org/10.1109/SAMOS.2012.6404172","url":null,"abstract":"Efficient synchronization is a key concern in an embedded many-core system-on-chip (SoC). The use of atomic read-modify-write instructions combined with cache coherency as synchronization primitive is not always an option for shared-memory SoCs due to the lack of suitable IP. Furthermore, there are doubts about the scalability of hardware cache coherency protocols. Existing distributed locks for NUMA multiprocessor systems do not rely on cache coherency and are more scalable, but exchange many messages per lock. This paper introduces an asymmetric distributed lock algorithm for shared-memory embedded multiprocessor systems without hardware cache coherency. Messages are exchanged via a low-cost inter-processor communication ring in combination with a small local memory per processor. Typically, a mutex is used over and over again by the same process, which is exploited by our algorithm. As a result, the number of messages exchanged per lock is significantly reduced. Experiments with our 32-core system show that when having locks in SDRAM, 35% of the memory traffic is lock related. In comparison, our solution eliminates all of this traffic and reduces the execution time by up to 89%.","PeriodicalId":130275,"journal":{"name":"2012 International Conference on Embedded Computer Systems (SAMOS)","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122696336","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A quantitative analysis of fixed-point LDPC-decoder implementations using hardware-accelerated HDL emulations","authors":"Matthias Korb, T. Noll","doi":"10.1109/SAMOS.2012.6404189","DOIUrl":"https://doi.org/10.1109/SAMOS.2012.6404189","url":null,"abstract":"Using hardware-accelerated HDL emulators of fixed-point implementations has several advantages in comparison to C-based simulations: The high degree of parallelism for example of field-programmable gate-array based hardware accelerators promise an increased emulation throughput. Furthermore, the HDL model of the considered circuit can be used in the following design process making an additional verification dispensable. For a system analysis of different low-density parity-check (LDPC) decoders such an emulator is practically inevitable from a throughput perspective: the outstanding error correction capability of those decoders allowing for bit-error rates (BER) of well below 10-10 requires a simulative decoding of billions of blocks. In this work, an HDL-based emulator is used. The designed HDL model is highly parameterizable and includes an LDPC decoder and high-quality Box-Muller-based white Gaussian-noise generators to create rare error-events. Using this emulator a comparison of the decoding capability of different fixed-point decoder implementations has been performed. Additionally, accurate cost-models are used for estimating the hardware costs of the different decoder implementations which enable an identification of Pareto-optimal decoder implementations. Finally, the achievable emulator throughput is discussed and compared to the simulation throughput of a speed optimized C-model.","PeriodicalId":130275,"journal":{"name":"2012 International Conference on Embedded Computer Systems (SAMOS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128706442","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Predictable dynamic embedded data processing","authors":"M. Geilen, S. Stuijk, T. Basten","doi":"10.1109/SAMOS.2012.6404194","DOIUrl":"https://doi.org/10.1109/SAMOS.2012.6404194","url":null,"abstract":"Cyber-physical systems interact with their physical environment. In this interaction, non-functional aspects, most notably timing, are essential to correct operation. In modern systems, dynamism is introduced in many different ways. The additional complexity threatens timely development and reliable operation. Applications often have different modes of operation with different resource requirements and different levels of required quality-of-service. Moreover, multiple applications in dynamically changing combinations share a platform and its resources. To preserve efficient development of such systems, dynamism needs to be taken into account as a primary concern, not as a verification or tuning effort after the design is done. This requires a model-driven design approach in which timing of interaction with the physical environment is taken into consideration; formal models capture applications and their platforms in the physical environment. Moreover, platforms with resources and resource arbitration are needed that allow for predictable and reliable behavior to be realized. Run-time management is further required to deal with dynamic use-cases and dynamic trade-offs encountered at run-time. In this paper, we present a model-driven approach that combines model-based design and synthesis with development of platforms that support predictable, repeatable, composable realizations and a run-time management approach to deal with dynamic use-cases at run-time. A formal, compositional model is used to exploit Pareto-optimal trade-offs in the system use. The approach is illustrated with dataflow models with dynamic application scenarios, a predictable platform architecture and run-time resource management that determines optimal trade-offs through an efficient knapsack heuristic.","PeriodicalId":130275,"journal":{"name":"2012 International Conference on Embedded Computer Systems (SAMOS)","volume":"81 1-2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116733115","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient system design using the Statistical Analysis of Architectural Bottlenecks methodology","authors":"Manish Arora, Feng Wang, Bob Rychlik, D. Tullsen","doi":"10.1109/SAMOS.2012.6404177","DOIUrl":"https://doi.org/10.1109/SAMOS.2012.6404177","url":null,"abstract":"CPU processor design involves a large set of increasingly complex design decisions Doing full, accurate simulation of all possible designs is typically not feasible. Prior techniques for sensitivity analysis seek to identify the most critical design parameters, but also struggle to handle the increasing design space well. They can be overly sensitive to the starting fixed point of the design, can still require a large number of simulations, and do not necessarily account for the cost of each design parameter. The Statistical Analysis of Architectural Bottlenecks (SAAB) methodology simultaneously analyzes multiple parameters and requires a small number of experiments. SAAB leverages the Plackett and Burman analysis method, but builds upon the technique in two specific ways. It allows a parameter to take multiple values and replaces the unit-less impact factor with a cost-proportional impact value. This paper applies the SAAB methodology to the design of a mobile processor sub-system. It considers area and power cost models for the design.","PeriodicalId":130275,"journal":{"name":"2012 International Conference on Embedded Computer Systems (SAMOS)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115667048","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
F. Lemonnier, P. Millet, G. M. Almeida, M. Hübner, J. Becker, S. Pillement, O. Sentieys, Martijn Koedam, Shubhendu Sinha, K. Goossens, C. Piguet, M. Morgan, R. Lemaire
{"title":"Towards future adaptive multiprocessor systems-on-chip: An innovative approach for flexible architectures","authors":"F. Lemonnier, P. Millet, G. M. Almeida, M. Hübner, J. Becker, S. Pillement, O. Sentieys, Martijn Koedam, Shubhendu Sinha, K. Goossens, C. Piguet, M. Morgan, R. Lemaire","doi":"10.1109/SAMOS.2012.6404179","DOIUrl":"https://doi.org/10.1109/SAMOS.2012.6404179","url":null,"abstract":"This paper introduces adaptive techniques targeted for heterogeneous manycore architectures and introduces the FlexTiles platform, which consists of general purpose processors with some dedicated accelerators. The different components are based on low power DSP cores and an eFPGA on which dedicated IPs can be dynamically configured at run-time. These features enable a breakthrough in term of computing performance while improving the on-line adaptive capabilities brought from smart heuristics. Thus, we propose a virtualisation layer which provides a higher abstraction level to mask the underlying heterogeneity present in such architectures. Given the large variety of possible use cases that these platforms must support and the resulting workload variability, offline approaches are no longer sufficient because they do not allow coping with time changing workloads. The upcoming generation of applications include smart cameras, drones, and cognitive radio. In order to facilitate the architecture adaptation under different scenarios, we propose a programming model that considers both static and dynamic behaviors. This is associated with self adaptive strategies endowed by an operating system kernel that provides a set of functions that guarantee quality of service (QoS) by implementing runtime adaptive policies. Dynamic adaptation will be mainly used to reduce both overall power consumption and temperature and to ease the problem of decreasing yield and reliability that results from submicron CMOS scales.","PeriodicalId":130275,"journal":{"name":"2012 International Conference on Embedded Computer Systems (SAMOS)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128070062","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Design space exploration in application-specific hardware synthesis for multiple communicating nested loops","authors":"R. Corvino, A. Gamatie, M. Geilen, L. Józwiak","doi":"10.1109/SAMOS.2012.6404166","DOIUrl":"https://doi.org/10.1109/SAMOS.2012.6404166","url":null,"abstract":"Application specific MPSoCs are often used to implement high-performance data-intensive applications. MPSoC design requires a rapid and efficient exploration of the hardware architecture possibilities to adequately orchestrate the data distribution and architecture of parallel MPSoC computing resources. Behavioral specifications of data-intensive applications are usually given in the form of a loop-based sequential code, which requires parallelization and task scheduling for an efficient MPSoC implementation. Existing approaches in application specific hardware synthesis, use loop transformations to efficiently parallelize single nested loops and use Synchronous Data Flows to statically schedule and balance the data production and consumption of multiple communicating loops. This creates a separation between data and task parallelism analyses, which can reduce the possibilities for throughput optimization in high-performance data-intensive applications. This paper proposes a method for a concurrent exploration of data and task parallelism when using loop transformations to optimize data transfer and storage mechanisms for both single and multiple communicating nested loops. This method provides orchestrated application specific decisions on communication architecture, memory hierarchy and computing resource parallelism. It is computationally efficient and produces high-performance architectures.","PeriodicalId":130275,"journal":{"name":"2012 International Conference on Embedded Computer Systems (SAMOS)","volume":"136 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125131416","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The homogeneity of architecture in a heterogeneous world","authors":"J. Goodacre","doi":"10.1109/SAMOS.2012.6404148","DOIUrl":"https://doi.org/10.1109/SAMOS.2012.6404148","url":null,"abstract":"Summary form only given. It has long been accepted within embedded computing that using a heterogeneous core focused to a specific task can deliver improved performance and subsequent improved power efficiency. The challenge has always been how can the software programmer integrate this hardware diversity as workloads become generalized or often unknown at design time? Using library abstraction permits specific tasks to benefit from heterogeneity, but how can general purpose code benefit? This talk will describe the hardware and software techniques currently being developed in next generation ARM based SoC to address the challenge of maintaining the homogeneity of the software architecture while extending to the benefits of heterogeneity in hardware.","PeriodicalId":130275,"journal":{"name":"2012 International Conference on Embedded Computer Systems (SAMOS)","volume":"353 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125669846","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Just-in-Time Verification in ADL-based processor design","authors":"Dominik Auras, Andreas Minwegen, Uwe Deidersen","doi":"10.1109/SAMOS.2012.6404151","DOIUrl":"https://doi.org/10.1109/SAMOS.2012.6404151","url":null,"abstract":"A novel verification methodology, combining the two new techniques of Live Verification and Processor State Transfer, is introduced to Architecture Description Language (ADL) based processor design. The proposed Just-in-Time Verification significantly accelerates the simulation-based equivalence check of the register-transfer and instruction-set level models, generated from the ADL-based specification. This is accomplished by omitting redundant simulation steps occurring in the conventional architecture debug cycle. The potential speedup is demonstrated with a case study, achieving an acceleration of the debug cycle by 660x.","PeriodicalId":130275,"journal":{"name":"2012 International Conference on Embedded Computer Systems (SAMOS)","volume":"119 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120937870","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
G. Goulas, P. Alefragis, N. Voros, Christos Valouxis, Christos G Gogos, N. Kavvadias, G. Dimitroulakos, K. Masselos, D. Göhringer, Steven Derrien, D. Ménard, O. Sentieys, M. Hübner, Timo Stripf, Oliver Oey, J. Becker, G. Rauwerda, K. Sunesen, D. Kritharidis, N. Mitas
{"title":"From Scilab to multicore embedded systems: Algorithms and methodologies","authors":"G. Goulas, P. Alefragis, N. Voros, Christos Valouxis, Christos G Gogos, N. Kavvadias, G. Dimitroulakos, K. Masselos, D. Göhringer, Steven Derrien, D. Ménard, O. Sentieys, M. Hübner, Timo Stripf, Oliver Oey, J. Becker, G. Rauwerda, K. Sunesen, D. Kritharidis, N. Mitas","doi":"10.1109/SAMOS.2012.6404184","DOIUrl":"https://doi.org/10.1109/SAMOS.2012.6404184","url":null,"abstract":"While advances in processor architecture continues to increase hardware parallelism, parallel software creation is hard. There is an increasing need for tools and methodologies to narrow the entry gap for non-experts in parallel software development as well as to streamline the work for experts. This paper presents the methodology and algorithms for the creation of parallel software written in Scilab source code for multicore embedded processors in the context of the “Architecture oriented paraLlelization for high performance embedded Multicore systems using scilAb” (ALMA) EU FP7 project. The ALMA parallelization approach in a nutshell attempts to manage the complexity of the task by alternating focus between very localized and holistic view program optimization strategies.","PeriodicalId":130275,"journal":{"name":"2012 International Conference on Embedded Computer Systems (SAMOS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129785617","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}