{"title":"Lightweight resource estimation model to extend battery life in video playback","authors":"Tero Rintaluoma, O. Silvén","doi":"10.1109/SAMOS.2013.6621111","DOIUrl":"https://doi.org/10.1109/SAMOS.2013.6621111","url":null,"abstract":"Modern mobile computing devices are integrating more and more processing units into a single platform. For this reason separate resource managers are needed to control computing resources effectively and in an energy efficient way. Multimedia signal processing applications, such as a video playback, have huge variations in their resource needs depending on the input sequence characteristics. To improve this situation a resource usage estimation model has been developed and evaluated. By using the model it is possible to estimate the resource needs with less than 10% error marginal on average. A proper exploitation of prior information improves both the user experience and the battery life. On the testing platform already a single step in the dynamic voltage and frequency scaling scheme corresponds to 30 minutes in high definition video playback time. The approach does not require changes to the existing encoders or decoders.","PeriodicalId":382307,"journal":{"name":"2013 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS)","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134365203","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
K. Desnos, M. Pelcat, J. Nezan, S. Bhattacharyya, Slaheddine Aridhi
{"title":"PiMM: Parameterized and Interfaced dataflow Meta-Model for MPSoCs runtime reconfiguration","authors":"K. Desnos, M. Pelcat, J. Nezan, S. Bhattacharyya, Slaheddine Aridhi","doi":"10.1109/SAMOS.2013.6621104","DOIUrl":"https://doi.org/10.1109/SAMOS.2013.6621104","url":null,"abstract":"Dataflow models of computation are widely used for the specification, analysis, and optimization of Digital Signal Processing (DSP) applications. In this paper a new meta-model called PiMM is introduced to address the important challenge of managing dynamics in DSP-oriented representations. PiMM extends a dataflow model by introducing an explicit parameter dependency tree and an interface-based hierarchical compositionality mechanism. PiMM favors the design of highly-efficient heterogeneous multicore systems, specifying algorithms with customizable trade-offs among predictability and exploitation of both static and adaptive task, data and pipeline parallelism. PiMM fosters design space exploration and reconfigurable resource allocation in a flexible dynamic dataflow context.","PeriodicalId":382307,"journal":{"name":"2013 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS)","volume":"80 6","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131450944","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An embedded hardware-efficient architecture for real-time cascade Support Vector Machine classification","authors":"C. Kyrkou, T. Theocharides, C. Bouganis","doi":"10.1109/SAMOS.2013.6621115","DOIUrl":"https://doi.org/10.1109/SAMOS.2013.6621115","url":null,"abstract":"Support Vector Machines (SVMs) are considered as a state-of-the-art classification algorithm yielding high accuracy rates. However, SVMs often require processing a large number of support vectors, making the classification process computationally demanding, especially when considering embedded applications. Cascade SVMs have been proposed in an attempt to speed-up classification times, but improved performance comes at a cost of additional hardware resources. Consequently, in this paper we propose an optimized architecture for cascaded SVM processing, along with a hardware reduction method in order to reduce the overheads from the implementation of additional stages in the cascade, leading to significant resource and power savings for embedded applications. The architecture was implemented on a Virtex 5 FPGA platform and evaluated using face detection as the target application on 640×480 resolution images. Additionally, it was compared against implementations of the same cascade processing architecture but without using the reduction method, and a single parallel SVM classifier. The proposed architecture achieves an average performance of 70 frames-per-second, demonstrating a speed-up of 5× over the single parallel SVM classifier. Furthermore, the hardware reduction method results in the utilization of 43% less hardware resources and a 20% reduction in power, with only 0.7% reduction in classification accuracy.","PeriodicalId":382307,"journal":{"name":"2013 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116295411","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Balaji Raman, Ayoub Nouri, Deepak Gangadharan, M. Bozga, A. Basu, Mayur Maheshwari, Axel Legay, S. Bensalem, S. Chakraborty
{"title":"Stochastic modeling and performance analysis of multimedia SoCs","authors":"Balaji Raman, Ayoub Nouri, Deepak Gangadharan, M. Bozga, A. Basu, Mayur Maheshwari, Axel Legay, S. Bensalem, S. Chakraborty","doi":"10.1109/SAMOS.2013.6621117","DOIUrl":"https://doi.org/10.1109/SAMOS.2013.6621117","url":null,"abstract":"Reliability and flexibility are among the key required features of a framework used to model a system. Existing approaches to design resource-constrained, soft-real time systems either provide guarantees for output quality or account for loss in the system, but not both. We propose two independent solutions where each modeling technique has both the above mentioned characteristics. We present a probabilistic analytical framework and a statistical model checking approach to design system-on-chips for low-cost multimedia systems. We apply the modeling techniques to size the output buffer in a video decoder. The results shows that, for our stochastic design metric, the analytical framework upper bounds (and relatively accurate) compare to the statistical model checking technique. Also, we observed significant reduction in resource usage (such as output buffer size) with tolerable loss in output quality.","PeriodicalId":382307,"journal":{"name":"2013 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128203118","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
R. Poss, M. Lankamp, Qiang Yang, Jian Fu, M. I. Uddin, C. Jesshope
{"title":"MGSim — A simulation environment for multi-core research and education","authors":"R. Poss, M. Lankamp, Qiang Yang, Jian Fu, M. I. Uddin, C. Jesshope","doi":"10.1109/SAMOS.2013.6621109","DOIUrl":"https://doi.org/10.1109/SAMOS.2013.6621109","url":null,"abstract":"This article presents MGSim1, an open source discrete event simulator for on-chip hardware components developed at the University of Amsterdam. MGSim is used as research and teaching vehicle to study the fine-grained hardware/software interactions on many-core chips with and without hardware multithreading. MGSim's component library includes support for core models with different instruction sets, a configurable interconnect, multiple configurable cache and memory models, a dedicated I/O subsystem, and comprehensive monitoring and interaction facilities. The default model configuration shipped with MGSim implements Microgrids, a multi-core architecture with hardware concurrency management. MGSim is furthermore written mostly in C++ and uses object classes to represent chip components. It is optimized for architecture models that can be described as process networks.","PeriodicalId":382307,"journal":{"name":"2013 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS)","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132882405","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic task mapping onto multi-core architectures through stream rewriting","authors":"Lars Middendorf, C. Zebelein, C. Haubelt","doi":"10.1109/SAMOS.2013.6621123","DOIUrl":"https://doi.org/10.1109/SAMOS.2013.6621123","url":null,"abstract":"Task graphs provide an efficient model of computation for specification, analysis, and implementation of concurrent applications. In this paper, we present a novel approach for mapping the class of series-parallel task graphs onto multi-core architectures based on pattern matching. Both the topology of the graph and the state of the tasks are encoded as a stream of tokens, which is iteratively rewritten at multiple positions in parallel. Hence, our technique is most useful for compute-intensive applications that must adapt to frequently varying and unpredictable workload at runtime. Several complex examples have been evaluated on a multi-core architecture and the experimental results show the effectiveness of our approach.","PeriodicalId":382307,"journal":{"name":"2013 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130537725","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FlexCore: Implementing an exposed datapath processor","authors":"Magnus Själander, P. Larsson-Edefors","doi":"10.1109/SAMOS.2013.6621139","DOIUrl":"https://doi.org/10.1109/SAMOS.2013.6621139","url":null,"abstract":"The FlexCore processor is the resulting implementation of an exposed datapath approach conceptualized in the FlexSoC programme. By way of a crossbar switch interconnect, all execution units in a FlexCore datapath can potentially communicate, allowing the inherent hardware parallelism to be utilized. This interconnect enables configuration of a datapath to match an application domain, for example, by way of datapath accelerators. The baseline FlexCore is a general-purpose processor (GPP) and since all FlexCore configurations are extensions to the baseline, they offer GPP functionality as complement to the domain-specific functionality. This paper gives an overview of the implementation of complete FlexCore processors, accompanied with discussions on datapath interconnects, datapath extensions and instruction decompression.","PeriodicalId":382307,"journal":{"name":"2013 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123375010","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Rethinking computer architecture for throughput computing","authors":"Wen-mei W. Hwu","doi":"10.1109/SAMOS.2013.6621096","DOIUrl":"https://doi.org/10.1109/SAMOS.2013.6621096","url":null,"abstract":"Summary form only given. The rise of rich media in mobile devices and massive analytics in data centers has created new opportunities and challenges for computer architects. On one hand, commercial computer organizations have been undergoing fast transformation to drastically increase the throughput of processing large amounts of data while keeping the power consumption in check. On the other hand, computer architecture has evolved too slowly to facilitate hardware innovations, software productivity, algorithm advancement and user perceived improvements. In this talk, I will present some major challenges facing the computer architecture research community and some recent advancements in throughput computing. I will conclude by arguing that we must rethink the scope of computer architecture research as we seek to create growth paths for the computer systems industry.","PeriodicalId":382307,"journal":{"name":"2013 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130757839","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TimeCube: A manycore embedded processor with interference-agnostic progress tracking","authors":"Anshuman Gupta, J. Sampson, M. Taylor","doi":"10.1109/SAMOS.2013.6621127","DOIUrl":"https://doi.org/10.1109/SAMOS.2013.6621127","url":null,"abstract":"Recently introduced processors such as Tilera's Tile Gx100 and Intel's 48-core SCC have delivered on the promise of high performance per watt in manycore processors, making these architectures ostensibly as attractive for low-power embedded processors as for cloud services. However, these architectures space-multiplex the microarchitectural resources between many threads to increase utilization, which leads to potentially large and varying levels of interference. This decorrelates CPU-time from actual application progress and decreases the ability of traditional software to accurately track and finely control application progress, hindering the adoption of manycore processors in embedded computing. In this paper we propose Progress Time as the counterpart of CPU-time in space-multiplexed systems and show how it can be used to track application progress. We also introduce TimeCube, a manycore embedded processor that uses dynamic execution isolation and shadow performance modeling to provide an accurate online measurement of each application's Progress Time. Our evaluation shows that a 32-core TimeCube processor can track application progress with less than 1% error even in the presence of a 6× average worst-case slowdown. TimeCube also uses Progress Times to perform online architectural resource management that leads to a 36% improvement in throughput compared to existing microarchitectural resource allocation schemes. Overall, the results argue for adding the requisite microarchitectural structures to support Progress Time in manycore chips for embedded systems.","PeriodicalId":382307,"journal":{"name":"2013 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114061829","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"OpenCL code generation for low energy wide SIMD architectures with explicit datapath","authors":"Dongrui She, Yifan He, Luc Waeijen, H. Corporaal","doi":"10.1109/SAMOS.2013.6621141","DOIUrl":"https://doi.org/10.1109/SAMOS.2013.6621141","url":null,"abstract":"Energy efficiency is one of the most important aspects in designing embedded processors. The use of a wide SIMD processor architecture is a promising approach to build energy-efficient high performance embedded processors. In this paper, we propose a configurable wide SIMD architecture that utilizes explicit datapath to achieve high energy efficiency. To efficiently program the proposed architecture with a standard parallel programming language, we introduce a tool flow that can compile and map OpenCL programs onto it. The compiler in the proposed tool flow is able to analyze the static access patterns in OpenCL kernels and generate efficient mapping and code that utilizes the explicit datapath. Experimental results show that the proposed architecture is efficient. In a 128-PE processor, the proposed architecture is able to achieve over 200 times speed-up and reduce the energy consumption of register file and memory by over 90% compared to a RISC processor.","PeriodicalId":382307,"journal":{"name":"2013 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124819276","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}