Selim Gurun, Ye Wen, Navraj Chohan, R. Wolski, C. Krintz
{"title":"SimGate: Full-System, Cycle-Close Simulation of the Stargate Sensor Network Intermediate Node","authors":"Selim Gurun, Ye Wen, Navraj Chohan, R. Wolski, C. Krintz","doi":"10.1109/ICSAMOS.2006.300819","DOIUrl":"https://doi.org/10.1109/ICSAMOS.2006.300819","url":null,"abstract":"We present SimGate - a full-system simulator for the Stargate intermediate-level, resource-constrained, sensor network device. We empirically evaluate the accuracy and performance of the system in isolation as well as coupled with simulated Mica2 motes. Our system is functionally correct and achieves accurate cycle estimation (i.e. cycle-close). Moreover, the overhead of simulated execution is modest with respect to previously published work","PeriodicalId":204190,"journal":{"name":"2006 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127556353","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Memory-constrained Block Processing Optimization for Synthesis of DSP Software","authors":"Ming-Yung Ko, Chung-Ching Shen, S. Bhattacharyya","doi":"10.1109/ICSAMOS.2006.300820","DOIUrl":"https://doi.org/10.1109/ICSAMOS.2006.300820","url":null,"abstract":"Digital signal processing (DSP) applications involve processing long streams of input data. It is important to take into account this form of processing when implementing embedded software for DSP systems. Task-level vectorization, or block processing, is a useful dataflow graph transformation that can significantly improve execution performance by allowing subsequences of data items to be processed through individual task invocations. In this way, several benefits can be obtained, including reduced context switch overhead, increased memory locality, improved utilization of processor pipelines, and use of more efficient DSP-oriented addressing modes. On the other hand, block processing generally results in increased memory requirements since it effectively increases the sizes of the input and output values associated with processing tasks. In this paper, we investigate the memory-performance tradeoff associated with block processing. We develop novel block processing algorithms that carefully take into account memory constraints to achieve efficient block processing configurations within given memory space limitations. Our experimental results indicate that these methods derive optimal memory-constrained block processing solutions most of the time. We demonstrate the advantages of our block processing techniques on practical kernel functions and applications in the DSP domain","PeriodicalId":204190,"journal":{"name":"2006 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126925863","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel Memory Implementation for Arbitrary Stride Accesses","authors":"E. Aho, Jarno Vanne, T. Hämäläinen","doi":"10.1109/ICSAMOS.2006.300801","DOIUrl":"https://doi.org/10.1109/ICSAMOS.2006.300801","url":null,"abstract":"Parallel memory modules can be used to increase memory bandwidth and feed a processor with only necessary data. Arbitrary stride access capability with interleaved memories is described in previous research where the skewing scheme is changed at run time according to the currently used stride. This paper presents improved schemes that are adapted to parallel memories. The proposed novel parallel memory implementation allows conflict-free accesses with all constant strides, which has not been possible in prior application-specific parallel memories. Moreover, the possible access locations are unrestricted, and each data pattern accesses as many data elements as there are memory modules. Timing and area estimates are given for Altera Stratix FPGA and 0.18 micrometer CMOS process with memory module count from 2 to 32. The FPGA results show 129 MHz clock frequency for a system with 16 memory modules when read and write latencies are 3 and 2 clock cycles, respectively","PeriodicalId":204190,"journal":{"name":"2006 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation","volume":"127 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124025802","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reduction of Energy Consumption in Processors by Early Detection and Bypassing of Trivial Operations","authors":"Md. Mafijul Islam, P. Stenström","doi":"10.1109/ICSAMOS.2006.300805","DOIUrl":"https://doi.org/10.1109/ICSAMOS.2006.300805","url":null,"abstract":"Previous research has established that trivial operations, i.e., instructions whose outcome can be trivially inferred from the operands, e.g. addition of zero, account for a quite significant portion of the dynamically executed instructions. By detecting them early and removing them from the pipeline, it is possible to reduce the energy consumption. This paper first presents a new classification of trivial operations that identifies, in particular, those that can be detected early in the pipeline, i.e. at the decode stage. Our analysis shows that on average as many as 10% of all executed instructions are of this kind across 12 applications from SPEC2000. We find that a majority (indeed 89%) of them are identity-trivial, in which at least one of the operands is the identity element - zero or one. By detecting them early, one can bypass their execution and eliminate register accesses if the processor uses a logical/physical register remapping unit. We find that as many as 75% of all trivial operations can be detected and eliminated at the decode stage because the identity element is available that often. With such support, we find that the energy consumption in the functional units, the result bus, the instruction window infrastructure, and the register file can be reduced by 13%, 9%, 27%, and 26%, respectively, yielding an 18% reduction of the energy in the core pipeline","PeriodicalId":204190,"journal":{"name":"2006 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation","volume":"296 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114845541","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FLUX Networks: Interconnects on Demand","authors":"S. Vassiliadis, I. Sourdis","doi":"10.1109/ICSAMOS.2006.300823","DOIUrl":"https://doi.org/10.1109/ICSAMOS.2006.300823","url":null,"abstract":"In this paper, we introduce the FLUX interconnection networks, a scheme where the interconnections of a parallel system are established on demand before or during program execution. We present a programming paradigm which can be utilized to make the proposed solution feasible. We perform several experiments to show the viability of our approach. We experiment on three case studies, evaluate different algorithms, developed for meshes or binary trees, and map them on \"grid\"-like physical interconnection networks. Our results clearly show that, based on the underlying network, different mappings are suitable for different algorithms. Even for a single algorithm, different mappings are more appropriate when the processing data size or the number of utilized nodes changes. The implication of the above is that changing interconnection topologies/mappings (dynamically) on demand depending on the program needs can be beneficial","PeriodicalId":204190,"journal":{"name":"2006 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115938366","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On-Chip Communication in Run-Time Assembled Reconfigurable Systems","authors":"P. Sedcole, P. Cheung, G. Constantinides, W. Luk","doi":"10.1109/ICSAMOS.2006.300824","DOIUrl":"https://doi.org/10.1109/ICSAMOS.2006.300824","url":null,"abstract":"Embedded systems in field-programmable gate arrays can be customised and adaptive if assembled from modular components at run time. This paper describes techniques for modelling inter-module channel behaviour based on statistical time division multiplexing. Where modules communicate over shared media, the proposed techniques enable systematic development of on-chip communication infrastructure to support run-time instantiation of components. Our techniques also allow system designers to guarantee that logical communication requirements between the adjunct modules can be satisfied by the infrastructure. An in-depth analysis is presented, and then verified with cycle-accurate simulations for the Sonic-on-chip reconfigurable platform for real-time video applications","PeriodicalId":204190,"journal":{"name":"2006 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129516416","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Simone Borgio, Davide Bosisio, Fabrizio Ferrandi, M. Monchiero, M. Santambrogio, D. Sciuto, Antonino Tumeo
{"title":"Hardware DWT accelerator for MultiProcessor System-on-Chip on FPGA","authors":"Simone Borgio, Davide Bosisio, Fabrizio Ferrandi, M. Monchiero, M. Santambrogio, D. Sciuto, Antonino Tumeo","doi":"10.1109/ICSAMOS.2006.300816","DOIUrl":"https://doi.org/10.1109/ICSAMOS.2006.300816","url":null,"abstract":"High performance multimedia applications are typical targets of today's embedded systems. These applications, complex both in terms of execution flow and amount of elaborated data, can be well addressed by multiprocessor systems-on-chip (MPSoCs). MPSoCs are composed of simple processors and memories tightly interconnected with fast communication channels, and customized IP cores for the most demanding functions can be implemented and attached to these systems to enhance performance even more. Reconfigurable devices such as FPGAs can act as a target, even programmed at runtime, for the custom IP cores, or as a prototyping platform for the whole system. Image compression such as JPEG2000 can benefit greatly from this approach and this type of architecture. This paper shows how the most demanding task of the JPEG2000 compression algorithm, the two-dimensional discrete wavelet transform, can be hardware accelerated and implemented in a multiprocessor system-on-chip prototyping platform on field programmable gate array (FPGA), CerberO. Architectures with different numbers of processors and hardware accelerators, shared among the processors or dedicated, have been implemented. To validate the approach, we show some experimental results on the platform with the hardware and the software implementation of the transformation","PeriodicalId":204190,"journal":{"name":"2006 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125891169","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. Monchiero, G. Palermo, C. Silvano, Oreste Villa
{"title":"Exploration of Distributed Shared Memory Architectures for NoC-based Multiprocessors","authors":"M. Monchiero, G. Palermo, C. Silvano, Oreste Villa","doi":"10.1109/ICSAMOS.2006.300821","DOIUrl":"https://doi.org/10.1109/ICSAMOS.2006.300821","url":null,"abstract":"Multiprocessor system-on-chip (MP-SoC) platforms represent an emerging trend for embedded multimedia applications. To enable MP-SoC platforms, scalable communication-centric interconnect fabrics, such as networks-on-chip (NoC), have been recently proposed. The shared memory represents one of the key elements in designing MP-SoCs, since its function is to provide data exchange and synchronization support. In this paper, we explore a distributed shared memory architecture that is suitable for low-power on-chip multiprocessors based on NoC. In particular, the paper focuses on the energy/delay exploration of on-chip physically distributed and logically shared memory address space for MP-SoCs based on a parameterizable NoC. The data allocation on the physically distributed shared memory space is dynamically managed by an on-chip hardware memory management unit. Experimental results show the impact of different NoC topologies and distributed shared memory configurations for a selected set of parallel benchmark applications from the power/performance perspective","PeriodicalId":204190,"journal":{"name":"2006 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132620246","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
C. Ykman-Couvreur, V. Nollet, T. Marescaux, E. Brockmeyer, F. Catthoor, H. Corporaal
{"title":"Pareto-Based Application Specification for MP-SoC Customized Run-Time Management","authors":"C. Ykman-Couvreur, V. Nollet, T. Marescaux, E. Brockmeyer, F. Catthoor, H. Corporaal","doi":"10.1109/ICSAMOS.2006.300812","DOIUrl":"https://doi.org/10.1109/ICSAMOS.2006.300812","url":null,"abstract":"In an MP-SoC environment, a customized run-time management should be incorporated on top of the basic OS services to globally optimize costs (e.g. energy consumption) across all active applications, according to constraints (e.g. performance, user requirements) and available platform resources. To that end, we have proposed a Pareto-based approach combining a design-time application mapping and platform exploration with a low-complexity run-time manager. This relieves the OS in its run-time decision making and avoids conservative worst-case assumptions. In this paper, we focus on the characterization of the Pareto-based application specification, resulting from our design-time exploration. This specification is essential as input for our run-time manager. A representative video codec multimedia application, simulated on our MP-SoC platform simulator, is used as a case study. For the resulting Pareto-based specification, both binary size and performance overhead are negligible","PeriodicalId":204190,"journal":{"name":"2006 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation","volume":" 50","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132189161","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rainer Ohlendorf, Thomas Wild, Michael Meitinger, Holm Rauchfuss, A. Herkersdorf
{"title":"Performance Evaluation of RISC-based SoC Platforms in Network Processing Applications","authors":"Rainer Ohlendorf, Thomas Wild, Michael Meitinger, Holm Rauchfuss, A. Herkersdorf","doi":"10.1109/ICSAMOS.2006.300822","DOIUrl":"https://doi.org/10.1109/ICSAMOS.2006.300822","url":null,"abstract":"In this paper, results of a simulative performance evaluation of RISC-based SoC platforms for networking applications are presented. We use our SystemC simulation environment that is calibrated with a reference implementation on an FPGA-based prototyping environment, consisting of a single RISC-CPU, memory system, Ethernet MAC and an autonomous DMA engine. In order to achieve precise results, a real IP stack has been profiled. Starting with an analysis of the reference scenario, two approaches for improvements are investigated. First, hardware assists are added, which offload compute-intensive bit-level manipulations from the CPU. Second, the concept of flexible processing paths as proposed in FlexPath NP with AutoRoute is evaluated, in which some part of the traffic can bypass the central CPU cluster. For each of the three scenarios the maximum throughput is determined, and the improvements and limitations of each solution are discussed. It can be shown that a FlexPath NP achieves up to 2.5 times the throughput of the unoptimized reference scenario under realistic traffic assumptions","PeriodicalId":204190,"journal":{"name":"2006 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114434731","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}