{"title":"On the Evaluation of Dense Chip-Multiprocessor Architectures","authors":"Francisco J. Villa, M. Acacio, José M. García","doi":"10.1109/ICSAMOS.2006.300804","DOIUrl":"https://doi.org/10.1109/ICSAMOS.2006.300804","url":null,"abstract":"Chip-multiprocessors (CMPs) have been revealed as the most promising way of making efficient use of current improvements in integration scale. Nowadays, commercial CMP releases integrate at most 8 processor cores onto the chip. However, 16 or more processor cores are expected to be offered in near future dense-CMP (D-CMP) systems. In this way, these architectures impose new design restrictions, and some topics, such as the cache-coherence problem, must be reviewed. In this paper we present an exhaustive performance evaluation of two recently proposed D-CMP architectures, making special emphasis on the solution to the cache-coherence problem that each one of them introduces. The shared bus fabric architecture (SBF) features a snoop cache-coherence protocol and is based on a high-performance bus fabric interconnection network. The second architecture follows a directory-based approach and integrates a bi-dimensional mesh as the interconnection network. Our results show that the performance achieved by the SBF architecture is hard-limited by the bandwidth restrictions of the bus fabric. On the other hand, the directory-based architecture outperforms the SBF one, but presents some performance inefficiencies due to the additional indirection that the directory structure stored in the L2 cache level introduces","PeriodicalId":204190,"journal":{"name":"2006 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122717352","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accelerating RTL Simulation by Several Orders of Magnitude Using Clock Suppression","authors":"H. Muhr, Roland Höler","doi":"10.1109/ICSAMOS.2006.300818","DOIUrl":"https://doi.org/10.1109/ICSAMOS.2006.300818","url":null,"abstract":"In recent years designers of embedded computer systems face a tremendous growth in complexity of their systems. This, together with the fact that the used system clock frequencies rise and that the real time required to see features start up and work correctly in an embedded system also increases, let skyrocket the simulation times of event based simulation engines. Performing these simulations on register transfer level (RTL), however, is crucial to achieve functional verification of embedded computer systems. The acceleration of such event based simulations thus is the aim of the work presented in this paper. To this end a methodology called clock suppression is presented and thoroughly discussed. To underpin the feasibility and performance of this approach, evaluation results of simulation experiments for several designs will be shown","PeriodicalId":204190,"journal":{"name":"2006 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation","volume":"297 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127639278","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance Improvements in Microprocessor Systems Utilizing a Coprocessor Data-Path","authors":"M. D. Galanis, G. Dimitroulakos, C. Goutis","doi":"10.1109/ICSAMOS.2006.300813","DOIUrl":"https://doi.org/10.1109/ICSAMOS.2006.300813","url":null,"abstract":"The speedups achieved in a generic microprocessor system by employing a high-performance data-path are presented in this work. The data-path acts as a coprocessor that accelerates computational intensive kernel regions thereby increasing the overall performance. It is composed by flexible computational components (FCCs) that can realize any two-level template of primitive operations. The automated coprocessor synthesis method and its integration to a design flow for executing applications on the system is presented. Analytical exploration in respect to the type of the custom data-path and to the microprocessor architecture is performed. The overall application speedups of eight real-life applications, relative to the software execution on the microprocessor, are estimated using the design flow. These speedups range from 1.75 to 3.95, having an average value of 2.72, while the overhead in circuit area is small. A comparison with another high-performance data-path showed that the proposed coprocessor achieves better performance while having smaller area-time products for the generated data-paths","PeriodicalId":204190,"journal":{"name":"2006 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124812567","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
S. Hessabi, M. Modarressi, M. Goudarzi, Hani JavanHemmat
{"title":"A Table-Based Application-Specific Prefetch Engine for Object-Oriented Embedded Systems","authors":"S. Hessabi, M. Modarressi, M. Goudarzi, Hani JavanHemmat","doi":"10.1109/ICSAMOS.2006.300802","DOIUrl":"https://doi.org/10.1109/ICSAMOS.2006.300802","url":null,"abstract":"A table-based application-specific data prefetching mechanism is presented in this paper. This mechanism is proposed to improve the performance of the application specific instruction-set processors (ASIP) we develop customized to an object-oriented application. In this approach, we divide the data accesses of a class method into two conditional and unconditional parts. We supply the prefetch engine with the static information about each part to prefetch all data fields of an object required by a class method when the class method is invoked. Effective management of memory access patterns by dividing them based on the method to which they belong and storing the access information of nested loops using a simple structure are the merits of the proposed mechanism. In addition, by adding a prefetch flag to cache blocks, we eliminate a large number of prefetch related tag comparisons. The results show that the proposed mechanism reduces the cache miss ratio and prefetch related tag comparisons on average by 66% and 21%, respectively","PeriodicalId":204190,"journal":{"name":"2006 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134103185","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parameterized Mapping of Algorithms onto Processor Arrays with Sub-Word Parallelism","authors":"Rainer Schaffer, R. Merker","doi":"10.1109/ICSAMOS.2006.300815","DOIUrl":"https://doi.org/10.1109/ICSAMOS.2006.300815","url":null,"abstract":"Upcoming processor architectures support parallel processing on different levels. Multiple processing elements (PEs) run in parallel. The PEs consists of several functional units and the functional units allow sub-word parallelism (SWP), i.e. the parallel execution of operations with low data word width. In this paper, a parameterized mapping of algorithms onto massively parallel processor architectures (PAs) is derived which exploits both parallelism on PA and SWP on PE level. It establishes a correlation between the parameters of the algorithms and the parameters of the PA, which enables optimization strategies with respect to several expense factors of the PA. The design approach is based on the co-partitioning method and the partitioning of data dependencies. Both are used in a hierarchical manner. Besides the parameters of the PA (such as shape, number of PEs, number of sub-words processed in parallel, channels between the PEs, and their delay), the packing instructions for exploiting SWP can be deduced","PeriodicalId":204190,"journal":{"name":"2006 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127313813","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Chip Size Estimation for SOC Design Space Exploration","authors":"H. Jeschke","doi":"10.1016/j.sysarc.2007.01.012","DOIUrl":"https://doi.org/10.1016/j.sysarc.2007.01.012","url":null,"abstract":"","PeriodicalId":204190,"journal":{"name":"2006 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"118842030","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Area-Aware Optimizations for Resource Constrained Branch Predictors Exploited in Embedded Processors","authors":"Babak Salamat, A. Baniasadi, K. J. Deris","doi":"10.1109/ICSAMOS.2006.300808","DOIUrl":"https://doi.org/10.1109/ICSAMOS.2006.300808","url":null,"abstract":"Modern embedded processors (e.g., Intel's XScale) use small and simple branch predictors to improve performance. Such predictors impose little area and power overhead but may offer low accuracy. As a result, branch misprediction rate could be high. Such mispredictions result in longer program runtime and wasted activity. To address this inefficiency, we introduce two optimization techniques: first, we introduce an adaptive and low-complexity branch prediction technique. Our branch predictor removes up to a maximum of 50% of the branch mispredictions of a bimodal predictor. This results in improving performance by up to 16%. Second, we present front-end gating techniques and reduce wasted activity up to a maximum of 32%","PeriodicalId":204190,"journal":{"name":"2006 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125150624","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Static Energy Saving Through Multi-Bank Memory Architecture","authors":"S. Lafond, J. Lilius","doi":"10.1109/ICSAMOS.2006.300807","DOIUrl":"https://doi.org/10.1109/ICSAMOS.2006.300807","url":null,"abstract":"Managing the energy consumption of embedded systems has become a major problem with the increasing demand for portable electronic devices. This paper proposes a multi-bank memory architecture as a solution to decrease the static energy cost in memory. We set up the equations ruling the optimization problem for decreasing the memory static energy cost, analyze the impact of different parameters on the energy cost and finally present some case study results","PeriodicalId":204190,"journal":{"name":"2006 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121525254","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
G. Ascia, V. Catania, A. D. Nuovo, M. Palesi, Davide Patti
{"title":"An Efficient Hierarchical Fuzzy Approach for System Level System-on-a-Chip Design","authors":"G. Ascia, V. Catania, A. D. Nuovo, M. Palesi, Davide Patti","doi":"10.1109/ICSAMOS.2006.300817","DOIUrl":"https://doi.org/10.1109/ICSAMOS.2006.300817","url":null,"abstract":"One of the most important bottleneck in the overall design flow of a complex embedded system is due to the simulation. Simulation occurs at every phase of the design flow. In this paper we focus on system level design proposing a novel approach to speed up the evaluation of a system configuration. The approach, which uses a fuzzy system as approximator for the different components of the system, is used to accelerate the design space exploration of a VLIW based parameterized SoC platform for the optimization of both performance and power dissipation. The experiments carried out on a multimedia benchmark suite, demonstrate the scalability and the accuracy of the proposed approach","PeriodicalId":204190,"journal":{"name":"2006 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115919400","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modified Hotspot Cache Architecture: A Low Energy Fast Cache for Embedded Processors","authors":"K. Ali, M. Aboelaze, S. Datta","doi":"10.1109/ICSAMOS.2006.300806","DOIUrl":"https://doi.org/10.1109/ICSAMOS.2006.300806","url":null,"abstract":"The cache memory plays a crucial role in the performance of any processor. The cache memory (SRAM), especially the on chip cache, is 3-4 times faster than the main memory (DRAM). It can vastly improve the processor performance and speed. Also the cache consumes much less energy than the main memory. That leads to a huge power saving which is very important for embedded applications. In today's processors, although the cache memory reduces the energy consumption of the processor, however the energy consumption in the on-chip cache account to almost 40% of the total energy consumption of the processor. In this paper, we propose a cache architecture, for the instruction cache, that is a modification of the hotspot architecture. Our proposed architecture consists of a small filter cache in parallel with the hotspot cache, between the L1 cache and the main memory. The small filter cache is to hold the code that was not captured by the hotspot cache. We also propose a prediction mechanism to steer the memory access to either the hotspot cache, the filter cache, or the L1 cache. Our design has both a faster access time and less energy consumption compared to both the filter cache and the hotspot cache architectures. We use Mibench and Mediabench benchmarks, together with the simplescalar simulator in order to evaluate the performance of our proposed architecture and compares it with the filter cache and the hotspot cache architectures. The simulation results show that our design outperforms both the filter cache and the hotspot cache in both the average memory access time and the energy consumption","PeriodicalId":204190,"journal":{"name":"2006 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114723083","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}