{"title":"Fast cosimulation of transformative systems with OS support on SMP computer","authors":"Zhengting He, A. Mok","doi":"10.1145/1016720.1016761","DOIUrl":"https://doi.org/10.1145/1016720.1016761","url":null,"abstract":"Transformative applications are a class of dataflow computation characterized by iterative behavior. The problem of partitioning a transformative application specification to a set of available hardware (HW) and software (SW) processing elements (PEs) and derivation of a job execution order (scheduling) on them has been quite well studied, but the problem of obtaining fast simulation of these applications poses different constraints. In this paper, we propose an efficient framework for a symmetric multi-processor (SMP) simulation host to achieve fast HW/SW co-simulation for transformative applications, given the partition solutions and the derived schedulers. The framework overcomes the limitations in existing Linux SMP kernel and requires only a reasonable amount of modifications to it. We also present a heuristic algorithm which effectively assigns simulation tasks to the processors on the simulation host, considering both average job simulation time on each processor and other simulation overhead. Our experiments show that the algorithm is able to find satisfactory suboptimal solutions with very little computation time. Based on the task assignment solution, the simulation time can be reduced by 25% to 50% from the obvious but naive approach.","PeriodicalId":127038,"journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis, 2004. CODES + ISSS 2004.","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123328766","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient mapping of hierarchical trees on coarse-grain reconfigurable architectures","authors":"F. Rivera, M. Sanchez-Elez, M. Fernandez, R. Hermida, N. Bagherzadeh","doi":"10.1145/1016720.1016731","DOIUrl":"https://doi.org/10.1145/1016720.1016731","url":null,"abstract":"Reconfigurable architectures have become increasingly important in recent years. We present an approach to the problem of executing 3D graphics interactive applications onto these architectures. The hierarchical trees are usually implemented to reduce the data processed, thereby diminishing the execution time. We have developed a mapping scheme that parallelizes the tree execution onto a SIMD reconfigurable architecture. This mapping scheme considerably reduces the time penalty caused by the possibility of executing different tree nodes in SIMD fashion. We have developed a technique that achieves an efficient hierarchical tree execution taking decisions at execution time. It also promotes the possibility of data coherence in order to reduce the execution time. The experimental results show high performance and efficient resource utilization on tested applications.","PeriodicalId":127038,"journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis, 2004. CODES + ISSS 2004.","volume":"22 2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114015214","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Detecting overflow detection","authors":"V. Kotlyar, M. Moudgill","doi":"10.1145/1016720.1016732","DOIUrl":"https://doi.org/10.1145/1016720.1016732","url":null,"abstract":"Fixed-point saturating arithmetic is widely used in media and digital signal processing applications. A number of processor architectures provide instructions that implement saturating operations. However, standard high-level languages, such as ANSI C, provide no direct support for saturating arithmetic. Applications written in standard languages have to implement saturating operations in terms of basic two's complement operations. In order to provide fast execution of such programs it is important to have an optimizing compiler automatically detect and convert appropriate code fragments to hardware instructions. We present a set of techniques for automatic recognition of saturating arithmetic operations. We show that in most cases the recognition problem is simply one of Boolean circuit equivalence. Given the expense of solving circuit equivalence, we develop a set of practical approximations based on abstract interpretation. Experiments show that our techniques, while reliably recognizing saturating arithmetic, have small compile-time overhead. We also demonstrate that our approach is not limited to saturating arithmetic, but is directly applicable to recognizing other idioms, such as add-with-carry and absolute value.","PeriodicalId":127038,"journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis, 2004. CODES + ISSS 2004.","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130265077","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Power analysis of system-level on-chip communication architectures","authors":"K. Lahiri, A. Raghunathan","doi":"10.1145/1016720.1016777","DOIUrl":"https://doi.org/10.1145/1016720.1016777","url":null,"abstract":"For complex system-on-chips (SoCs) fabricated in nanometer technologies, the system-level on-chip communication architecture is emerging as a significant source of power consumption. Managing and optimizing this important component of SoC power requires a detailed understanding of the characteristics of its power consumption. Various power estimation and low-power design techniques have been proposed for the global interconnects that form part of SoC communication architectures (e.g., low-swing buses, bus encoding, etc). While effective, they only address a limited part of communication architecture power consumption. A state-of-the-art communication architecture, viewed in its entirety, is quite complex, comprising several components, such as bus interfaces, arbiters, bridges, decoders, and multiplexers, in addition to the global bus lines. Relatively little research has focused on analyzing and comparing the power consumed by different components of the communication architecture. In this work, we present a systematic evaluation and analysis of the power consumed by a state-of-the-art communication architecture (the AMBA on-chip bus), using a commercial design flow. We focus on developing a quantitative understanding of the relative contributions of different communication architecture components to its power consumption, and the factors on which they depend. We decompose the communication architecture power into power consumed by logic components (such as arbiters, decoders, bus bridges), global bus lines (that carry address, data, and control information), and bus interfaces. We also perform studies that analyze the impact of varying application traffic characteristics, and varying SoC complexity, on communication architecture power. Based on our analyses, we evaluate different techniques for reducing the power consumed by the on-chip communication architecture, and compare their effectiveness in achieving power savings at the system level. In addition to quantitatively reinforcing the view that on-chip communication is an important target for system-level power optimization, our work demonstrates (i) the importance of considering the communication architecture in its entirety, and (ii) the opportunities that exist for power reduction through careful communication architecture design.","PeriodicalId":127038,"journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis, 2004. CODES + ISSS 2004.","volume":"89 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124387532","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Memory system design space exploration for low-power, real-time speech recognition","authors":"R. Krishna, S. Mahlke, T. Austin","doi":"10.1145/1016720.1016756","DOIUrl":"https://doi.org/10.1145/1016720.1016756","url":null,"abstract":"The recent proliferation of computing technology has generated new interest in natural I/O interface technologies such as speech recognition. Unfortunately, the computational and memory demands of such applications currently prohibit their use on low-power portable devices in anything more than their simplest forms. Previous work has demonstrated that the thread level concurrency inherent in this application domain can be used to dramatically improve performance with minimal impact on overall system energy consumption, but that such benefits are severely constrained by memory system bandwidth. This work presents a design space exploration of potential memory system architectures. A range of low-power memory organizations are considered, from conventional caching to more advanced system-on-chip implementations. We find that, given architectures able to exploit concurrency in this domain, large L2 based cache hierarchies and high bandwidth memory systems employing data stream partitioning and on-chip embedded DRAM and ROM technologies can provide much of the performance of idealized memory systems without violating the power constraints of the low-power domain.","PeriodicalId":127038,"journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis, 2004. CODES + ISSS 2004.","volume":"293 10","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120923742","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient search space exploration for HW-SW partitioning","authors":"S. Banerjee, N. Dutt","doi":"10.1145/1016720.1016752","DOIUrl":"https://doi.org/10.1145/1016720.1016752","url":null,"abstract":"Hardware/software (HW-SW) partitioning is a key problem in the codesign of embedded systems, studied extensively in the past. One major open challenge for traditional partitioning approaches - as we move to more complex and heterogeneous SoCs - is the lack of efficient exploration of the large space of possible HW/SW configurations, coupled with the inability to efficiently scale up with larger problem sizes. We make two contributions for HW-SW partitioning of applications represented as procedural call-graphs: 1) we prove that during partitioning, the execution time metric for moving a vertex needs to be updated only for the immediate neighbours of the vertex, rather than for all ancestors along paths to the root vertex; consequently, we observe faster run-times for move-based partitioning algorithms such as simulated annealing (SA), allowing call graphs with thousands of vertices to be processed in less than a second, and 2) we devise a new cost function for SA that allows frequent discovery of better partitioning solutions by searching spaces overlooked by traditional SA cost functions. We present experimental results on a very large design space, where several thousand configurations are explored in minutes as compared to several hours or days using a traditional SA formulation. Furthermore, our approach is frequently able to locate better design points with over 10% improvement in application execution time compared to the solutions generated by a Kernighan-Lin partitioning algorithm starting with an all-SW partitioning.","PeriodicalId":127038,"journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis, 2004. CODES + ISSS 2004.","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134266771","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic overlay of scratchpad memory for energy minimization","authors":"Manish Verma, L. Wehmeyer, P. Marwedel","doi":"10.1109/CODES+ISSS.2004.20","DOIUrl":"https://doi.org/10.1109/CODES+ISSS.2004.20","url":null,"abstract":"The memory subsystem accounts for a significant portion of the aggregate energy budget of contemporary embedded systems. Moreover, there exists a large potential for optimizing the energy consumption of the memory subsystem. Consequently, novel memories as well as novel algorithms for their efficient utilization are being designed. Scratchpads are known to perform better than caches in terms of power, performance, area and predictability. However, unlike caches they depend upon software allocation techniques for their utilization. We present an allocation technique which analyzes the application and inserts instructions to dynamically copy both code segments and variables onto the scratchpad at runtime. We demonstrate that the problem of dynamically overlaying scratchpad is an extension of the global register allocation problem. The overlay problem is solved optimally using ILP formulation techniques. Our approach improves upon the only previously known allocation technique for statically allocating both variables and code segments onto the scratchpad. Experiments report an average reduction of 34% and 18% in the energy consumption and the runtime of the applications, respectively. A minimal increase in code size is also reported.","PeriodicalId":127038,"journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis, 2004. CODES + ISSS 2004.","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134521550","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast cycle-accurate simulation and instruction set generation for constraint-based descriptions of programmable architectures","authors":"S. Weber, M. Moskewicz, M. Gries, C. Sauer, K. Keutzer","doi":"10.1145/1016720.1016728","DOIUrl":"https://doi.org/10.1145/1016720.1016728","url":null,"abstract":"State-of-the-art architecture description languages have been successfully used to model application-specific programmable architectures limited to particular control schemes. We introduce a language and methodology that provide a framework for constructing and simulating a wider range of architectures. The framework exploits the fact that designers are often only concerned with data paths, not the instruction set and control. In the framework, each processing element is described in a structural language that only requires the specification of the data path and constraints on how it can be used. From such a description, the supported operations of the processing element are automatically extracted and a controller is generated. Various architectures are then realized by composing the processing elements. Furthermore, hardware descriptions and bit-true cycle-accurate simulators are automatically generated. Results show that our simulators are up to an order of magnitude faster than other reported simulators of this type and two orders of magnitude faster than equivalent Verilog simulations.","PeriodicalId":127038,"journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis, 2004. CODES + ISSS 2004.","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115911405","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Design and programming of embedded multiprocessors: an interface-centric approach","authors":"P. V. D. Wolf, E. Kock, T. Henriksson, W. Kruijtzer, G. Essink","doi":"10.1109/CODES+ISSS.2004.17","DOIUrl":"https://doi.org/10.1109/CODES+ISSS.2004.17","url":null,"abstract":"We present design technology for the structured design and programming of embedded multi-processor systems. It comprises a task-level interface that can be used both for developing parallel application models and as a platform interface for implementing applications on multi-processor architectures. Associated mapping technology supports refinement of application models towards implementation. By linking application development and implementation aspects, the technology integrates the specification and design phases in the MPSoC design process. Two design cases demonstrate the efficient implementation of the platform interface on different architectures. Industry-wide standardization of a task-level interface can facilitate reuse of function-specific hardware/software modules across companies.","PeriodicalId":127038,"journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis, 2004. CODES + ISSS 2004.","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121553776","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modeling operation and microarchitecture concurrency for communication architectures with application to retargetable simulation","authors":"Xinping Zhu, W. Qin, S. Malik","doi":"10.1145/1016720.1016738","DOIUrl":"https://doi.org/10.1145/1016720.1016738","url":null,"abstract":"In multiprocessor based SoCs, optimizing the communication architecture is often as important as, if not more than, optimizing the computation architecture. While there are mature platforms and techniques for the modeling and evaluation of computation architectures, the same is not true for the communication architectures. A major challenge in modeling the communication architecture is managing the concurrency at multiple levels: at the operation level, multiple communication operations may be active at any time; at the microarchitecture level, several microarchitectural components may be operating in parallel. Further, it is important to be able to clearly specify how the operation level concurrency maps to the microarchitectural level concurrency. This work presents a modeling methodology and a retargetable simulation framework which fill this gap. This framework seeks to facilitate the design space exploration of the communication sub-system through a rigorous modeling approach based on a formal concurrency model, the operation state machine (OSM). We first introduce the basic notions and concepts of OSM and show by example how this model can be used to represent the inherent concurrency in the architecture and microarchitecture of processors. Then we demonstrate the applicability of OSM in modeling on-chip communication architectures (OCAs) by walking through a router based packet switching network example and a bus example. Due to the fact that the OSM model is naturally suited to handle the operation and microarchitecture level concurrencies of OCAs as well, our OSM-based modeling methodology enables the entire system including both the computation and communication architectures to be modeled in a single OSM framework. This allows us to develop a tool set that can synthesize cycle-accurate system simulators for multi-PE SoC prototypes. To demonstrate the flexibility of this methodology, we choose two distinct system configurations with different types of OCA: a 4×4 mesh network of 16 PEs, and a cluster of 4 PEs connected by a bus. We show that by simulation, critical system information such as timing and communication patterns can be obtained and evaluated. Consequently, system-level design choices regarding the communication architecture can be made with high confidence in early stages of design. In addition to improving design quality, this methodology also results in significantly shortened design-time.","PeriodicalId":127038,"journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis, 2004. CODES + ISSS 2004.","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130963573","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}