{"title":"Fast cosimulation of transformative systems with OS support on SMP computer","authors":"Zhengting He, A. Mok","doi":"10.1145/1016720.1016761","DOIUrl":"https://doi.org/10.1145/1016720.1016761","url":null,"abstract":"Transformative applications are a class of dataflow computation characterized by iterative behavior. The problem of partitioning a transformative application specification to a set of available hardware (HW) and software (SW) processing elements (PEs) and derivation of a job execution order (scheduling) on them has been quite well studied, but the problem of obtaining fast simulation of these applications poses different constraints. In this paper, we propose an efficient framework for a symmetric multi-processor (SMP) simulation host to achieve fast HW/SW co-simulation for transformative applications, given the partition solutions and the derived schedulers. The framework overcomes the limitations in existing Linux SMP kernel and requires only a reasonable amount of modifications to it. We also present a heuristic algorithm which effectively assigns simulation tasks to the processors on the simulation host, considering both average job simulation time on each processor and other simulation overhead. Our experiments show that the algorithm is able to find satisfactory suboptimal solutions with very little computation time. Based on the task assignment solution, the simulation time can be reduced by 25% to 50% from the obvious but naive approach.","PeriodicalId":127038,"journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis, 2004. CODES + ISSS 2004.","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123328766","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
F. Rivera, M. Sanchez-Elez, M. Fernandez, R. Hermida, N. Bagherzadeh
{"title":"Efficient mapping of hierarchical trees on coarse-grain reconfigurable architectures","authors":"F. Rivera, M. Sanchez-Elez, M. Fernandez, R. Hermida, N. Bagherzadeh","doi":"10.1145/1016720.1016731","DOIUrl":"https://doi.org/10.1145/1016720.1016731","url":null,"abstract":"Reconfigurable architectures have become increasingly important in years. We present an approach to the problem of executing 3D graphics interactive applications onto these architectures. The hierarchical trees are usually implemented to reduce the data processed, thereby diminishing the execution time. We have developed a mapping scheme that parallelizes the tree execution onto a SIMD reconfigurable architecture. This mapping scheme considerably reduces the time penalty caused by the possibility of executing different tree nodes in SIMD fashion. We have developed a technique that achieves an efficient hierarchical tree execution taking decisions at execution time. It also promotes the possibility of data coherence in order to reduce the execution time. The experimental results show high performance and efficient resource utilization on tested applications.","PeriodicalId":127038,"journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis, 2004. CODES + ISSS 2004.","volume":"22 2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114015214","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Detecting overflow detection","authors":"V. Kotlyar, M. Moudgill","doi":"10.1145/1016720.1016732","DOIUrl":"https://doi.org/10.1145/1016720.1016732","url":null,"abstract":"Fixed-point saturating arithmetic is widely used in media and digital signal processing applications. A number of processor architectures provide instructions that implement saturating operations. However, standard high-level languages, such as ANSI C, provide no direct support for saturating arithmetic. Applications written in standard languages have to implement saturating operations in terms of basic two's complement operations. In order to provide fast execution of such programs it is important to have an optimizing compiler automatically detect and convert appropriate code fragments to hardware instructions. We present a set of techniques for automatic recognition of saturating arithmetic operations. We show that in most cases the recognition problem is simply one of Boolean circuit equivalence. Given the expense of solving circuit equivalence, we develop a set of practical approximations based on abstract interpretation. Experiments show that our techniques, while reliably recognizing saturating arithmetic, have small compile-time overhead. We also demonstrate that our approach is not limited to saturating arithmetic, but is directly applicable to recognizing other idioms, such as add-with-carry and absolute value.","PeriodicalId":127038,"journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis, 2004. CODES + ISSS 2004.","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130265077","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic overlay of scratchpad memory for energy minimization","authors":"Manish Verma, L. Wehmeyer, P. Marwedel","doi":"10.1109/CODES+ISSS.2004.20","DOIUrl":"https://doi.org/10.1109/CODES+ISSS.2004.20","url":null,"abstract":"The memory subsystem accounts for a significant portion of the aggregate energy budget of contemporary embedded systems. Moreover, there exists a large potential for optimizing the energy consumption of the memory subsystem. Consequently, novel memories as well as novel algorithms for their efficient utilization are being designed. Scratchpads are known to perform better than caches in terms of power, performance, area and predictability. However, unlike caches they depend upon software allocation techniques for their utilization. We present an allocation technique which analyzes the application and inserts instructions to dynamically copy both code segments and variables onto the scratchpad at runtime. We demonstrate that the problem of dynamically overlaying scratchpad is an extension of the global register allocation problem. The overlay problem is solved optimally using ILP formulation techniques. Our approach improves upon the only previously known allocation technique for statically allocating both variables and code segments onto the scratchpad. Experiments report an average reduction of 34% and 18% in the energy consumption and the runtime of the applications, respectively. A minimal increase in code size is also reported.","PeriodicalId":127038,"journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis, 2004. CODES + ISSS 2004.","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134521550","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
S. Weber, M. Moskewicz, M. Gries, C. Sauer, K. Keutzer
{"title":"Fast cycle-accurate simulation and instruction set generation for constraint-based descriptions of programmable architectures","authors":"S. Weber, M. Moskewicz, M. Gries, C. Sauer, K. Keutzer","doi":"10.1145/1016720.1016728","DOIUrl":"https://doi.org/10.1145/1016720.1016728","url":null,"abstract":"State-of-the-art architecture description languages have been successfully used to model application-specific programmable architectures limited to particular control schemes. We introduce a language and methodology that provide a framework for constructing and simulating a wider range of architectures. The framework exploits the fact that designers are often only concerned with data paths, not the instruction set and control. In the framework, each processing element is described in a structural language that only requires the specification of the data path and constraints on how it can be used. From such a description, the supported operations of the processing clement are automatically extracted and a controller is generated. Various architectures are then realized by composing the processing elements. Furthermore, hardware descriptions and bit-true cycle-accurate simulators are automatically generated. Results show that our simulators are up to an order of magnitude faster than other reported simulators of this type and two orders of magnitude faster than equivalent Verilog simulations.","PeriodicalId":127038,"journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis, 2004. CODES + ISSS 2004.","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115911405","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
P. V. D. Wolf, E. Kock, T. Henriksson, W. Kruijtzer, G. Essink
{"title":"Design and programming of embedded multiprocessors: an interface-centric approach","authors":"P. V. D. Wolf, E. Kock, T. Henriksson, W. Kruijtzer, G. Essink","doi":"10.1109/CODES+ISSS.2004.17","DOIUrl":"https://doi.org/10.1109/CODES+ISSS.2004.17","url":null,"abstract":"We present design technology for the structured design and programming of embedded multi-processor systems. It comprises a task-level interface that can be used both for developing parallel application models and as a platform interface for implementing applications on multi-processor architectures. Associated mapping technology supports refinement of application models towards implementation. By linking application development and implementation aspects, the technology integrates the specification and design phases in the MPSoC design process. Two design cases demonstrate the efficient implementation of the platform interface on different architectures. Industry-wide standardization of a task-level interface can facilitate reuse of function-specific hardware/software modules across companies.","PeriodicalId":127038,"journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis, 2004. CODES + ISSS 2004.","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121553776","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modeling operation and microarchitecture concurrency for communication architectures with application to retargetable simulation","authors":"Xinping Zhu, W. Qin, S. Malik","doi":"10.1145/1016720.1016738","DOIUrl":"https://doi.org/10.1145/1016720.1016738","url":null,"abstract":"In multiprocessor based SoCs, optimizing the communication architecture is often as important as, if not more than, optimizing the computation architecture. While there are mature platforms and techniques for the modeling and evaluation of computation architectures, the same is not true for the communication architectures. A major challenge in modeling the communication architecture is managing the concurrency at multiple levels: at the operation level, multiple communication operations may be active at any time; at the microarchitecture level, several microarchitectural components may be operating in parallel. Further, it is important to be able to clearly specify how the operation level concurrency maps to the microarchitectural level concurrency. This work presents a modeling methodology and a retargetable simulation framework which fill this gap. This framework seeks to facilitate the design space exploration of the communication sub-system through a rigorous modeling approach based on a formal concurrency model, the operation state machine (OSM). We first introduce the basic notions and concepts of OSM and show by example how this model can be used to represent the inherent concurrency in the architecture and microarchitecture of processors. Then we demonstrate the applicability of OSM in modeling on-chip communication architectures (OCAs) by walking though a router based packet switching network example and a bus example. Due to the fact that the OSM model is naturally suited to handle the operation and microarchitecture level concurrencies of OCAs as well, our OSM-based modeling methodology enables the entire system including both the computation and communication architectures to be modeled in a single OSM framework. This allows us to develop a tool set that can synthesize cycle-accurate system simulators for multi-PE SoC prototypes. To demonstrate the flexibility of this methodology, we choose two distinct system configurations with different types of OCA: a 4/spl times/4 mesh network of 16 PEs, and a cluster of 4 PEs connected by a bus. We show that by simulation, critical system information such as timing and communication patterns can be obtained and evaluated. Consequently, system-level design choices regarding the communication architecture can be made with high confidence in early stages of design. In addition to improving design quality, this methodology also results in significantly shortened design-time.","PeriodicalId":127038,"journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis, 2004. CODES + ISSS 2004.","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130963573","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Analytical models for leakage power estimation of memory array structures","authors":"M. Mamidipaka, K. Khouri, N. Dutt, M. Abadir","doi":"10.1145/1016720.1016757","DOIUrl":"https://doi.org/10.1145/1016720.1016757","url":null,"abstract":"There is a growing need for accurate power models at the system level. Memory structures such as caches, branch target buffers (BTBs), and register files occupy significant area in contemporary SoC designs and are the main contributors to system leakage power dissipation. Existing models for leakage power estimation in array structures typically use coefficients derived from elaborate SPICE simulations. However, these methodologies are not applicable to array designs in a newer technology, that require power estimates early in the design cycle. In this paper, we propose analytical models for array structures that are based only on high level design parameters. Assuming typical circuit implementation styles, we identify the transistors that contribute to the leakage power in each array sub-circuit and develop models as a function of the operation (read/write/idle) on the array and organizational parameters of the array. The developed models are validated by comparing their estimates against the leakage power measured using SPICE simulations on industrial array designs belonging to the e500 processor core. The comparison shows that the models are accurate with an error margin of less than 21.5% and thus can be used in high-level power-performance exploration. Interestingly, in array designs with dual threshold voltage technology, we observed that contrary to the general expectation, the array memory core contributes to just 9% and the address decoder contributes to as much as 62% of the total leakage power.","PeriodicalId":127038,"journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis, 2004. CODES + ISSS 2004.","volume":"131 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114645634","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A timing-accurate HW/SW cosimulation of an ISS with SystemC","authors":"L. Formaggio, F. Fummi, G. Pravadelli","doi":"10.1145/1016720.1016759","DOIUrl":"https://doi.org/10.1145/1016720.1016759","url":null,"abstract":"The paper presents a system level co-simulation methodology for modeling, validating, and analyzing the performance of embedded systems. The proposed solution relies on the integration between an instruction set simulator (ISS) and the SystemC simulation kernel. In this way, the ISS is used to abstract the model of the real programmable device where the SW should run, while SystemC is used to model HW components that interact with the SW. A correct validation of such an architecture is infeasible without taking care of timing information. Thus, the paper proposes an effective timing synchronization mechanism, which uses timing information of an ISS (or a board) to synchronize the SystemC simulation.","PeriodicalId":127038,"journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis, 2004. CODES + ISSS 2004.","volume":"509 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115889979","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Compiler-directed code restructuring for reducing data TLB energy","authors":"M. Kandemir, I. Kadayif, G. Chen","doi":"10.1145/1016720.1016747","DOIUrl":"https://doi.org/10.1145/1016720.1016747","url":null,"abstract":"Prior work on TLB power optimization considered circuit and architectural techniques. A recent software-based technique for data TLBs has considered the possibility of storing the frequently used virtual-to-physical address translations in a set of translation registers (TRs), and using them when necessary instead of going to the data TLB. This work presents a compiler-based strategy for increasing the effectiveness of TRs. The idea is to restructure the application code in such a fashion that once a TR is loaded, its contents are reused as much as possible. Our experimental evaluation with six array-based benchmarks from the Spec2000 suite indicates that the proposed TR reuse strategy brings significant reductions in data TLB energy over an alternate strategy that employs TRs but does not restructure the code for TR reuse.","PeriodicalId":127038,"journal":{"name":"International Conference on Hardware/Software Codesign and System Synthesis, 2004. CODES + ISSS 2004.","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123624966","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}