{"title":"Contention-conscious transaction ordering in embedded multiprocessors","authors":"M. Khandelia, S. Bhattacharyya","doi":"10.1109/ASAP.2000.862398","DOIUrl":"https://doi.org/10.1109/ASAP.2000.862398","url":null,"abstract":"This paper explores the problem of efficiently ordering interprocessor communication operations in statically-scheduled multiprocessors for iterative dataflow graphs. In most digital signal processing applications, the throughput of the system is significantly affected by communication costs. By explicitly modeling these costs within an effective graph-theoretic analysis framework, we show that ordered transaction schedules can significantly outperform self-timed schedules even when synchronization costs are low. However, we also show that when communication latencies are non-negligible, finding an optimal transaction order given a static schedule is an NP-complete problem, and that this intractability holds both under iterative and non-iterative execution. We develop new heuristics for finding efficient transaction orders, and perform an experimental comparison to gauge the performance of these heuristics.","PeriodicalId":387956,"journal":{"name":"Proceedings IEEE International Conference on Application-Specific Systems, Architectures, and Processors","volume":"179 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126765777","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Block-update parallel processing QRD-RLS algorithm for throughput improvement with low power consumption","authors":"Lijun Gao, K. Parhi","doi":"10.1109/ASAP.2000.862393","DOIUrl":"https://doi.org/10.1109/ASAP.2000.862393","url":null,"abstract":"In this paper, a block-update parallel processing algorithm is proposed for increasing the throughput of the CORDIC-based QRD-RLS filtering with low power consumption. The proposed algorithm employs single-state-update parallel processing, and with this algorithm, the throughput of a block-by-block weight-update QRD-RLS filter can be increased at the cost of linear increase in hardware resource. However, the proposed algorithm does not change the iteration bounds and clock frequency of the QRD-RLS filters. As a result, the functional units need not be pipelined and the power consumption only increases linearly instead of quadratically. Due to non-pipelining and less power consumption, a higher folding factor can be used for a folding transformation and a great reduction in hardware resource can be achieved without exceeding the physical limitation on pipelining level and power density. Therefore, the proposed algorithm can serve as an important stage in designing and mapping a QRD-RLS filter onto physical hardware or computing resources, and thus is better for both ASIC chip design and parallel computing when block-by-block weight-update is applicable.","PeriodicalId":387956,"journal":{"name":"Proceedings IEEE International Conference on Application-Specific Systems, Architectures, and Processors","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127719175","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Architecture of an image rendering co-processor for MPEG-4 systems","authors":"Mladen Berekovic, P. Pirsch, T. Selinger, Kai-Immo Wels, C. Miro, A. Lafage, C. Heer, G. Ghigo","doi":"10.1109/ASAP.2000.862374","DOIUrl":"https://doi.org/10.1109/ASAP.2000.862374","url":null,"abstract":"The TANGRAM VLSI co-processor is intended as a building block for use in system-on-chip (SOC) designs for the versatile MPEG-4 multimedia standard. It is designed to perform the computation intensive final step of MPEG-4 video decoding: compositing of scenes at the display. This includes warping and alpha blending of multiple full-screen video textures in real-time. TANGRAM consists of a RISC control processor and multiple powerful arithmetic units that perform rendering calculations directly in hardware. This hybrid architecture enables adaptation to changes in algorithms or software support for different video-formats. Communication to a host CPU and video decoding hardware is done via the very common PI-bus on-chip interface. TANGRAM directly interfaces with the ITU-R601/656 digital video output. VHDL implementation and synthesis for a 0.35 μm standard-cell library provide an estimate of 100 MHz achievable clock-frequency (worst-case), 52 mm² overall area and 1 Watt power dissipation. TANGRAM has sufficient performance for rendering of MPEG-4 Main Profile@Layer3 scenes (CCIR).","PeriodicalId":387956,"journal":{"name":"Proceedings IEEE International Conference on Application-Specific Systems, Architectures, and Processors","volume":"39 5","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131894500","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Partitioning conditional data flow graphs for embedded system design","authors":"M. Auguin, L. Bianco, Laurent Capella, E. Gresset","doi":"10.1109/ASAP.2000.862404","DOIUrl":"https://doi.org/10.1109/ASAP.2000.862404","url":null,"abstract":"The complexity of embedded applications increases continuously. Integration advances provide a rising range of possibilities for implementing a system on a chip. Designers face the difficult challenge of selecting the right units to implement the application functionalities so that the silicon area is minimized and the time constraints of the application are met. This paper presents an effective method for designing system architectures that operates on a conditional data flow graph, a representation well suited to signal processing applications.","PeriodicalId":387956,"journal":{"name":"Proceedings IEEE International Conference on Application-Specific Systems, Architectures, and Processors","volume":"225 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115659248","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Booth multiplier accepting both a redundant or a non redundant input with no additional delay","authors":"M. Daumas, D. Matula","doi":"10.1109/ASAP.2000.862391","DOIUrl":"https://doi.org/10.1109/ASAP.2000.862391","url":null,"abstract":"Past recoders have added critical path delay for the more frequent case where both inputs are non redundant. Our proposed circuit does not lengthen the time of one multiplication compared to the state-of-the-art encoding, if both inputs are non redundant. We have slightly modified an existing cell to accept a redundant binary number in place of the non redundant number by changing some connections. The recoding operators associated with a high level quantity (the fraction range) all defined in this paper are used to rule out some possibilities as inputs of this newly created cell. We check that the modified cell yields the correct output for the remaining possible inputs.","PeriodicalId":387956,"journal":{"name":"Proceedings IEEE International Conference on Application-Specific Systems, Architectures, and Processors","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125198002","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High level modeling for parallel executions of nested loop algorithms","authors":"E. Deprettere, E. Rijpkema, P. Lieverse, B. Kienhuis","doi":"10.1109/ASAP.2000.862380","DOIUrl":"https://doi.org/10.1109/ASAP.2000.862380","url":null,"abstract":"High level modeling and (quantitative) performance analysis of signal processing systems requires high level models for the applications (algorithms) and the implementations (architecture), a mapping of the former into the latter and a simulator for fast execution of the whole. Signal processing algorithms are very often nested-loop algorithms with a high degree of inherent parallelism. This paper presents, for such applications, suitable application and implementation models, a method to convert a given imperative executable specification to a specification in terms of the application model, a method to map this specification into an architecture specification in terms of the implementation model, and a method to analyze the performance through simulation. The methods and tools are illustrated by means of an example.","PeriodicalId":387956,"journal":{"name":"Proceedings IEEE International Conference on Application-Specific Systems, Architectures, and Processors","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117345814","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A multiplication-free parallel architecture for affine transformation","authors":"Wael Badawy, M. Bayoumi","doi":"10.1109/ASAP.2000.862375","DOIUrl":"https://doi.org/10.1109/ASAP.2000.862375","url":null,"abstract":"This paper presents a novel low power parallel architecture for computing affine transformation (AT). It is based on a new multiplication-free algorithm that employs the inherent algebraic properties of the AT. Low power has been achieved at the algorithmic level by replacing the multiplication with shifting operation, at the architecture level by using parallel computational units, and at the circuit level by using low power cells. The proposed architecture can be used as a computational kernel in object-based video processing. It is compatible with MPEG-4 and VRML standards. The architecture has been prototyped in 0.6 μm CMOS technology with three layers of metal.","PeriodicalId":387956,"journal":{"name":"Proceedings IEEE International Conference on Application-Specific Systems, Architectures, and Processors","volume":"141 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132184367","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Integration of high-performance ASICs into reconfigurable systems providing additional multimedia functionality","authors":"H. Blume, Hans-Martin Blüthgen, C. Henning, Patrick Osterloh","doi":"10.1109/ASAP.2000.862379","DOIUrl":"https://doi.org/10.1109/ASAP.2000.862379","url":null,"abstract":"The computational power of many future multimedia applications is beyond the capabilities of today's multimedia systems. Therefore, the integration of additional high-performance multimedia components is most decisive. This paper presents the integration of multimedia components into computer systems using reconfigurable coprocessor boards. The goal of these reconfigurable platforms which can be adapted to several applications and which include digital signal processors, controlling and memory devices as well as dedicated multimedia ASICs is worked out. On the way to such a platform four ASICs for image and text processing are presented. The integration of these components into a computing system using a CardBus-based coprocessor board is shown.","PeriodicalId":387956,"journal":{"name":"Proceedings IEEE International Conference on Application-Specific Systems, Architectures, and Processors","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115306093","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Control for high-speed PE arrays","authors":"M. Herbordt, Honghai Zhang, Calvin Lin, H. Rao, J. Cravy","doi":"10.1109/ASAP.2000.862395","DOIUrl":"https://doi.org/10.1109/ASAP.2000.862395","url":null,"abstract":"Although arrays of SIMD PEs can be built with very high operating frequencies, problems exist in keeping the array busy. The inherent mismatch between host and array makes it difficult to maintain high array utilization: either the rate of instruction issue is very low or PE data locality is compromised, having the same effect. Our solution is based on an array control unit (ACU) design that expands macro instructions in two stages, first by data tile and then into microinstructions. The expansion itself solves the issue problem; decoupling the expansion modalities maintains data locality. Several issues involving host/ACU interaction need to be resolved to effect this solution.","PeriodicalId":387956,"journal":{"name":"Proceedings IEEE International Conference on Application-Specific Systems, Architectures, and Processors","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126259505","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Tradeoff analysis and architecture design of a hybrid hardware/software sorter","authors":"M. Bednara, O. Beyer, J. Teich, R. Wanka","doi":"10.1109/ASAP.2000.862400","DOIUrl":"https://doi.org/10.1109/ASAP.2000.862400","url":null,"abstract":"Sorting long sequences of keys is a problem that occurs in many different applications. For embedded systems, a uniprocessor software solution is often not applicable due to the low performance, while realizing multiprocessor sorting methods on parallel computers is much too expensive with respect to power consumption, physical weight, and cost. We investigate cost/performance tradeoffs for hybrid sorting algorithms that use a mixture of sequential merge sort and systolic insertion sort techniques. We propose a scalable architecture for integer sorting that consists of a uniprocessor and an FPGA-based parallel systolic co-processor. Speedups obtained analytically and experimentally and depending on hardware (cost) constraints are determined as a function of time constants of the uniprocessor and the co-processor.","PeriodicalId":387956,"journal":{"name":"Proceedings IEEE International Conference on Application-Specific Systems, Architectures, and Processors","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128378219","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}