{"title":"A model-based methodology for application specific energy efficient data path design using FPGAs","authors":"Sumit Mohanty, S. Choi, Ju-wook Jang, V. Prasanna","doi":"10.1109/ASAP.2002.1030706","DOIUrl":"https://doi.org/10.1109/ASAP.2002.1030706","url":null,"abstract":"Presents a methodology to design energy-efficient data paths using FPGAs. Our methodology integrates domain specific modeling, coarse-grained performance evaluation, design space exploration, and low level simulation to understand the tradeoffs between energy, latency, and area. The domain specific modeling technique defines a high-level model by identifying various components and parameters specific to a domain that affect the system-wide energy dissipation. A domain is a family of architectures and corresponding algorithms for a given application kernel. The high-level model also consists of functions for estimating energy, latency, and area that facilitate tradeoff analysis. Design space exploration (DSE) analyzes the design space defined by the domain and selects a set of designs. Low-level simulations are used for accurate performance estimation for the designs selected by the DSE and also for final design selection. We illustrate our methodology using a family of architectures and algorithms for matrix multiplication. The designs identified by our methodology demonstrate tradeoffs among energy, latency, and area.","PeriodicalId":424082,"journal":{"name":"Proceedings IEEE International Conference on Application- Specific Systems, Architectures, and Processors","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127702499","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High-radix logarithm with selection by rounding","authors":"José-Alejandro Piñeiro, M. Ercegovac, J. Bruguera","doi":"10.1109/ASAP.2002.1030708","DOIUrl":"https://doi.org/10.1109/ASAP.2002.1030708","url":null,"abstract":"A high-radix digit-recurrence algorithm or the computation of the logarithm is presented in this paper. Selection by rounding is used in iterations j/spl ges/2, and selection by table in the first iteration is combined with a restricted digit-set for the second one, in order to guarantee the convergence of the algorithm. A sequential architecture is proposed. and the execution time and hardware requirements of this architecture are estimated, for a target precision of n=32 bits and a radix r=256. These estimates are obtained according to a rough model for the delay and area cost of the main logic blocks employed, and show the achievement of a speed-up by over 4 times with regard to a conventional radix-2 implementation with redundant arithmetic.","PeriodicalId":424082,"journal":{"name":"Proceedings IEEE International Conference on Application- Specific Systems, Architectures, and Processors","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127335490","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Implications of programmable general purpose processors for compression/encryption applications","authors":"Byeong Kil Lee, L. John","doi":"10.1109/ASAP.2002.1030722","DOIUrl":"https://doi.org/10.1109/ASAP.2002.1030722","url":null,"abstract":"With the growth of the Internet and mobile communication industry, multimedia applications form a dominant computer workload. Media workloads are typically executed on Application Specific Integrated Circuits (ASICs), application specific processors (ASPs) or general purpose processors (GPPs). GPPs are flexible and allow changes in the applications and algorithms better than ASICs and ASPs. However, executing these applications on GPPs is done at a high cost. In this paper, we analyze media compression/decompression algorithms from the perspective of the overhead of executing them on a programmable general purpose processor versus ASPs. We choose nine encode/decode programs from audio, image/video andencryption applications. The instruction mix, memory access and parallelism aspects during the execution of these programs are analyzed. Memory access latency is observed to be the main factor influencing the execution time on general purpose processors. Most of these compression/decompression algorithms involve processing the data through execution phases (e.g. quantization, encoding, etc) and temporary results are stored and retrieved between these phases. A metric called overhead memory-access bandwidth per input/output byte is defined to characterize the temporary memory activity of each application. We observe that more than 90% of the memory accesses made by these programs are temporary data stores and loads arising from the general purpose nature of the execution platform. We also study the data parallelism in these applications, indicating the ability of instruction level and data level parallel processors to exploit the parallelism in these applications. The parallelism ranges from 6 to 529 in encode processes and 18 to 558 in decode processes.","PeriodicalId":424082,"journal":{"name":"Proceedings IEEE International Conference on Application- Specific Systems, Architectures, and Processors","volume":"132 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126913016","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient conversion from binary to multi-digit multi-dimensional logarithmic number systems using arrays of range addressable look-up tables","authors":"R. Muscedere, V. Dimitrov, G. Jullien, W. Miller","doi":"10.1109/ASAP.2002.1030711","DOIUrl":"https://doi.org/10.1109/ASAP.2002.1030711","url":null,"abstract":"The multi-dimensional logarithmic number system (MDLNS), with similar properties to the logarithmic number system (LNS), provides more degrees of freedom than the LNS by virtue of having two orthogonal bases and the ability to use multiple digits. Unlike the LNS, there is no direct functional relationship between binary/floating point representation and the MDLNS representation. Traditionally look-up tables (LUTs) were used to move from the binary domain to the MDLNS domain. This method can be unrealistic for hardware implementation when large binary ranges or multiple digits are used. This paper introduces a range addressable technique for table look-up arrays that allows efficient conversion from binary to single or multi-digit MDLNS.","PeriodicalId":424082,"journal":{"name":"Proceedings IEEE International Conference on Application- Specific Systems, Architectures, and Processors","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122524338","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Refining instruction set architecture for high-performance multimedia processing in constrained environments","authors":"R. Lee, A. M. Fiskiran, Z. Shi, Xiao Yang","doi":"10.1109/ASAP.2002.1030724","DOIUrl":"https://doi.org/10.1109/ASAP.2002.1030724","url":null,"abstract":"Multimedia processing in software has been significantly accelerated by the addition of subword-parallel instructions to the instruction set architectures (ISAs) of modem microprocessors. While some of these multimedia instructions are simple and effective, others are very complex, requiring large, special-purpose functional units that are not practical for constrained environments such as handheld multimedia information appliances. For such environments, low-power and low-cost are as important as the high performance required for real-time multimedia processing and the general-purpose programmability required to support an ever growing range of applications. In this paper, we introduce PLX, a concise ISA that selects the most useful features from the first two generations of multimedia instructions added to microprocessors, and explores new ISA features for high-performance yet low-cost multimedia processing with small footprint processors. PLX is unique in that it is designed from scratch as a fully subword-parallel architecture with novel features like datapath scalability from 32-bit to 128-bit words, and a new definition of predication for reducing conditional branches. We illustrate the use of PLX's architectural features with four frequently used multimedia kernels: discrete cosine transform, pixel padding, clip test and median filter. Our performance results show that a 64-bit PLX implementation achieves significant speedups compared to a basic 64-bit RISC processor and to IA-32 processors with MMX and SSE multimedia extensions. PLX's datapath scalability feature often provides an additional 2x speedup in a cost-effective way.","PeriodicalId":424082,"journal":{"name":"Proceedings IEEE International Conference on Application- Specific Systems, Architectures, and Processors","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133635247","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A VLSI architecture for object recognition using tree matching","authors":"K. Sitaraman, N. Ranganathan, A. Ejnioui","doi":"10.1109/ASAP.2002.1030731","DOIUrl":"https://doi.org/10.1109/ASAP.2002.1030731","url":null,"abstract":"The problem of tree pattern matching for object recognition in images is computationally intensive in nature. In two-dimensional images, the objects can be represented through multiscale decomposition as tree structures. The pattern tree representing an object can be matched with a subject tree representing an image in order to detect the objects within the image. In this paper, we describe a new systolic algorithm and its realization as a VLSI chip for tree pattern matching. The hardware algorithm is based on a linear array of processing elements (PEs) where the pattern matching is done in a pipelined fashion relying on nearest-neighbor communication between the PEs and the subject and pattern trees of arbitrary length can be processed using a fixed size PE array. The algorithm has an improved execution time of O(/spl lceil/m/a/spl rceil/n) required to perform the matching where in, a and n are the sizes of the pattern tree, processor array, subject tree respectively. A prototype CMOS VLSI chip implementing the proposed algorithm has been designed and verified It is shown that the hardware algorithm proposed in this work represent a significant improvement in terms of computational complexity, data flow, and architecture over the ones previously proposed for this problem.","PeriodicalId":424082,"journal":{"name":"Proceedings IEEE International Conference on Application- Specific Systems, Architectures, and Processors","volume":"85 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131012848","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
G. Bertoni, L. Breveglieri, I. Koren, P. Maistri, V. Piuri
{"title":"On the propagation of faults and their detection in a hardware implementation of the Advanced Encryption Standard","authors":"G. Bertoni, L. Breveglieri, I. Koren, P. Maistri, V. Piuri","doi":"10.1109/ASAP.2002.1030729","DOIUrl":"https://doi.org/10.1109/ASAP.2002.1030729","url":null,"abstract":"High reliability is a desirable property of any implementation of the Advanced Encryption Standard (AES). To achieve high reliability, all possible faults must be detected to avoid the use and transmission of erroneous encrypted/decrypted data. In this paper we first study the behavior of faults which may occur during the encryption and decryption procedures of AES, and the way such faults eventually propagate to the final result. We then describe an appropriate detection technique for these faults. This work extends our preliminary results (G. Bertoni et al, MPCS 2002) by considering more general fault models (e.g., permanent and multiple transient faults), and the possibility of fault masking.","PeriodicalId":424082,"journal":{"name":"Proceedings IEEE International Conference on Application- Specific Systems, Architectures, and Processors","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116296504","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A component architecture for FPGA-based, DSP system design","authors":"G. Spivey, S. Bhattacharyya, K. Nakajima","doi":"10.1109/ASAP.2002.1030703","DOIUrl":"https://doi.org/10.1109/ASAP.2002.1030703","url":null,"abstract":"Introducing FPGA components into DSP system implementations creates an assortment of challenges across system architecture and logic design. Recognizing that some of the greatest challenges occur in the integration of the various components, we have developed a component architecture and an associated set of software tools, collectively called the Logic Foundry. Using the Logic Foundry, an FPGA-based DSP system can be easily constructed from pre-built components and implemented on a variety of back-end FPGA platforms. The resulting implementation can then be encapsulated and integrated into a variety of front-end software application environments. This paper develops the component architecture and integration capabilities of the Logic Foundry, and examines a number of application case studies that we have experimented with using the Logic Foundry.","PeriodicalId":424082,"journal":{"name":"Proceedings IEEE International Conference on Application- Specific Systems, Architectures, and Processors","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123790398","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Design and evaluation of a multimedia computing architecture based on a 3D graphics pipeline","authors":"C. Y. Chung, R. Managuli, Yongmin Kim","doi":"10.1109/ASAP.2002.1030723","DOIUrl":"https://doi.org/10.1109/ASAP.2002.1030723","url":null,"abstract":"With the innovation and integration of media objects in multimedia applications, the importance of architectural support for different types of media objects, e.g., image, video and graphics, in one platform has significantly increased. While several approaches based on vector or VLIW (very long instruction word) architectures, e.g., Vector-IRAM and Imagine, have been pursued, they are not as effective as dedicated graphics pipelines for high-performance 3D graphics. We have explored a new programmable computing architecture based on a 3D graphics pipeline, which utilizes dedicated hardware resources in the 3D graphics pipeline for other types of multimedia computing. Adding programmable flexibility to a graphics pipeline for texture mapping has proven to be effective, e.g., pixel shader. However, due to the diversity of imaging and video processing applications, there are several challenges associated with converting a fixed graphics pipeline to a flexible multimedia computing engine. In this paper, we identify the additional architectural requirements, introduce the proposed architecture with extension details, and present the results of the performance evaluation. With cycle-accurate simulation of several benchmark functions, we have verified that the proposed architecture outperforms a modem powerful media processor in imaging and video processing by a factor of 1.3 to 7.5. The 3D graphics performance would not change much because the additional pipeline stages for the extension result in longer pipeline latency but similar throughout.","PeriodicalId":424082,"journal":{"name":"Proceedings IEEE International Conference on Application- Specific Systems, Architectures, and Processors","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132272781","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A combined interval and floating-point comparator/selector","authors":"A. Akkas","doi":"10.1109/ASAP.2002.1030720","DOIUrl":"https://doi.org/10.1109/ASAP.2002.1030720","url":null,"abstract":"Interval arithmetic provides a robust method for automatically monitoring numerical errors and can be used to solve problems that cannot be efficiently solved with floating-point arithmetic. This paper presents the design and implementation of a combined interval and floating-point comparator/selector, which performs interval intersection, hull, mignitude, magnitude, minimum, maximum, and comparisons, as well as floating-point minimum, maximum and comparisons. Area and delay estimates indicate that the combined interval and floating-point comparator/selector has 98% more area and a worst case delay that is 42% greater than a conventional floating point comparator/selector. The combined interval and floating-point comparator/selector greatly improves the performance of interval selection operations.","PeriodicalId":424082,"journal":{"name":"Proceedings IEEE International Conference on Application- Specific Systems, Architectures, and Processors","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132766188","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}