{"title":"Application design trajectory towards reusable coprocessors - MPEG case study","authors":"M. Rutten, O. P. Gangwal, J. V. Eijndhoven, E. Jaspers, E. Pol","doi":"10.1109/ESTMED.2004.1359699","DOIUrl":"https://doi.org/10.1109/ESTMED.2004.1359699","url":null,"abstract":"This work presents a structured application design trajectory to transform media-processing applications - modeled as Kahn process networks - into a set of function-specific hardware units called coprocessors. The proposed design trajectory focuses on identifying hardware-implementable computation kernels that are common to a predetermined set of applications. The design trajectory is exercised in a case study that maps MPEG video decoding and encoding applications onto a set of coprocessors in a heterogeneous multiprocessor architecture. The resulting set of coprocessors can simultaneously perform both encoding and decoding functions for multiple MPEG-2 streams in an estimated 4 mm² (excluding memory) in 0.18 µm technology.","PeriodicalId":178984,"journal":{"name":"2nd Workshop on Embedded Systems for Real-Time Multimedia, 2004. ESTImedia 2004.","volume":"106 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122552886","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A scalable VLIW for smart imaging","authors":"A. Lundgren, W. Kruijtzer","doi":"10.1109/ESTMED.2004.1359703","DOIUrl":"https://doi.org/10.1109/ESTMED.2004.1359703","url":null,"abstract":"This work presents a VLIW co-processor template for smart imaging. The template is highly scalable and can easily be instantiated for a specific data-level parallelism. The co-processor is built to operate on frame segments instead of full frames only. As a result, eight different instances of the co-processor have been generated, each exploiting a different amount of parallelism. Each instance is generated in about 30 minutes using C-based high-level synthesis tools. The generated co-processors have been evaluated, and the results show that the template can be used effectively to balance area, performance and power consumption with respect to the application requirements.","PeriodicalId":178984,"journal":{"name":"2nd Workshop on Embedded Systems for Real-Time Multimedia, 2004. ESTImedia 2004.","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123109615","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Algebraic techniques in the memory size computation of multimedia processing applications","authors":"Hongwei Zhu, Karthik Chandramouli, Yan Yue, F. Balasa","doi":"10.1109/ESTMED.2004.1359708","DOIUrl":"https://doi.org/10.1109/ESTMED.2004.1359708","url":null,"abstract":"In real-time multimedia processing systems, a very large part of the power consumption is due to data storage and data transfer. Moreover, the area cost is often largely dominated by memories. Hence, optimizing the memory architecture is a crucial step in the design methodology for this type of application. In deriving an optimized memory architecture, memory size computation is an important step in the data transfer and storage exploration stage. This work investigates non-scalar methods for computing the memory size in real-time multimedia algorithms. The approach is based on recent algebraic techniques specific to the data-flow analysis used in modern compilers. In contrast with previous works, which use only approximate methods because of the size of the problems (in terms of number of scalars) and single-assignment specifications, this research aims to obtain exact results even for large applications.","PeriodicalId":178984,"journal":{"name":"2nd Workshop on Embedded Systems for Real-Time Multimedia, 2004. ESTImedia 2004.","volume":"119 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122468087","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Tool-aided performance analysis and optimization of multimedia applications","authors":"H. Hübert, B. Stabernack, H. Richter","doi":"10.1109/ESTMED.2004.1359717","DOIUrl":"https://doi.org/10.1109/ESTMED.2004.1359717","url":null,"abstract":"Current embedded system platforms fulfill the requirements of computation- and data-intensive multimedia applications. However, the software and hardware architecture must be optimized in order to use the available resources efficiently and achieve the required real-time performance. We present an analysis tool that aids the system designer during the optimization process by providing detailed performance and data transfer statistics of the multimedia application. Example optimizations of an H.264 decoder application show how the tool can be used.","PeriodicalId":178984,"journal":{"name":"2nd Workshop on Embedded Systems for Real-Time Multimedia, 2004. ESTImedia 2004.","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131135153","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reducing memory accesses with a system-level design methodology in customized dynamic memory management","authors":"David Atienza Alonso, S. Mamagkakis, F. Catthoor, J. Mendias, D. Soudris","doi":"10.1109/ESTMED.2004.1359716","DOIUrl":"https://doi.org/10.1109/ESTMED.2004.1359716","url":null,"abstract":"Portable consumer embedded devices are steadily increasing their capabilities and can now implement new algorithms (e.g. multimedia and wireless protocols) that a few years ago were reserved for powerful workstations. Unfortunately, the original design characteristics of such applications often do not allow them to be ported directly to current embedded devices. These applications share complex and intensive memory use. Furthermore, they must rely heavily on dynamic memory because of the unpredictability of the input data (e.g. 3D stream features) and of the system behaviour (e.g. the number of concurrently running applications, as defined by the user). Thus they require a dynamic memory subsystem that can provide the necessary level of performance for these new dynamic applications. However, current embedded systems have very limited resources (e.g. speed and power consumed in the memory subsystem) for providing efficient general-purpose dynamic memory management. We propose a new methodology to design custom dynamic memory managers that provide the performance required in new embedded devices by reducing the number of memory accesses needed to handle these new dynamic multimedia and wireless network applications. Our results on real-life dynamic applications show significant improvements (up to 58%) in the memory accesses of dynamic memory managers compared to state-of-the-art dynamic memory management solutions for complex applications.","PeriodicalId":178984,"journal":{"name":"2nd Workshop on Embedded Systems for Real-Time Multimedia, 2004. ESTImedia 2004.","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127849257","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Low energy data and concurrency management of highly dynamic real-time multi-media systems","authors":"F. Catthoor","doi":"10.1109/ESTMED.2004.1359690","DOIUrl":"https://doi.org/10.1109/ESTMED.2004.1359690","url":null,"abstract":"Summary form only given. The merging of the computer, consumer and communication disciplines gives rise to very fast-growing markets for personal communication, multi-media and broadband networks in the information technology (IT) area. Rapid evolution in sub-micron process technology allows ever more complex systems to be integrated on one single platform (system-on-chip). However, technology advances are not matched by an increase in system design productivity. One of the most critical bottlenecks is the very dynamic concurrent behaviour of many of these new applications. They are fully specified in software-oriented languages (like Java, UML, SDL, C++) and still need to be executed in a real-time, cost- and energy-sensitive way on heterogeneous SoC platforms. The main issue is that fully design-time-based solutions, as proposed earlier in compiler and system synthesis, cannot solve the problem, and run-time solutions as present in today's operating systems are too inefficient in terms of cost optimisation (especially energy consumption) and are also not adapted to real-time constraints (even RTOS kernels). This dynamic nature is emerging especially because of the quality-of-service (QoS) aspects of these multi-media and networking applications. Prominent examples of this can be found in the recent MPEG4/JPEG2000 standards and especially the new MPEG21 standard. The emerging Ambient Intelligence and virtual reality paradigms will also stimulate this further. In order to deal with these dynamic issues, where tasks and complex data types are created and deleted at run-time based on non-deterministic events, a novel system design paradigm is required. This presentation will focus on the resulting new requirements for system-level synthesis. In particular, both a \"dynamic data management\" and a \"task concurrency management\" problem formulation will be presented that deal with the very dynamic nature of these systems. The concept of Pareto-curve-based exploration is crucial in these problem formulations and their solutions.","PeriodicalId":178984,"journal":{"name":"2nd Workshop on Embedded Systems for Real-Time Multimedia, 2004. ESTImedia 2004.","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121537630","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A queuing-theoretic performance model for context-flow system-on-chip platforms","authors":"Rami Beidas, Jianwen Zhu","doi":"10.1109/ESTMED.2004.1359697","DOIUrl":"https://doi.org/10.1109/ESTMED.2004.1359697","url":null,"abstract":"Recent studies of networks-on-chip report few analytical performance models that relate performance figures of merit to architectural design decisions, which prevents the development of effective system-level synthesis techniques. We propose an analytical performance model based on queuing theory for a recently reported network-on-chip platform that features an extremely simple programming model while providing superior performance measures compared with alternative architectures. To validate the analytical performance model, we developed a multi-processor simulation framework that can simulate an application at the instruction-set level for a given architecture configuration. The accuracy and applicability of the proposed model are illustrated by two real-life applications, namely an SSL security acceleration processor and an MP3 decoder.","PeriodicalId":178984,"journal":{"name":"2nd Workshop on Embedded Systems for Real-Time Multimedia, 2004. ESTImedia 2004.","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134113195","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A hardware accelerator IP for EBCOT tier-1 coding in JPEG2000 standard","authors":"Tien-Wei Hsieh, Y. Lin","doi":"10.1109/ESTMED.2004.1359713","DOIUrl":"https://doi.org/10.1109/ESTMED.2004.1359713","url":null,"abstract":"We propose a hardware accelerator IP for the Tier-1 portion of Embedded Block Coding with Optimal Truncation (EBCOT) used in the JPEG2000 next-generation image compression standard. EBCOT Tier-1 accounts for more than 70% of encoding time due to extensive bit-level processing. Our architecture consists of a 16-way parallel context formation module and a 3-stage pipelined arithmetic encoder. We reduce power consumption by properly shutting down parts of the circuit. Compared with the best known design, we reduce the cycle count by 17% and reach a level within 5% of the theoretical lower bound. We have implemented the design in synthesizable Verilog RTL with an AMBA-AHB interface for SOC design. FPGA prototyping has been successfully demonstrated and a substantial speedup achieved.","PeriodicalId":178984,"journal":{"name":"2nd Workshop on Embedded Systems for Real-Time Multimedia, 2004. ESTImedia 2004.","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133270889","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High performance visibility testing with screen segmentation","authors":"Péter Szántó, B. Fehér","doi":"10.1109/ESTMED.2004.1359711","DOIUrl":"https://doi.org/10.1109/ESTMED.2004.1359711","url":null,"abstract":"Two factors determine the performance a 3D accelerator can achieve: the available computational power and the available memory bandwidth. In embedded systems, these resources are even more limited than in desktop environments, so the efficiency of the hardware architecture and the exploitation of the logic resources become even more important. Most resources are wasted in the visibility testing process: traditional implementations require a lot of bandwidth and process pixels that are not visible in the final image. By segmenting the screen, the presented architecture can use high-performance on-chip buffers to lower memory requirements and to provide high performance. The processing order guarantees that only the colors that are truly visible are computed. The modular architecture allows different requirements to be satisfied: a trade-off can be made between the number of processing units and performance.","PeriodicalId":178984,"journal":{"name":"2nd Workshop on Embedded Systems for Real-Time Multimedia, 2004. ESTImedia 2004.","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132284932","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data assignment and access scheduling exploration for multi-layer memory architectures","authors":"R. Szymanek, F. Catthoor, K. Kuchcinski","doi":"10.1109/ESTMED.2004.1359707","DOIUrl":"https://doi.org/10.1109/ESTMED.2004.1359707","url":null,"abstract":"This work presents an exploration framework that performs data assignment and access scheduling exploration for applications on a given multilayer memory architecture. Our framework uses multiobjective criteria during exploration, such as application execution time, energy, bandwidth, and data size. To tackle the complexity of the exploration, it is divided into three phases: Pareto diagram composition, data assignment, and access scheduling. The first phase produces multidimensional Pareto points for our application. After this phase, our framework produces distinct data assignments, which are represented as Pareto points in a two-dimensional space defined by bandwidth requirements and size requirements. Finally, the scheduling phase finds a possibly optimal order of the tasks and performs precise scheduling of the tasks. Three feedback paths are present, which can be used to iteratively improve exploration results. It is possible to trade off the quality of the results against the algorithm runtime. We have evaluated our framework on a medical image processing application. We have shown that our algorithms can explore the huge design space in an iterative manner and obtain good Pareto diagram coverage.","PeriodicalId":178984,"journal":{"name":"2nd Workshop on Embedded Systems for Real-Time Multimedia, 2004. ESTImedia 2004.","volume":"78 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121117701","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}