{"title":"A programmable vertex shader with fixed-point SIMD datapath for low power wireless applications","authors":"Ju-Ho Sohn, Ramchan Woo, H. Yoo","doi":"10.1145/1058129.1058144","DOIUrl":"https://doi.org/10.1145/1058129.1058144","url":null,"abstract":"The real time 3D graphics becomes one of the attractive applications for 3G wireless terminals although their battery lifetime and memory bandwidth limit the system resources for graphics processing. Instead of using the dedicated hardware engine with complex functions, we propose an efficient hardware architecture of low power vertex shader with programmability. Our architecture includes the following three features: I) a fixed-point SIMD datapath to exploit parallelism in vertex processing while keeping the power consumption low, II) a multithreaded coprocessor interface to decrease unwanted stalls between the main processor and the vertex shader, reducing power consumption by instruction-level power management, III) a programmable vertex engine to increases the datapath throughput by concurrent operations with main processor. Simulation results show that full 3D geometry pipeline can be performed at 7.2M vertices/sec with 115mW power consumption for polygons using the OpenGL lighting model. The improvement is about 10 times greater than that of the latest graphics core with floating-point datapath for wireless applications in terms of processing speed normalized by power consumption, Kvertices/sec per milliwatt.","PeriodicalId":266180,"journal":{"name":"Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115778770","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient partitioning of fragment shaders for multiple-output hardware","authors":"Theresa Foley, M. Houston, P. Hanrahan","doi":"10.1145/1058129.1058136","DOIUrl":"https://doi.org/10.1145/1058129.1058136","url":null,"abstract":"Partitioning fragment shaders into multiple rendering passes is an effective technique for virtualizing shading resource limits in graphics hardware. The Recursive Dominator Split (RDS) algorithm is a polynomial-time algorithm for partitioning fragment shaders for real-time rendering that has been shown to generate efficient partitions. RDS does not, however, work for shaders with multiple outputs, and does not optimize for hardware with support for multiple render targets.We present Merging Recursive Dominator Split (MRDS), an extension of the RDS algorithm to shaders with arbitrary numbers of outputs which can efficiently utilize hardware support for multiple render targets, as well as a new cost metric for evaluating the quality of multipass partitions on modern consumer graphics hardware. We demonstrate that partitions generated by our algorithm execute more efficiently than those generated by RDS alone, and that our cost model is effective in predicting the relative performance of multipass partitions.","PeriodicalId":266180,"journal":{"name":"Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129626081","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Squeeze: numerical-precision-optimized volume rendering","authors":"I. Bitter, N. Neophytou, K. Mueller, A. Kaufman","doi":"10.1145/1058129.1058133","DOIUrl":"https://doi.org/10.1145/1058129.1058133","url":null,"abstract":"This paper discusses how to squeeze volume rendering into as few bits per operation as possible while still retaining excellent image quality. For each of the typical volume rendering pipeline stages in texture map volume rendering, ray casting and splatting we provide a quantitative analysis of the theoretical and practical limits for the required bit precision for computation and storage. Applying this analysis to any volume rendering implementation can balance the internal precisions based on the desired final output precision and can result in significant speedups and reduced memory footprint.","PeriodicalId":266180,"journal":{"name":"Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122774234","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A flexible simulation framework for graphics architectures","authors":"J. Sheaffer, D. Luebke, K. Skadron","doi":"10.1145/1058129.1058142","DOIUrl":"https://doi.org/10.1145/1058129.1058142","url":null,"abstract":"In this paper we describe a multipurpose tool for analysis of the performance characteristics of computer graphics hardware and software. We are developing Qsilver, a highly configurable micro-architectural simulator of the GPU that uses the Chromium system's ability to intercept and redirect an OpenGL stream. The simulator produces an annotated trace of graphics commands using Chromium, then runs the trace through a cycle-timer model to evaluate time-dependent behaviors of the varios functional units. We demonstrate the use of Qsilver on a simple hypothetical architecture to analyze performance bottlenecks, to explore new GPU microarchitectures, and to model power and leakage properties. One innovation we explore is the use of dynamic voltage scaling across multiple clock domains to achieve significant energy savings at almost negligible performance cost. Finally, we discuss how other architectural features and experiments might be incorporated into the Qsilver framework.","PeriodicalId":266180,"journal":{"name":"Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129996339","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"UberFlow: a GPU-based particle engine","authors":"P. Kipfer, Mark E. Segal, R. Westermann","doi":"10.1145/1058129.1058146","DOIUrl":"https://doi.org/10.1145/1058129.1058146","url":null,"abstract":"We present a system for real-time animation and rendering of large particle sets using GPU computation and memory objects in OpenGL. Memory objects can be used both as containers for geometry data stored on the graphics card and as render targets, providing an effective means for the manipulation and rendering of particle data on the GPU.To fully take advantage of this mechanism, efficient GPU realizations of algorithms used to perform particle manipulation are essential. Our system implements a versatile particle engine, including inter-particle collisions and visibility sorting. By combining memory objects with floating-point fragment programs, we have implemented a particle engine that entirely avoids the transfer of particle data at run-time. Our system can be seen as a forerunner of a new class of graphics algorithms, exploiting memory objects or similar concepts on upcoming graphics hardware to avoid bus bandwidth becoming the major performance bottleneck.","PeriodicalId":266180,"journal":{"name":"Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128981368","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Understanding the efficiency of GPU algorithms for matrix-matrix multiplication","authors":"K. Fatahalian, J. Sugerman, P. Hanrahan","doi":"10.1145/1058129.1058148","DOIUrl":"https://doi.org/10.1145/1058129.1058148","url":null,"abstract":"Utilizing graphics hardware for general purpose numerical computations has become a topic of considerable interest. The implementation of streaming algorithms, typified by highly parallel computations with little reuse of input data, has been widely explored on GPUs. We relax the streaming model's constraint on input reuse and perform an in-depth analysis of dense matrix-matrix multiplication, which reuses each element of input matrices O(n) times. Its regular data access pattern and highly parallel computational requirements suggest matrix-matrix multiplication as an obvious candidate for efficient evaluation on GPUs but, surprisingly we find even near-optimal GPU implementations are pronouncedly less efficient than current cache-aware CPU approaches. We find the key cause of this inefficiency is that the GPU can fetch less data and yet execute more arithmetic operations per clock than the CPU when both are operating out of their closest caches. The lack of high bandwidth access to cached data will impair the performance of GPU implementations of any computation featuring significant input reuse.","PeriodicalId":266180,"journal":{"name":"Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128864517","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jörg Schmittler, Sven Woop, Daniel Wagner, Wolfgang J. Paul, Philipp Slusallek
{"title":"Realtime ray tracing of dynamic scenes on an FPGA chip","authors":"Jörg Schmittler, Sven Woop, Daniel Wagner, Wolfgang J. Paul, Philipp Slusallek","doi":"10.1145/1058129.1058143","DOIUrl":"https://doi.org/10.1145/1058129.1058143","url":null,"abstract":"Realtime ray tracing has recently established itself as a possible alternative to the current rasterization approach for interactive 3D graphics. However, the performance of existing software implementations is still severely limited by today's CPUs, requiring many CPUs for achieving realtime performance.In this paper we present a prototype implementation of the full ray tracing pipeline on a single FPGA chip. Running at only 90 MHz it achieves realtime frame rates of 20 to 60 frames per second over a wide range of 3D scenes and includes support for texturing, multiple light sources, and multiple levels of reflection or transparency. A particular interesting feature of the design in the re-use of the transformation unit necessary for supporting dynamic scenes also for other tasks, including efficient ray-triangle intersection as well as shading computations. Despite the additional support for dynamic scenes this approach reduces the overall hardware cost by 68%.We evaluate the design and its implementation across a wide set of example scenes and demonstrate the benefits of dedicated realtime ray tracing hardware.","PeriodicalId":266180,"journal":{"name":"Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134045328","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Silhouette maps for improved texture magnification","authors":"P. Sen","doi":"10.1145/1058129.1058139","DOIUrl":"https://doi.org/10.1145/1058129.1058139","url":null,"abstract":"Texture mapping is a simple way of increasing visual realism without adding geometrical complexity. Because it is a discrete process, it is important to properly filter samples when the sampling rate of the texture differs from that of the final image. This is particularly problematic when the texture is magnified or minified. While reasonable approaches exist to tackle the minified case, few options exist for improving the quality of magnified textures in real-time applications. Most simply bilinearly interpolate between samples, yielding exceedingly blurry textures. In this paper, we address the real-time magnification problem by extending the silhouette map algorithm to general texturing. In particular, we discuss the creation of these silmap textures as well as a simple filtering scheme that allows for viewing at all levels of magnification. The technique was implemented on current graphics hardware and our results show that we can achieve a level of visual quality comparable to that of a much larger texture.","PeriodicalId":266180,"journal":{"name":"Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129622378","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Tile-based texture mapping on graphics hardware","authors":"Li-Yi Wei","doi":"10.1145/1058129.1058138","DOIUrl":"https://doi.org/10.1145/1058129.1058138","url":null,"abstract":"Texture mapping has been a fundamental feature for commodity graphics hardware. However, a key challenge for texture mapping is how to store and manage large textures on graphics processors. In this paper, we present a tile-based texture mapping algorithm by which we only have to physically store a small set of texture tiles instead of a large texture. Our algorithm generates an arbitrarily large and non-periodic virtual texture map fròm the small set of stored texture tiles. Because we only have to store a small set of tiles, it minimizes the storage requirement to a small constant, regardless of the size of the virtual texture. In addition, the tiles are generated and packed into a single texture map, so that the hardware filtering of this packed texture map corresponds directly to the filtering of the virtual texture. We implement our algorithm as a fragment program, and demonstrate performance on latest graphics processors.","PeriodicalId":266180,"journal":{"name":"Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127525103","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A quadrilateral rendering primitive","authors":"K. Hormann, M. Tarini","doi":"10.1145/1058129.1058131","DOIUrl":"https://doi.org/10.1145/1058129.1058131","url":null,"abstract":"The only surface primitives that are supported by common graphics hardware are triangles and more complex shapes have to be triangulated before being sent to the rasterizer. Even quadrilaterals, which are frequently used in many applications, are rendered as a pair of triangles after splitting them along either diagonal. This creates an undesirable C1 -discontinuity that is visible in the shading or texture signal. We propose a new method that overcomes this drawback and is designed to be implemented in hardware as a new rasterizer. It processes a potentially non-planar quadrilateral directly without any splitting and interpolates attributes smoothly inside the quadrilateral. This interpolation is based on a recent generalization of barycentric coordinates that we adapted to handle perspective correction and situations in which a quadrilateral is partially behind the point of view.","PeriodicalId":266180,"journal":{"name":"Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122215675","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}