{"title":"Addressing mode driven low power data caches for embedded processors","authors":"R. Peri, John Fernando, R. Kolagotla","doi":"10.1145/1054943.1054961","DOIUrl":"https://doi.org/10.1145/1054943.1054961","url":null,"abstract":"The size and speed of first-level caches and SRAMs of embedded processors continue to increase in response to demands for higher performance. In power-sensitive devices like PDAs and cellular handsets, decreasing power consumption while increasing performance is desirable. Contemporary caches typically exploit locality in memory access patterns but do not exploit locality information encoded in addressing modes used to access memory. We present two schemes that use locality information inherent in memory addressing modes to reduce power consumption of cache or SRAM nearest to the processor. The level-0 data buffer scheme introduces a set of data buffers controlled by the addressing mode to eliminate over a third of all reads to the next level of memory (cache or SRAM). These buffers can also reduce load-use penalty in processors with long load pipelines. The address register tag-buffer scheme exploits the addressing mode to reduce tag array look-up in set associative first-level caches.","PeriodicalId":249099,"journal":{"name":"Workshop on Memory Performance Issues","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124957609","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A study of performance impact of memory controller features in multi-processor server environment","authors":"C. Natarajan, Bruce Christenson, F. Briggs","doi":"10.1145/1054943.1054954","DOIUrl":"https://doi.org/10.1145/1054943.1054954","url":null,"abstract":"With the growing imbalance between processor and memory performance it becomes more and more important to optimize the memory controller features to obtain the maximum possible performance out of the memory subsystem. This paper presents a study of the performance impact of several memory controller features in multi-processor (MP) server environments that use a DDR/DDR2 based memory subsystem. The results from our studies show that significant performance improvements can be obtained by carefully optimizing the memory controller features. For instance, one of our studies shows that in a system with an in-order shared bus connecting the CPUs and memory controller, an intelligent read-to-write switching memory controller feature can provide the same order of benefit as doubling the number of interleaved memory ranks. Another study shows that much lower average loaded read latency across a wider range of throughput can be obtained by a delayed write scheduling feature.","PeriodicalId":249099,"journal":{"name":"Workshop on Memory Performance Issues","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133344514","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Opie compiler from row-major source to Morton-ordered matrices","authors":"Steven T. Gabriel, David S. Wise","doi":"10.1145/1054943.1054962","DOIUrl":"https://doi.org/10.1145/1054943.1054962","url":null,"abstract":"The Opie Project aims to develop a compiler to transform C codes written for row-major matrix representation into equivalent codes for Morton-order matrix representation, and to apply its techniques to other languages. Accepting a possible reduction in performance we seek to compile a library of usable code to support future development of new algorithms better suited to Morton-ordered matrices.This paper reports the formalism behind the OPIE compiler for C, its status: now compiling several standard Level-2 and Level-3 linear algebra operations, and a demonstration of a breakthrough reflected in a huge reduction of L1, L2, TLB misses. Overall performance improves on the Intel Xeon architecture.","PeriodicalId":249099,"journal":{"name":"Workshop on Memory Performance Issues","volume":"47 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130807587","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cache organizations for clustered microarchitectures","authors":"José González, Fernando Latorre, Antonio González","doi":"10.1145/1054943.1054950","DOIUrl":"https://doi.org/10.1145/1054943.1054950","url":null,"abstract":"Clustered microarchitectures are an effective organization to deal with the problem of wire delays and complexity by partitioning some of the processor resources. The organization of the data cache is a key factor in these processors due to its effect on cache miss rate and inter-cluster communications. This paper investigates alternative designs of the data cache: centralized, distributed, replicated and physically distributed cache architectures are analyzed. Results show similar average performance but significant performance variations depending on the application features, specially cache miss ratio and communications. In addition, we also propose a novel instruction steering scheme in order to reduce communications. This scheme conditionally stalls the dispatch of instructions depending on the occupancy of the clusters, whenever the current instruction cannot be steered to the cluster holding most of the inputs. This new steering outperforms traditional schemes. Results show, an average speedup of 5% and up to 15% for some applications.","PeriodicalId":249099,"journal":{"name":"Workshop on Memory Performance Issues","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133599574","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An analytical model for software-only main memory compression","authors":"I. Tuduce, T. Gross","doi":"10.1145/1054943.1054958","DOIUrl":"https://doi.org/10.1145/1054943.1054958","url":null,"abstract":"Many applications with large data spaces that cannot run on a typical workstation (due to page faults) call for techniques to expand the effective memory size. One such technique is memory compression.Understanding what applications under what conditions can benefit from main memory compression is complicated due to various tradeoffs and the dynamic characteristics of applications. For instance, a large area to store compressed data increases the effective memory size considerably but also decreases the amount of memory that can hold uncompressed data.This paper presents an analytical model that states the conditions for a compressed-memory system to yield performance improvements. Parameters of the model are the compression algorithm efficiency, the amount of data being compressed, and the application memory access pattern. Such a model can be used by an operating system to compute the size of the compressed-memory level that can improve an application's performance.","PeriodicalId":249099,"journal":{"name":"Workshop on Memory Performance Issues","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127418165","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A low cost, multithreaded processing-in-memory system","authors":"J. Brockman, Shyamkumar Thoziyoor, Shannon K. Kuntz, P. Kogge","doi":"10.1145/1054943.1054946","DOIUrl":"https://doi.org/10.1145/1054943.1054946","url":null,"abstract":"This paper discusses die cost vs. performance tradeoffs for a PIM system that could serve as the memory system of a host processor. For an increase of less than twice the cost of a commodity DRAM part, it is possible to realize a performance speedup of nearly a factor of 4 on irregular applications. This cost efficiency derives from developing a custom multithreaded processor architecture and implementation style that is well-suited for embedding in a memory. Specifically, it takes advantage of the low latency and high row bandwidth to both simplify processor design --- reducing area --- as well as to improve processing throughput. To support our claims of cost and performance, we have used simulation, analysis of existing chips, and also designed and fully implemented a prototype chip, PIM Lite.","PeriodicalId":249099,"journal":{"name":"Workshop on Memory Performance Issues","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123655169","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A compressed memory hierarchy using an indirect index cache","authors":"Erik G. Hallnor, S. Reinhardt","doi":"10.1145/1054943.1054945","DOIUrl":"https://doi.org/10.1145/1054943.1054945","url":null,"abstract":"The large and growing impact of memory hierarchies on overall system performance compels designers to investigate innovative techniques to improve memory-system efficiency. We propose and analyze a memory hierarchy that increases both the effective capacity of memory structures and the effective bandwidth of interconnects by storing and transmitting data in compressed form.Caches play a key role in hiding memory latencies. However, cache sizes are constrained by die area and cost. A cache's effective size can be increased by storing compressed data, if the storage unused by a compressed block can be allocated to other blocks. We use a modified Indirect Index Cache to allocate variable amounts of storage to different blocks, depending on their compressibility.By coupling our compressed cache design with a similarly compressed main memory, we can easily transfer data between these structures in a compressed state, increasing the effective memory bus bandwidth. This optimization further improves performance when bus bandwidth is critical.Our simulation results, using the SPEC CPU2000 benchmarks, show that our design increases performance by up to 225% on some benchmarks while degrading performance in general by no more than 2%, other than a 12% decrease on a single benchmark. Compressed bus transfers alone account for up to 80% of this improvement, with the remainder coming from increased effective cache capacity. As memory latencies increase, our design becomes even more beneficial.","PeriodicalId":249099,"journal":{"name":"Workshop on Memory Performance Issues","volume":"208 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123392756","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SCIMA-SMP: on-chip memory processor architecture for SMP","authors":"C. Takahashi, Masaaki Kondo, T. Boku, D. Takahashi, Hiroshi Nakamura, M. Sato","doi":"10.1145/1054943.1054960","DOIUrl":"https://doi.org/10.1145/1054943.1054960","url":null,"abstract":"In this paper, we propose a processor architecture with programmable on-chip memory for a high-performance SMP (symmetric multi-processor) node named SCIMA-SMP (Software Controlled Integrated Memory Architecture for SMP) with the intent of solving the performance gap problem between a processor and off-chip memory. With special instructions which enable the explicit data transfer between on-chip memory and off-chip memory, this architecture is able to control the data transfer timing and its granularity by the application program, and the SMP bus is utilized efficiently compared with traditional cache-only architecture. Through the performance evaluation based on clock-level simulation for various HPC applications, we confirmed that this architecture largely reduces the bus access cycle by avoiding redundant data transfer and controlling the granularity of the data movement between on-chip and off-chip memory.","PeriodicalId":249099,"journal":{"name":"Workshop on Memory Performance Issues","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122293411","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A localizing directory coherence protocol","authors":"Collin McCurdy, C. Fischer","doi":"10.1145/1054943.1054947","DOIUrl":"https://doi.org/10.1145/1054943.1054947","url":null,"abstract":"User-controllable coherence revives the idea of cooperation between software and hardware in an attempt to bridge the gap between efficient small-scale shared memory machines and massive distributed memory machines. It proposes a new multiprocessor architecture which has both a global address-space and multiple processor-local address-spaces with new memory instructions and a new coherence protocol to manage the dual address-spaces.The purpose of this paper is twofold. First, we solidify the semantics of instruction set extensions that enable \"localization\" -- the act of moving data from the global address-space to a processor's local address-space -- thus clearly defining the requirements for a localizing coherence protocol. Second, we demonstrate the feasibility of localizing coherence by describing the workings of a full-scale directory-based protocol that we have implemented and tested using an existing protocol specification tool.","PeriodicalId":249099,"journal":{"name":"Workshop on Memory Performance Issues","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133592251","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalable cache memory design for large-scale SMT architectures","authors":"M. Mudawar","doi":"10.1145/1054943.1054952","DOIUrl":"https://doi.org/10.1145/1054943.1054952","url":null,"abstract":"The cache hierarchy design in existing SMT and superscalar processors is optimized for latency, but not for band-width. The size of the L1 data cache did not scale over the past decade. Instead, larger unified L2 and L3 caches were introduced. This cache hierarchy has a high overhead due to the principle of containment. It also has a complex design to maintain cache coherence across all levels. Furthermore, this cache hierarchy is not suitable for future large-scale SMT processors, which will demand high bandwidth instruction and data caches with a large number of ports.This paper suggests the elimination of the cache hierarchy and replacing it with one-level caches for instruction and data. Multiple instruction caches can be used in parallel to scale the instruction fetch bandwidth and the overall cache capacity. A one-level data cache can be split into a number of block-interleaved cache banks to serve multiple memory requests in parallel. An interconnect is used to connect the data cache ports to the different cache banks, thus increasing the data cache access time. This paper shows that large-scale SMTs can tolerate long data cache hit times. It also shows that small line buffers can enhance the performance and reduce the required number of ports to the banked data cache memory.","PeriodicalId":249099,"journal":{"name":"Workshop on Memory Performance Issues","volume":"60 4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129723096","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}