An Adaptive Block Pinning Cache for Reducing Network Traffic in Multi-core Architectures
Authors: N. Chaturvedi, S. Gurunarayanan
Published in: 2013 5th International Conference on Computational Intelligence and Communication Networks
Publication date: 2013-09-27
DOI: 10.1109/CICN.2013.98 (https://doi.org/10.1109/CICN.2013.98)
Citation count: 0
Abstract
With the advent of new technologies, cache sizes in multi-core processors (CMPs) are growing exponentially. Combined with growing on-chip wire delays, this makes it difficult to implement traditional caches with a single, uniform access latency. Non-Uniform Cache Architecture (NUCA) designs have been proposed to address this issue. A NUCA partitions the cache into multiple smaller banks and allows banks near the processor cores to have lower access latencies than those farther away, thus reducing the effect of the cache's internal wire delays. Traditionally, NUCA organizations have been classified as static (S-NUCA) or dynamic (D-NUCA). In S-NUCA a data block is mapped to a unique bank in the NUCA cache, whereas D-NUCA allows a data block to be mapped to multiple banks. In D-NUCA designs, data blocks can migrate toward the processor core that accesses them most frequently. This migration of data blocks increases network traffic. The short lifetime of data blocks and the low spatial locality of many applications result in blocks being evicted while some of their words are still unused. This effectively increases the miss rate and wastes on-chip network bandwidth. Transferring unused words also wastes a large fraction of on-chip energy. In this paper, we present an efficient and implementable cache design that eliminates unnecessary coherence traffic and matches data movement to an application's spatial locality. The paper also presents one way to scale on-chip coherence with cost-effective techniques such as shared caches augmented to track cached copies, explicit eviction notification, and hierarchical design. Based on our scalability analysis of this cache design, we predict that it consistently reduces the miss rate and improves the fraction of transmitted data that is actually used by the application.
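The S-NUCA and D-NUCA mappings described above can be illustrated with a minimal sketch. It is not the authors' design: the abstract gives no details of the pinning mechanism, so the sketch covers only the background concepts. The block size, bank count, bank-set size, the nearest-first ordering of banks, the one-step promotion policy, and all names (`snuca_bank`, `dnuca_bankset`, `DNucaBlock`, `used_fraction`) are assumptions made for illustration.

```python
from dataclasses import dataclass

# Assumed parameters; the abstract does not specify any of these.
BLOCK_BYTES = 64     # bytes per cache block
NUM_BANKS = 16       # total NUCA banks
BANKS_PER_SET = 4    # banks a block may occupy under D-NUCA


def snuca_bank(addr: int) -> int:
    """S-NUCA: a block address maps to exactly one bank."""
    return (addr // BLOCK_BYTES) % NUM_BANKS


def dnuca_bankset(addr: int) -> list[int]:
    """D-NUCA: a block may reside in any bank of its bank set.

    The list is assumed to be ordered nearest-first for one
    hypothetical requesting core.
    """
    num_sets = NUM_BANKS // BANKS_PER_SET
    first = (addr // BLOCK_BYTES) % num_sets
    return [first + i * num_sets for i in range(BANKS_PER_SET)]


@dataclass
class DNucaBlock:
    addr: int
    position: int        # index into dnuca_bankset(addr); 0 is nearest
    migrations: int = 0  # block transfers caused by promotion

    def access(self) -> int:
        """On a hit, promote the block one bank closer to the core.

        Each promotion moves a whole block across the on-chip network,
        which is the extra traffic the abstract refers to.
        """
        if self.position > 0:
            self.position -= 1
            self.migrations += 1
        return dnuca_bankset(self.addr)[self.position]


def used_fraction(words_touched: set[int], words_per_block: int = 8) -> float:
    """Fraction of a block's words that were accessed before eviction."""
    return len(words_touched) / words_per_block
```

Under these assumptions, a block that starts in the farthest bank of its set and is hit repeatedly moves one bank per hit until it reaches the nearest bank, and each move is counted in `migrations`. `used_fraction` gives a simple per-block view of the "fraction of transmitted data actually used" that the abstract names as its metric.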