{"title":"在加载存储队列中缓存值","authors":"D. Nicolaescu, A. Veidenbaum, A. Nicolau","doi":"10.1109/MASCOT.2004.1348315","DOIUrl":null,"url":null,"abstract":"The latency of an L1 data cache continues to grow with increasing clock frequency, cache size and associativity. The increased latency is an important source of performance loss in high-performance processors. The paper proposes to cache data utilizing the load store queue (LSQ) hardware and data paths. Using very little additional hardware, this inexpensive cache improves performance and reduces energy consumption. The modified load store queue \"caches\" all previously accessed data values going beyond existing store-to-load forwarding techniques. Both load and store data are placed in the LSQ and are retained there after a corresponding memory access instruction has been committed. It is shown that a 128-entry modified LSQ design allows an average of 51% of all loads in the SPECint2000 benchmarks to get their data from the LSQ. Up to 7% performance improvement is achieved on SPECint2000 with a 1-cycle LSQ access latency and 3-cycle L1 cache latency. The average speedup is over 4%.","PeriodicalId":32394,"journal":{"name":"Performance","volume":"25 1","pages":"580-587"},"PeriodicalIF":0.0000,"publicationDate":"2004-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":"{\"title\":\"Caching values in the load store queue\",\"authors\":\"D. Nicolaescu, A. Veidenbaum, A. Nicolau\",\"doi\":\"10.1109/MASCOT.2004.1348315\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The latency of an L1 data cache continues to grow with increasing clock frequency, cache size and associativity. The increased latency is an important source of performance loss in high-performance processors. The paper proposes to cache data utilizing the load store queue (LSQ) hardware and data paths. Using very little additional hardware, this inexpensive cache improves performance and reduces energy consumption. The modified load store queue \\\"caches\\\" all previously accessed data values going beyond existing store-to-load forwarding techniques. Both load and store data are placed in the LSQ and are retained there after a corresponding memory access instruction has been committed. It is shown that a 128-entry modified LSQ design allows an average of 51% of all loads in the SPECint2000 benchmarks to get their data from the LSQ. Up to 7% performance improvement is achieved on SPECint2000 with a 1-cycle LSQ access latency and 3-cycle L1 cache latency. 
The average speedup is over 4%.\",\"PeriodicalId\":32394,\"journal\":{\"name\":\"Performance\",\"volume\":\"25 1\",\"pages\":\"580-587\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2004-10-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"6\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Performance\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/MASCOT.2004.1348315\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Performance","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/MASCOT.2004.1348315","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
The latency of the L1 data cache continues to grow with increasing clock frequency, cache size, and associativity. This increased latency is an important source of performance loss in high-performance processors. The paper proposes caching data using the load store queue (LSQ) hardware and data paths. With very little additional hardware, this inexpensive cache improves performance and reduces energy consumption. The modified load store queue "caches" all previously accessed data values, going beyond existing store-to-load forwarding techniques. Both load and store data are placed in the LSQ and are retained there after the corresponding memory access instruction has committed. It is shown that a 128-entry modified LSQ design allows an average of 51% of all loads in the SPECint2000 benchmarks to get their data from the LSQ. Up to 7% performance improvement is achieved on SPECint2000 with a 1-cycle LSQ access latency and a 3-cycle L1 cache latency; the average speedup is over 4%.
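To illustrate the mechanism the abstract describes, the following is a minimal software sketch of an LSQ whose entries are retained after commit so that later loads can probe it before going to the L1 data cache. The class name, FIFO replacement policy, and entry layout are assumptions made for this sketch, not the authors' actual hardware design; only the 128-entry size and the 1-cycle LSQ versus 3-cycle L1 latencies come from the abstract.

// Minimal conceptual model (not the paper's implementation) of a value-caching LSQ.
#include <cstdint>
#include <iostream>
#include <optional>
#include <vector>

struct LSQEntry {
    uint64_t address = 0;
    uint64_t value   = 0;
    bool     valid   = false;
};

class ValueCachingLSQ {
public:
    explicit ValueCachingLSQ(std::size_t entries) : queue_(entries) {}

    // Record the data of a committed load or store. Entries are kept after
    // commit; the oldest entry is overwritten when the queue wraps around
    // (FIFO replacement is an assumption made for this sketch).
    void record(uint64_t address, uint64_t value) {
        queue_[next_] = {address, value, true};
        next_ = (next_ + 1) % queue_.size();
    }

    // A later load first probes the retained entries; a hit avoids the L1
    // access (1-cycle LSQ latency vs. 3-cycle L1 latency in the paper's
    // evaluation).
    std::optional<uint64_t> lookup(uint64_t address) const {
        // Search from youngest to oldest so the most recent value for an
        // address wins, mirroring store-to-load forwarding priority.
        for (std::size_t i = 0; i < queue_.size(); ++i) {
            std::size_t idx = (next_ + queue_.size() - 1 - i) % queue_.size();
            const LSQEntry& e = queue_[idx];
            if (e.valid && e.address == address) return e.value;
        }
        return std::nullopt;  // miss: fall back to the L1 data cache
    }

private:
    std::vector<LSQEntry> queue_;
    std::size_t next_ = 0;
};

int main() {
    ValueCachingLSQ lsq(128);            // 128-entry design from the abstract
    lsq.record(0x1000, 42);              // committed store leaves its value behind
    if (auto v = lsq.lookup(0x1000))     // later load to the same address hits in the LSQ
        std::cout << "LSQ hit, value = " << *v << '\n';
    return 0;
}

In real hardware the lookup would be an associative match done in parallel with (or before) the L1 access, and entries would also need to be invalidated or updated on conflicting stores; the sequential search above only models the functional behavior.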