提高科学计算中的负载/存储队列使用率

International Conference on Parallel Processing, 2004. ICPP 2004. Pub Date : 2004-08-15 DOI:10.1109/ICPP.2004.1327902

C. Lemuet, W. Jalby, S. Touati

{"title":"提高科学计算中的负载/存储队列使用率","authors":"C. Lemuet, W. Jalby, S. Touati","doi":"10.1109/ICPP.2004.1327902","DOIUrl":null,"url":null,"abstract":"Memory disambiguation mechanisms, coupled with load/store queues in out-of-order processors, are crucial to increase instruction level parallelism (ILP), especially for memory-bound scientific codes. Designing ideal memory disambiguation mechanisms is too complex because it would require precise address bits comparators; thus, modern microprocessors implement simplified and imprecise ones that perform only partial address comparisons. In this paper, we study the impact of such simplifications on the sustained performance of some real processors such that Alpha 21264, Power 4 and Itanium 2. Despite all the advanced features of these processors, we demonstrate in this article that memory address disambiguation mechanisms can cause significant performance loss. We demonstrate that, even if data are located in low cache levels and enough ILP exist, the performance degradation can be up to 21 times slower if no care is taken on the order of accessing independent memory addresses. Instead of proposing a hardware solution to improve load/store queues, as done in [G. Chrysos et al., (1998), S. Sethumadhavan et al., (2003), I. Park et al., (2003), A. Yoaz et al., (1999), S. Onder (2002)], we show that a software (compilation) technique is possible. Such solution is based on the classical (and robust) Id/st vectorization. Our experiments highlight the effectiveness of such method on BLAS 1 codes that are representative of vector scientific loops.","PeriodicalId":106240,"journal":{"name":"International Conference on Parallel Processing, 2004. ICPP 2004.","volume":"109 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2004-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"10","resultStr":"{\"title\":\"Improving load/store queues usage in scientific computing\",\"authors\":\"C. Lemuet, W. Jalby, S. Touati\",\"doi\":\"10.1109/ICPP.2004.1327902\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Memory disambiguation mechanisms, coupled with load/store queues in out-of-order processors, are crucial to increase instruction level parallelism (ILP), especially for memory-bound scientific codes. Designing ideal memory disambiguation mechanisms is too complex because it would require precise address bits comparators; thus, modern microprocessors implement simplified and imprecise ones that perform only partial address comparisons. In this paper, we study the impact of such simplifications on the sustained performance of some real processors such that Alpha 21264, Power 4 and Itanium 2. Despite all the advanced features of these processors, we demonstrate in this article that memory address disambiguation mechanisms can cause significant performance loss. We demonstrate that, even if data are located in low cache levels and enough ILP exist, the performance degradation can be up to 21 times slower if no care is taken on the order of accessing independent memory addresses. Instead of proposing a hardware solution to improve load/store queues, as done in [G. Chrysos et al., (1998), S. Sethumadhavan et al., (2003), I. Park et al., (2003), A. Yoaz et al., (1999), S. Onder (2002)], we show that a software (compilation) technique is possible. Such solution is based on the classical (and robust) Id/st vectorization. Our experiments highlight the effectiveness of such method on BLAS 1 codes that are representative of vector scientific loops.\",\"PeriodicalId\":106240,\"journal\":{\"name\":\"International Conference on Parallel Processing, 2004. ICPP 2004.\",\"volume\":\"109 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2004-08-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"10\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"International Conference on Parallel Processing, 2004. ICPP 2004.\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICPP.2004.1327902\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Conference on Parallel Processing, 2004. ICPP 2004.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICPP.2004.1327902","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 10

摘要

内存消歧机制，加上乱序处理器中的负载/存储队列，对于提高指令级并行性(ILP)至关重要，特别是对于内存约束的科学代码。设计理想的内存消歧机制过于复杂，因为它需要精确的地址位比较器;因此，现代微处理器实现了简化和不精确的，只执行部分地址比较。在本文中，我们研究了这种简化对Alpha 21264、Power 4和Itanium 2等实际处理器的持续性能的影响。尽管这些处理器具有所有的高级特性，但我们在本文中证明，内存地址消歧机制可能会导致显著的性能损失。我们证明，即使数据位于较低的缓存级别，并且存在足够的ILP，如果不注意访问独立内存地址的顺序，性能下降可能会慢21倍。而不是提出硬件解决方案来改善负载/存储队列，如[G]。Chrysos等人，(1998)，S. Sethumadhavan等人，(2003)，I. Park等人，(2003)，a . Yoaz等人，(1999)，S. Onder(2002)]，我们表明软件(编译)技术是可能的。这种解决方案是基于经典的(鲁棒的)Id/st矢量化。我们的实验突出了该方法对具有代表性的矢量科学环路的blas1编码的有效性。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Improving load/store queues usage in scientific computing

Memory disambiguation mechanisms, coupled with load/store queues in out-of-order processors, are crucial to increase instruction level parallelism (ILP), especially for memory-bound scientific codes. Designing ideal memory disambiguation mechanisms is too complex because it would require precise address bits comparators; thus, modern microprocessors implement simplified and imprecise ones that perform only partial address comparisons. In this paper, we study the impact of such simplifications on the sustained performance of some real processors such that Alpha 21264, Power 4 and Itanium 2. Despite all the advanced features of these processors, we demonstrate in this article that memory address disambiguation mechanisms can cause significant performance loss. We demonstrate that, even if data are located in low cache levels and enough ILP exist, the performance degradation can be up to 21 times slower if no care is taken on the order of accessing independent memory addresses. Instead of proposing a hardware solution to improve load/store queues, as done in [G. Chrysos et al., (1998), S. Sethumadhavan et al., (2003), I. Park et al., (2003), A. Yoaz et al., (1999), S. Onder (2002)], we show that a software (compilation) technique is possible. Such solution is based on the classical (and robust) Id/st vectorization. Our experiments highlight the effectiveness of such method on BLAS 1 codes that are representative of vector scientific loops.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

International Conference on Parallel Processing, 2004. ICPP 2004.

自引率

0.00%

发文量