Jen-Cheng Huang, Joo Hwan Lee, Hyesoon Kim, H. Lee
2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 268-279. Published 2014-12-13. DOI: 10.1109/MICRO.2014.59 (https://doi.org/10.1109/MICRO.2014.59)
GPUMech: GPU Performance Modeling Technique Based on Interval Analysis
GPUs have become a first-order computing platform. Nonetheless, few performance modeling techniques have been developed for architecture studies. Several GPU analytical performance models have been proposed, but they mostly target application optimization rather than the study of different architecture design options. Interval analysis is a relatively accurate performance modeling technique that traverses the instruction trace and uses functional simulators, e.g., a cache simulator, to track the stall events that cause performance loss. It is on the order of a hundred times faster than detailed timing simulation and more accurate than pure analytical models. However, previous interval analysis techniques are limited to CPUs and are not applicable to multithreaded architectures. In this work, we propose GPUMech, an interval analysis-based performance modeling technique for GPU architectures. GPUMech models multithreading and the resource contention caused by memory divergence. We compare GPUMech with a detailed timing simulator and show that, on average, GPUMech has 13.2% error when modeling the round-robin scheduling policy and 14.0% error when modeling the greedy-then-oldest policy, while simulating 97x faster. In addition, GPUMech generates CPI stacks, which help hardware and software developers visualize the performance bottlenecks of a kernel.
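To make the idea of interval analysis concrete, the following is a minimal single-threaded sketch, not GPUMech's actual model. It walks an instruction trace once, asks a functional cache model whether each load hits, charges a fixed penalty per miss, and reports the result as a CPI stack. The trace format, the names `LRUCache` and `cpi_stack`, and every parameter value are assumptions made up for illustration; the multithreading and memory-divergence contention that GPUMech models are left out entirely.

```python
from collections import OrderedDict

# Hypothetical parameters, chosen only for illustration.
ISSUE_WIDTH = 1      # instructions issued per cycle when nothing stalls
MISS_PENALTY = 100   # cycles lost per cache miss
LINE_SIZE = 64       # bytes per cache line
CACHE_LINES = 2      # capacity of the toy fully associative LRU cache


class LRUCache:
    """Functional cache model: reports hit or miss, tracks no timing."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def access(self, address):
        tag = address // LINE_SIZE
        if tag in self.lines:
            self.lines.move_to_end(tag)      # refresh LRU position
            return True
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)   # evict least recently used line
        self.lines[tag] = None
        return False


def cpi_stack(trace):
    """Walk the trace once; return cycles per instruction by component."""
    cache = LRUCache(CACHE_LINES)
    cycles = {"base": 0.0, "memory": 0.0}
    for op, address in trace:
        cycles["base"] += 1.0 / ISSUE_WIDTH  # steady-state issue cost
        if op == "load" and not cache.access(address):
            cycles["memory"] += MISS_PENALTY  # stall event ends an interval
    n = len(trace)
    return {name: c / n for name, c in cycles.items()}


# Toy trace of (operation, byte address) pairs; ALU ops carry no address.
trace = [("load", 0), ("alu", None), ("load", 8), ("load", 64),
         ("alu", None), ("load", 128), ("load", 0), ("alu", None)]
stack = cpi_stack(trace)
print(stack, "total CPI:", sum(stack.values()))
```

Because the cache model is purely functional, the walk costs one pass over the trace rather than a cycle-by-cycle simulation, which is where the speed advantage of this style of modeling comes from. The per-component breakdown is the same kind of information a CPI stack conveys: here, how much of the total CPI is base issue cost and how much is memory stalls.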