{"title":"gpu上的轻量级软件事务","authors":"Anup Holey, Antonia Zhai","doi":"10.1109/ICPP.2014.55","DOIUrl":null,"url":null,"abstract":"Graphics Processing Units (GPUs) provide an attractive option for extracting data-level parallelism from diverse applications. However, some applications, although possess abundant data-level parallelism, exhibit irregular memory access patterns to the shared data structures. Porting such applications to GPUs requires synchronization mechanisms such as locks, which significantly increase the programming complexity. Coarse-grained locking, where a single lock controls all the shared resources, although reduces programming efforts, can substantially serialize GPU threads. On the other hand, fine-grained locking, where each data element is protected by an independent lock, although facilitates maximum parallelism, requires significant programming efforts. To overcome these challenges, we propose to support software transactional memory (STM) on GPU that is able to achieve performance comparable to fine-grained locking, while requiring minimal programming efforts. Software-based transactional execution can incur significant runtime overheads due to activities such as detecting conflicts across thousands of GPU threads and managing a consistent memory state. Thus, in this paper we illustrate three lightweight STM designs that are capable of scaling to a large number of GPU threads. In our system, programmers simply mark the critical sections in the applications, and the underlying STM support is able to achieve performance comparable to fine-grained locking.","PeriodicalId":441115,"journal":{"name":"2014 43rd International Conference on Parallel Processing","volume":"18 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"29","resultStr":"{\"title\":\"Lightweight Software Transactions on GPUs\",\"authors\":\"Anup Holey, Antonia Zhai\",\"doi\":\"10.1109/ICPP.2014.55\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Graphics Processing Units (GPUs) provide an attractive option for extracting data-level parallelism from diverse applications. However, some applications, although possess abundant data-level parallelism, exhibit irregular memory access patterns to the shared data structures. Porting such applications to GPUs requires synchronization mechanisms such as locks, which significantly increase the programming complexity. Coarse-grained locking, where a single lock controls all the shared resources, although reduces programming efforts, can substantially serialize GPU threads. On the other hand, fine-grained locking, where each data element is protected by an independent lock, although facilitates maximum parallelism, requires significant programming efforts. To overcome these challenges, we propose to support software transactional memory (STM) on GPU that is able to achieve performance comparable to fine-grained locking, while requiring minimal programming efforts. Software-based transactional execution can incur significant runtime overheads due to activities such as detecting conflicts across thousands of GPU threads and managing a consistent memory state. Thus, in this paper we illustrate three lightweight STM designs that are capable of scaling to a large number of GPU threads. In our system, programmers simply mark the critical sections in the applications, and the underlying STM support is able to achieve performance comparable to fine-grained locking.\",\"PeriodicalId\":441115,\"journal\":{\"name\":\"2014 43rd International Conference on Parallel Processing\",\"volume\":\"18 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2014-10-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"29\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2014 43rd International Conference on Parallel Processing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICPP.2014.55\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 43rd International Conference on Parallel Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICPP.2014.55","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Graphics Processing Units (GPUs) provide an attractive option for extracting data-level parallelism from diverse applications. However, some applications, although possess abundant data-level parallelism, exhibit irregular memory access patterns to the shared data structures. Porting such applications to GPUs requires synchronization mechanisms such as locks, which significantly increase the programming complexity. Coarse-grained locking, where a single lock controls all the shared resources, although reduces programming efforts, can substantially serialize GPU threads. On the other hand, fine-grained locking, where each data element is protected by an independent lock, although facilitates maximum parallelism, requires significant programming efforts. To overcome these challenges, we propose to support software transactional memory (STM) on GPU that is able to achieve performance comparable to fine-grained locking, while requiring minimal programming efforts. Software-based transactional execution can incur significant runtime overheads due to activities such as detecting conflicts across thousands of GPU threads and managing a consistent memory state. Thus, in this paper we illustrate three lightweight STM designs that are capable of scaling to a large number of GPU threads. In our system, programmers simply mark the critical sections in the applications, and the underlying STM support is able to achieve performance comparable to fine-grained locking.