{"title":"PresCount: Effective Register Allocation for Bank Conflict Reduction","authors":"Xiaofeng Guan, Hao Zhou, Guoqing Bao, Handong Li, Liang Zhu, Jianguo Yao","doi":"10.1109/CGO57630.2024.10444841","DOIUrl":"https://doi.org/10.1109/CGO57630.2024.10444841","url":null,"abstract":"Modern processors with large multi-banked register files often rely on hardware solutions to resolve bank conflicts efficiently. However, these hardware-based methods, while flexible, can incur runtime penalties and restrict the exploration of optimized hardware designs. In contrast, compiler-based methods for register bank assignments avoid runtime overhead. However, incorporating bank assignment into the complex register allocation process presents significant challenges, leading existing methods to adopt conservative approaches to avoid potential side effects. This paper introduces the novel register allocation method PresCount, which enhances the coloring strategy for the Register Conflict Graph (RCG) and incorporates a bank pressure tracking mechanism to improve performance. The integrated register bank assigner in PresCount effectively reduces bank conflicts, achieving remarkable reductions of 43.28% and 27.76%, respectively, compared to existing methods on platforms with rich register banks and limited register budgets, as demonstrated by SPECfp and CNN-KERNEL benchmarks. Furthermore, a subgroup splitting technique is introduced to facilitate register allocation under the bank-subgroup register file design, specifically our Domain-Specific Architecture (DSA) for AI computing. This technique demonstrates an impressive 99.85% reduction in bank conflicts for domain-specific kernel functions. By addressing the challenges of bank conflicts in register allocation, the proposed PresCount method showcases significant improvements in performance and efficiency for platforms with different register configurations and domain-specific workloads, allowing for more flexible exploration of optimized hardware designs.","PeriodicalId":517814,"journal":{"name":"2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)","volume":"58 1","pages":"170-181"},"PeriodicalIF":0.0,"publicationDate":"2024-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140285698","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High-Throughput, Formal-Methods-Assisted Fuzzing for LLVM","authors":"Yuyou Fan, John Regehr","doi":"10.1109/CGO57630.2024.10444854","DOIUrl":"https://doi.org/10.1109/CGO57630.2024.10444854","url":null,"abstract":"It is very difficult to thoroughly test a compiler, and as a consequence it is common for released versions of production compilers to contain bugs that cause them to crash and to emit incorrect object code. We created alive-mutate, a mutation-based fuzzing tool that takes test cases written by humans and randomly modifies them, based on the hypothesis that while compiler developers are fundamentally good at writing tests, they also tend to miss corner cases. Alive-mutate is integrated with the Alive2 translation validation tool for LLVM, which is useful because it checks the behavior of optimizations for all possible values of input variables. Alive-mutate is also integrated with the LLVM middle-end, allowing it to perform mutations, optimizations, and formal verification of the optimizations all within a single program—avoiding numerous sources of overhead. Alive-mutate's fuzzing throughput is 12x higher, on average, than a fuzzing workflow that runs mutation, optimization, and formal verification in separate processes. So far we have used alive-mutate to find and report 33 previously unknown bugs in LLVM.","PeriodicalId":517814,"journal":{"name":"2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)","volume":"42 9","pages":"349-358"},"PeriodicalIF":0.0,"publicationDate":"2024-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140285705","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Seer: Predictive Runtime Kernel Selection for Irregular Problems","authors":"Leon Frenot, Fernando Magno Quintão Pereira","doi":"10.1109/CGO57630.2024.10444812","DOIUrl":"https://doi.org/10.1109/CGO57630.2024.10444812","url":null,"abstract":"Modern GPUs are designed for regular problems and suffer from load imbalance when processing irregular data. Prior to our work, a domain expert selects the best kernel to map fine-grained irregular parallelism to a GPU. We instead propose Seer, an abstraction for producing a simple, reproduceable, and understandable decision tree selector model which performs runtime kernel selection for irregular workloads. To showcase our framework, we conduct a case study in Sparse Matrix Vector Multiplication (SpMV), in which Seer predicts the best strategy for a given dataset with an improvement of 2× over the best single iteration kernel across the entire SuiteSparse Matrix Collection dataset.","PeriodicalId":517814,"journal":{"name":"2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)","volume":"60 13","pages":"133-142"},"PeriodicalIF":0.0,"publicationDate":"2024-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140453468","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A System-Level Dynamic Binary Translator Using Automatically-Learned Translation Rules","authors":"Jinhu Jiang, Chaoyi Liang, Rongchao Dong, Zhaohui Yang, Zhongjun Zhou, Wenwen Wang, P. Yew, Weihua Zhang","doi":"10.1109/CGO57630.2024.10444850","DOIUrl":"https://doi.org/10.1109/CGO57630.2024.10444850","url":null,"abstract":"System-level emulators have been used extensively for the design, debugging and evaluation of the system software. They work by providing a system-level virtual machine that can support a guest operating system (OS) running on a platform with the same or different native OS using the same or different instruction-set architecture. For such a system-level emulation, dynamic binary translation (DBT) is one of the core technologies. A recently proposed learning-based approach using automatically-learned translation rules has shown to improve DBT performance significantly with much higher quality translated code. However, it has only been used on user-level emulation, not system-level emulation. In applying this approach directly on QEMU for system-level emulation, we find it actually causes an unexpected performance degradation of 5% on average. By analyzing its main culprits in more detail, we find that the learning-based approach will by default use host registers to maintain the guest CPU states that include condition-code registers (or FLAG registers). In cases where QEMU needs to be involved (in which QEMU also needs to use the host registers), maintaining system states in the host registers for the guest, the host and QEMU during and between the context switches can cause undue overheads, if not handled carefully. Such cases include emulating system-level instructions, address translation and interrupts, which require the use of QEMU's helper functions. To achieve the intended performance improvement through better-quality code generated by the learning-based approach, we propose several optimization techniques that include reducing the overhead incurred in each context switch, the number of needed context switches, and better code scheduling to eliminate context switches. Our experimental results show that such optimizations can achieve an average of 1.36X speedup over QEMU 6.1 using SPEC CINT2006 and 1.15X on real-world applications in the system emulation mode.","PeriodicalId":517814,"journal":{"name":"2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)","volume":"144 ","pages":"423-434"},"PeriodicalIF":0.0,"publicationDate":"2024-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140456325","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PolyTOPS: Reconfigurable and Flexible Polyhedral Scheduler","authors":"Gianpietro Consolaro, Zhen Zhang, Harenome Razanajato, Nelson Lossing, Nassim Tchoulak, Adilla Susungi, Artur Cesar Araujo Alves, Renwei Zhang, Denis Barthou, Corinne Ancourt, Cedric Bastoul","doi":"10.1109/CGO57630.2024.10444791","DOIUrl":"https://doi.org/10.1109/CGO57630.2024.10444791","url":null,"abstract":"Polyhedral techniques have been widely used for automatic code optimization in low-level compilers and higher-level processes. Loop optimization is central to this technique, and several polyhedral schedulers like Feautrier, Pluto, isl and Tensor Scheduler have been proposed, each of them targeting a different architecture, parallelism model, or application scenario. The need for scenario-specific optimization is growing due to the heterogeneity of architectures. One of the most critical cases is represented by NPUs (Neural Processing Units) used for AI, which may require loop optimization with different objectives. Another factor to be considered is the framework or compiler in which polyhedral optimization takes place. Different scenarios, depending on the target architecture, compilation environment, and application domain, may require different kinds of optimization to best exploit the architecture feature set. We introduce a new configurable polyhedral scheduler, PolyTOPS, that can be adjusted to various scenarios with straightforward, high-level configurations. This scheduler allows the creation of diverse scheduling strategies that can be both scenario-specific (like state-of-the-art schedulers) and kernel-specific, breaking the concept of a one-size-fits-all scheduler approach. PolyTOPS has been used with isl and CLooG as code generators and has been integrated in MindSpore AKG deep learning compiler. Experimental results in different scenarios show good performance: a geomean speedup of 7.66x on MindSpore (for the NPU Ascend architecture) hybrid custom operators over isl scheduling, a geomean speedup up to 1.80× on PolyBench on different multicore architectures over Pluto scheduling. Finally, some comparisons with different state-of-the-art tools are presented in the PolyMage scenario.","PeriodicalId":517814,"journal":{"name":"2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)","volume":"24 1","pages":"28-40"},"PeriodicalIF":0.0,"publicationDate":"2024-01-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140510118","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BEC: Bit-Level Static Analysis for Reliability against Soft Errors","authors":"Yousun Ko, Bernd Burgstaller","doi":"10.1109/CGO57630.2024.10444844","DOIUrl":"https://doi.org/10.1109/CGO57630.2024.10444844","url":null,"abstract":"Soft errors are a type of transient digital signal corruption that occurs in digital hardware components such as the internal flip-flops of CPU pipelines, the register file, memory cells, and even internal communication buses. Soft errors are caused by environmental radioactivity, magnetic interference, lasers, and temperature fluctuations, either unintentionally, or as part of a deliberate attempt to compromise a system and expose confidential data. We propose a bit-level error coalescing (BEC) static program analysis and its two use cases to understand and improve program reliability against soft errors. The BEC analysis tracks each bit corruption in the register file and classifies the effect of the corruption by its semantics at compile time. The usefulness of the proposed analysis is demonstrated in two scenarios, fault injection campaign pruning, and reliability-aware program transformation. Experimental results show that bit-level analysis pruned up to 30.04 % of exhaustive fault injection campaigns (13.71 % on average), without loss of accuracy. Program vulnerability was reduced by up to 13.11 % (4.94 % on average) through bit-level vulnerability-aware instruction scheduling. The analysis has been implemented within LLVM and evaluated on the RISC-V architecture. To the best of our knowledge, the proposed BEC analysis is the first bit-level compiler analysis for program reliability against soft errors. The proposed method is generic and not limited to a specific computer architecture.","PeriodicalId":517814,"journal":{"name":"2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)","volume":"51 3","pages":"283-295"},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140510219","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}