IEEE Transactions on Computers: Latest Articles

Pako: Multi-Valued Byzantine Agreement Comparable to Partially-Synchronous BFT
IF 3.6, Zone 2, Computer Science
IEEE Transactions on Computers Pub Date : 2024-12-02 DOI: 10.1109/TC.2024.3510620
Xiaohai Dai;Zhengxuan Guo;Jiang Xiao;Guanxiong Wang;Yifei Liang;Chen Yu;Hai Jin
Abstract: Asynchronous Byzantine Fault Tolerance (BFT) consensus protocols are gaining attention for their resilience against network attacks. Among them, Multi-valued Byzantine Agreement (MVBA) protocols play a critical role: they accept an input value from each replica and return a consistent output. The state-of-the-art MVBA protocol, sMVBA, has a good-case latency of $6\delta$ and an expected bad-case latency of $12\delta$, with $\delta$ representing the network delay. Additionally, sMVBA exhibits a communication complexity of $O(n^{2})$ in both good and bad cases. Although it outperforms other MVBA protocols, sMVBA still lags behind its partially-synchronous counterparts. For instance, PBFT achieves a good-case latency of $3\delta$, and HotStuff achieves a good-case communication complexity of $O(n)$. This paper introduces a novel MVBA protocol, Pako, aiming for performance comparable to partially-synchronous protocols. Pako leverages an existing MVBA protocol as a black box and introduces an additional view with an optimistic path to commit values efficiently. Two Pako variants, Pako1 and Pako2, provide a trade-off between latency and communication. Specifically, Pako1 achieves a good-case latency of $3\delta$ with $O(n^{2})$ communication, while Pako2 reduces the communication to $O(n)$ with a slightly higher good-case latency of $5\delta$. A series of experiments demonstrates that Pako significantly outperforms its counterparts.
IEEE Transactions on Computers, vol. 74, no. 3, pp. 887-900. PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10772576
Citations: 0
RuYi: Optimizing Burst Buffer Through Automated, Fine-Grained Process-to-BB Mapping
IF 3.6, Zone 2, Computer Science
IEEE Transactions on Computers Pub Date : 2024-12-02 DOI: 10.1109/TC.2024.3510624
Yusheng Hua;Xuanhua Shi;Ligang He;Kang He;Teng Zhang;Hai Jin;Yong Chen
Abstract: Current supercomputers use an SSD-based storage layer called Burst Buffer (BB) to provide I/O-intensive applications with accelerated storage access. However, efficiently utilizing this limited and expensive storage remains a critical issue, creating an urgent need for implementing Quality of Service (QoS) in BB. To address this, we propose RuYi, a QoS-aware method to provide applications with bandwidth guarantees in the BB file system. RuYi tackles two main issues. First, it quantitatively profiles available bandwidth resources in BB to ensure reliable QoS, a crucial aspect seldom studied in the literature. Second, RuYi offers fine-grained process-level QoS via an innovative process-to-BB mapping, maximizing resource utilization, something not achievable with conventional coarse-grained compute-to-BB mapping. We evaluated RuYi on a subsystem of the leading exascale supercomputer Sunway, consisting of 4,000 compute nodes and 200 BB nodes. The experimental results demonstrate that RuYi achieves an end-to-end bandwidth control accuracy of 97%, while improving BB utilization by up to 116% compared to conventional coarse-grained compute-to-BB mapping.
IEEE Transactions on Computers, vol. 74, no. 3, pp. 955-967. PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10772616
Citations: 0
Nicaea: A Byzantine Fault Tolerant Consensus Under Unpredictable Message Delivery Failures for Parallel and Distributed Computing
IF 3.6, Zone 2, Computer Science
IEEE Transactions on Computers Pub Date : 2024-11-27 DOI: 10.1109/TC.2024.3506856
Guanlin Jing;Yifei Zou;Minghui Xu;Yanqiang Zhang;Dongxiao Yu;Zhiguang Shan;Xiuzhen Cheng;Rajiv Ranjan
Abstract: Byzantine fault-tolerant (BFT) consensus is a critical problem in parallel and distributed computing systems, particularly with potential adversaries. Most prior work on BFT consensus assumes reliable message delivery and tolerates arbitrary failures of up to $\frac{n}{3}$ nodes out of $n$ total nodes. However, many systems face unpredictable message delivery failures. This paper investigates the impact of unpredictable message delivery failures on the BFT consensus problem. We propose Nicaea, a novel protocol enabling consensus among loyal nodes when the number of Byzantine nodes is below a new threshold, given by $\frac{(2-\rho)(1-\rho)^{2n-2}-1}{(2-\rho)(1-\rho)^{2n-2}+1}n$, where $\rho$ denotes the message failure rate. Theoretical proofs and experimental results validate Nicaea's Byzantine resilience. Our findings reveal a fundamental trade-off: as message delivery instability increases, a system's tolerance to Byzantine failures decreases. The well-known $\frac{n}{3}$ threshold under reliable message delivery is a special case of our generalized threshold when $\rho=0$. To the best of our knowledge, this work presents the first quantitative characterization of the impact of unpredictable message delivery failures on Byzantine fault tolerance in parallel and distributed computing.
IEEE Transactions on Computers, vol. 74, no. 3, pp. 915-928.
Citations: 0
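The threshold in the Nicaea abstract above can be checked numerically. The following is a minimal sketch that only evaluates the stated expression; the function name `nicaea_threshold` and the use of exact fractions are illustrative assumptions, not anything taken from the paper.

```python
from fractions import Fraction


def nicaea_threshold(n, rho):
    """Evaluate ((2-rho)(1-rho)^(2n-2) - 1) / ((2-rho)(1-rho)^(2n-2) + 1) * n.

    n   -- total number of nodes (positive integer)
    rho -- message failure rate, 0 <= rho < 1 (int or Fraction)
    The name and signature are hypothetical; only the formula is from the abstract.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    rho = Fraction(rho)  # exact arithmetic avoids rounding noise
    if not 0 <= rho < 1:
        raise ValueError("rho must satisfy 0 <= rho < 1")
    x = (2 - rho) * (1 - rho) ** (2 * n - 2)
    return (x - 1) / (x + 1) * n


if __name__ == "__main__":
    # rho = 0 recovers the classical n/3 bound.
    print(nicaea_threshold(4, 0))
    # A small failure rate already lowers the bound.
    print(float(nicaea_threshold(4, Fraction(1, 100))))
```

With $\rho=0$ the factor $(2-\rho)(1-\rho)^{2n-2}$ equals 2, so the expression reduces to $\frac{n}{3}$, matching the special case named in the abstract. The expression itself becomes non-positive once that factor drops to 1 or below.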
Feynman Meets Turing: The Uncomputability of Quantum Gate-Circuit Emulation and Concatenation
IF 3.6, Zone 2, Computer Science
IEEE Transactions on Computers Pub Date : 2024-11-27 DOI: 10.1109/TC.2024.3506861
Holger Boche;Yannik N. Böck;Zoe Garcia del Toro;Frank H. P. Fitzek
Abstract: We investigate the feasibility of computing quantum gate-circuit emulation (QGCE) and quantum gate-circuit concatenation (QGCC) on digital hardware. QGCE serves the purpose of rewriting gate circuits composed of gates from a varying input gate set to gate circuits formed of gates from a fixed target gate set. Analogously, QGCC serves the purpose of finding an approximation to the concatenation of two arbitrary elements of a varying list of input gate circuits in terms of another element from the same list. Problems of this kind occur regularly in quantum computing and are often assumed to be an easy task for the digital computers controlling the quantum hardware. Arguably, this belief is due to analogical reasoning: the classical Boolean equivalents of QGCE and QGCC are natively computable on digital hardware. In the present paper, we present two insights in this regard. Upon applying a rigorous theory of computability, QGCE and QGCC turn out to be uncomputable on digital hardware. The results remain valid when we restrict the set of feasible inputs for the relevant functions to one-parameter families of fixed gate sets. Our results underline the possibility that several ideas from quantum-computing theory may require rethinking to become feasible for practical implementation.
IEEE Transactions on Computers, vol. 74, no. 3, pp. 1053-1065. PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10770186
Citations: 0
Humas: A Heterogeneity- and Upgrade-Aware Microservice Auto-Scaling Framework in Large-Scale Data Centers
IF 3.6, Zone 2, Computer Science
IEEE Transactions on Computers Pub Date : 2024-11-27 DOI: 10.1109/TC.2024.3506862
Qin Hua;Dingyu Yang;Shiyou Qian;Jian Cao;Guangtao Xue;Minglu Li
Abstract: An effective auto-scaling framework is essential for microservices to ensure performance stability and resource efficiency under dynamic workloads. As revealed by many prior studies, the key to efficient auto-scaling lies in accurately learning performance patterns, i.e., the relationship between performance metrics and workloads in data-driven schemes. However, we notice that there are two significant challenges in characterizing performance patterns for large-scale microservices. Firstly, diverse microservices demonstrate varying sensitivities to heterogeneous machines, which makes it difficult to quantify the performance difference in a fixed manner. Secondly, frequent version upgrades of microservices result in uncertain changes in performance patterns, known as pattern drifts, leading to imprecise resource capacity estimation. To address these challenges, we propose Humas, a heterogeneity- and upgrade-aware auto-scaling framework for large-scale microservices. Firstly, Humas quantifies the difference in resource efficiency among heterogeneous machines for various microservices online and normalizes their resources in standard units. Additionally, Humas develops a least-squares density-difference (LSDD) based algorithm to identify pattern drifts caused by upgrades. Lastly, Humas generates capacity adjustment plans for microservices based on the latest performance patterns and predicted workloads. Experiments conducted on 50 real microservices with over 11,000 containers demonstrate that Humas improves resource efficiency and performance stability by approximately 30.4% and 48.0%, respectively, compared to state-of-the-art approaches.
IEEE Transactions on Computers, vol. 74, no. 3, pp. 968-982.
Citations: 0
A Parallel Computing Scheme Utilizing Memristor Crossbars for Fast Corner Detection and Rotation Invariance in the ORB Algorithm
IF 3.6, Zone 2, Computer Science
IEEE Transactions on Computers Pub Date : 2024-11-22 DOI: 10.1109/TC.2024.3504817
Qinghui Hong;Haoyou Jiang;Pingdan Xiao;Sichun Du;Tao Li
Abstract: The Oriented FAST and Rotated BRIEF (ORB) algorithm plays a crucial role in rapidly extracting image keypoints. However, in high-frame-rate real-time applications, the algorithm faces challenges in speed and computational efficiency as both the size and the number of images increase. To address this issue, this paper proposes, for the first time, an ORB algorithm accelerator based on a computing-in-memory (CIM) circuit, which replaces the iterative calculations in traditional methods with one-step parallel analog computation. The proposed accelerator improves computational efficiency through CIM technology and enhances speed through parallel computation. Simulations demonstrate that the proposed method achieves an average processing speed $22\times$ faster than traditional methods and obtains a more uniform corner distribution in large-scale images.
IEEE Transactions on Computers, vol. 74, no. 3, pp. 996-1010.
Citations: 0
Mix-GEMM: Extending RISC-V CPUs for Energy-Efficient Mixed-Precision DNN Inference Using Binary Segmentation
IF 3.6 2区 计算机科学
IEEE Transactions on Computers Pub Date : 2024-11-21 DOI: 10.1109/TC.2024.3500369
Jordi Fornt;Enrico Reggiani;Pau Fontova-Musté;Narcís Rodas;Alessandro Pappalardo;Osman Sabri Unsal;Adrián Cristal Kestelman;Josep Altet;Francesc Moll;Jaume Abella
Abstract: Efficiently computing Deep Neural Networks (DNNs) has become a primary challenge in today's computers, especially on devices targeting mobile or edge applications. Recent progress on Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT) has shown that the key to high energy efficiency lies in executing deep learning models with low (8- to 5-bit) or ultra-low (4- to 2-bit) precision. Unfortunately, current Central Processing Unit (CPU) architectures and Instruction Set Architectures (ISAs) present severe limitations on the range of data sizes supported to compute DNN kernels. In this work, we present Mix-GEMM, a hardware-software co-designed architecture that enables RISC-V processors to efficiently compute arbitrary mixed-precision DNN kernels, supporting all data size combinations from 8- to 2-bit. By applying binary segmentation, our architecture can scale its throughput by decreasing the data size of the operands, resulting in a flexible approach capable of leveraging state-of-the-art QAT and PTQ to achieve high energy efficiency at a very low cost. Evaluating our Mix-GEMM architecture in a dual-issue in-order RISC-V processor shows that we are able to boost its performance and energy efficiency by up to $44\times$ and $11\times$ with respect to the baseline processor, with an area overhead of only 2%. This allows our extended processor to execute state-of-the-art DNNs with significantly higher performance and energy efficiency than the standard FP32 precision, while retaining almost the same model accuracy.
IEEE Transactions on Computers, vol. 74, no. 2, pp. 582-596.
Citations: 0
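The binary segmentation idea named in the Mix-GEMM abstract above can be illustrated in software. The sketch below shows the generic technique only (packing several narrow unsigned operands into one wide integer so that a single multiplication yields their dot product); it is not the paper's hardware design, and the function name, the unsigned-only restriction, and the slot-width choice are assumptions made for illustration.

```python
def dot_by_binary_segmentation(a, b, a_bits, b_bits):
    """Dot product of two equal-length unsigned vectors via ONE big multiply.

    Each element of `a` must fit in a_bits, each element of `b` in b_bits.
    Hypothetical helper: illustrates the packing trick, not Mix-GEMM itself.
    """
    if len(a) != len(b) or not a:
        raise ValueError("vectors must be non-empty and of equal length")
    if any(not 0 <= x < (1 << a_bits) for x in a):
        raise ValueError("element of a out of range")
    if any(not 0 <= y < (1 << b_bits) for y in b):
        raise ValueError("element of b out of range")

    k = len(a)
    # Slot width: product bits plus ceil(log2(k)) guard bits, so that the
    # sum of up to k partial products never carries into the next slot.
    slot = a_bits + b_bits + (k - 1).bit_length()

    # Pack a in natural order and b in reversed order, one element per slot.
    packed_a = sum(x << (i * slot) for i, x in enumerate(a))
    packed_b = sum(y << (i * slot) for i, y in enumerate(reversed(b)))

    # Slot k-1 of the product collects exactly sum(a[i] * b[i]).
    product = packed_a * packed_b
    return (product >> ((k - 1) * slot)) & ((1 << slot) - 1)


if __name__ == "__main__":
    # Four 2-bit x 2-bit products folded into a single multiplication.
    print(dot_by_binary_segmentation([3, 1, 2, 0], [1, 2, 3, 1], 2, 2))
```

The slot width shrinks as the operand widths shrink, so a multiplier of fixed width can hold more elements per multiplication at lower precision. This is one way to read the abstract's statement that throughput scales as the operand data size decreases.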
Access-Pattern Hiding Search Over Encrypted Databases by Using Distributed Point Functions
IF 3.6 2区 计算机科学
IEEE Transactions on Computers Pub Date : 2024-11-21 DOI: 10.1109/TC.2024.3504288
Hongcheng Xie;Yu Guo;Yinbin Miao;Xiaohua Jia
Abstract: Encrypted databases have been extensively studied with the increasing concern for data privacy in cloud services. For practical efficiency, most encrypted database systems are built on Dynamic Searchable Symmetric Encryption (DSSE) schemes to support fast query and update over encrypted data. However, DSSE schemes allow leakages in their security frameworks, especially access-pattern leakages (i.e., the search results corresponding to queried keywords), which enable various attacks that infer sensitive information about queries and databases. Existing oblivious-access techniques, such as Oblivious RAM and differential privacy, suffer from excessive communication overhead and loss of query accuracy. In this paper, we propose a new DSSE scheme that enables access-pattern hiding keyword search and update operations. Servers can obliviously query and update databases within only a single communication round. Our building block is based on the Distributed Point Function (DPF), an advanced secret sharing technique that provides provable security guarantees against adversaries with arbitrary background knowledge. Moreover, we devise a novel update protocol that integrates DPF and Somewhat Homomorphic Encryption (SHE) such that servers can obliviously update their local data. We formally analyze the security and implement the prototype. The comprehensive experimental results demonstrate the security and efficiency of our scheme.
IEEE Transactions on Computers, vol. 74, no. 3, pp. 1066-1078.
Citations: 0
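To give a feel for what a point function shared between two servers buys, here is a deliberately naive sketch. It secret-shares a one-hot selection vector with XOR and uses it for a private lookup across two non-colluding servers. This is not the paper's scheme and not a real DPF: actual DPF constructions produce much shorter keys than the linear-size shares used here, and all function names below are hypothetical.

```python
import secrets


def share_point_function(alpha, size):
    """Split the one-hot vector e_alpha into two XOR shares.

    Each share on its own is a uniformly random bit vector, so a single
    server learns nothing about alpha. Naive linear-size stand-in for a DPF.
    """
    if not 0 <= alpha < size:
        raise ValueError("alpha out of range")
    share0 = [secrets.randbits(1) for _ in range(size)]
    share1 = list(share0)
    share1[alpha] ^= 1  # the shares differ only at the secret index
    return share0, share1


def server_answer(share, database):
    """What one server computes: XOR of the records selected by its share."""
    acc = 0
    for bit, record in zip(share, database):
        if bit:
            acc ^= record
    return acc


def private_lookup(alpha, database):
    """Client side: query both servers and combine the two answers."""
    share0, share1 = share_point_function(alpha, len(database))
    # Records selected by both shares cancel out; only database[alpha] remains.
    return server_answer(share0, database) ^ server_answer(share1, database)


if __name__ == "__main__":
    db = [11, 22, 33, 44]  # integer records, replicated on both servers
    print(private_lookup(2, db))
```

The lookup completes in a single request and response per server, which mirrors the single communication round mentioned in the abstract. The update protocol and the use of Somewhat Homomorphic Encryption are outside what this sketch covers.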
Accurate and Reliable Energy Measurement and Modelling of Data Transfer Between CPU and GPU in Parallel Applications on Heterogeneous Hybrid Platforms
IF 3.6 2区 计算机科学
IEEE Transactions on Computers Pub Date : 2024-11-21 DOI: 10.1109/TC.2024.3504262
Hafiz Adnan Niaz;Ravi Reddy Manumachu;Alexey Lastovetsky
Abstract: Developing energy-efficient software that leverages application-level energy optimization techniques is essential to tackle the pressing technological challenge of energy efficiency on modern heterogeneous computing platforms. While energy modelling and optimization of computations have received considerable attention in energy research, there remains a significant gap in the energy modelling of data transfer between computing devices on heterogeneous hybrid platforms. Our study aims to fill this crucial gap. In this work, we comprehensively study the energy consumption of data transfer between a host CPU and a GPU accelerator on heterogeneous hybrid platforms using the three mainstream energy measurement methods: (a) system-level physical measurements based on external power meters (ground truth), (b) measurements using on-chip power sensors, and (c) energy predictive models. The ground-truth method is accurate but prohibitively time-consuming. The on-chip sensors in Intel multicore CPUs are inaccurate, and the Nvidia GPU sensors do not capture data transfer activity. Therefore, we focus on the third approach and propose a novel methodology to select a small subset of performance events that effectively capture all the energy consumption activities during a data transfer and develop accurate linear energy predictive models employing the shortlisted performance events. Finally, we develop independent and accurate runtime pluggable software energy sensors based on our proposed energy predictive models that employ disjoint sets of performance events to estimate the dynamic energy of computations and data transfers. We employ the sensors to predict the energy consumption of computations and data transfer between a host CPU and two Nvidia A40 GPUs in three parallel scientific applications, and the high accuracy (average prediction error of 5%) of our sensors' predictions further underscores their practical relevance.
IEEE Transactions on Computers, vol. 74, no. 3, pp. 1011-1024. PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10761967
Citations: 0
Lock-Free Triangle Counting on GPU
IF 3.6 2区 计算机科学
IEEE Transactions on Computers Pub Date : 2024-11-21 DOI: 10.1109/TC.2024.3504295
Zhigao Zheng;Guojia Wan;Jiawei Jiang;Chuang Hu;Hao Liu;Shahid Mumtaz;Bo Du
Abstract: Finding the triangles of large-scale graphs is a fundamental graph mining task in many applications, such as motif detection, microscopic evolution, and link prediction. Recent work on triangle counting can be classified into merge-based and binary search-based paradigms. The merge-based triangle counting paradigm locates the triangles using the set intersection operation, which suffers from the random memory access problem. The binary search-based triangle counting paradigm sets the neighbors of the source vertex of an edge as the lookup array and searches the neighbors of the destination vertex. The binary search-based paradigm requires many expensive lock operations, which leads to low thread efficiency. In this paper, we aim to improve the triangle counting efficiency on GPU by designing a lock-free policy named Skiff to implement a hash-based triangle counting algorithm. In Skiff, we first design a hash trie data layout to meet the coalesced memory access model and then propose a lock-free policy to reduce the conflicts of the hash trie. In addition, we use a level array to manage the index of the hash trie so that the nodes of the hash trie can be located quickly. Furthermore, we implement a CTA thread organization model to reduce the load imbalance of real-world graphs. We conducted extensive experiments on NVIDIA GPUs to show the performance of Skiff. The results show that Skiff achieves better performance than state-of-the-art (SOTA) works.
IEEE Transactions on Computers, vol. 74, no. 3, pp. 1040-1052.
Citations: 0
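The two baseline paradigms described in the triangle counting abstract above can be sketched sequentially. The following is a plain CPU illustration in Python, assuming an undirected edge list and orienting each edge from the smaller to the larger vertex id so that every triangle is counted once. It does not reproduce Skiff's hash trie, its lock-free policy, or any GPU execution, and all function names are hypothetical.

```python
from bisect import bisect_left


def orient(edges):
    """Sorted adjacency lists that keep only edges u -> v with u < v."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        lo, hi = min(u, v), max(u, v)
        adj.setdefault(lo, set()).add(hi)
        adj.setdefault(hi, set())
    return {u: sorted(vs) for u, vs in adj.items()}


def count_merge(adj):
    """Merge-based: intersect the two sorted neighbor lists of each edge."""
    total = 0
    for u, nbrs_u in adj.items():
        for v in nbrs_u:
            nbrs_v = adj[v]
            i = j = 0
            while i < len(nbrs_u) and j < len(nbrs_v):
                if nbrs_u[i] == nbrs_v[j]:
                    total += 1
                    i += 1
                    j += 1
                elif nbrs_u[i] < nbrs_v[j]:
                    i += 1
                else:
                    j += 1
    return total


def count_binary_search(adj):
    """Binary search-based: the source's neighbors form the lookup array,
    and each neighbor of the destination is searched for in it."""
    total = 0
    for u, lookup in adj.items():
        for v in lookup:
            for w in adj[v]:
                pos = bisect_left(lookup, w)
                if pos < len(lookup) and lookup[pos] == w:
                    total += 1
    return total


if __name__ == "__main__":
    # Complete graph on 4 vertices plus one pendant edge.
    edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]
    adj = orient(edges)
    print(count_merge(adj), count_binary_search(adj))
```

Both functions return the same count. They differ only in how the common neighbors of an edge's endpoints are found, which is the distinction the abstract draws between the two paradigms.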