Parallel Computing — Latest Articles

LSAF: A load-balancing SpGEMM acceleration framework with dynamic package and static partition for multi-core systolic arrays
IF 2.1 | CAS Q4 | Computer Science
Parallel Computing Pub Date: 2026-03-01, Epub Date: 2026-01-31, DOI: 10.1016/j.parco.2026.103186
Yongxiang Cao, Hongxu Jiang, Guocheng Zhao, Dongcheng Shi, Runhua Zhang, Wei Wang
{"title":"LSAF: A load-balancing SpGEMM acceleration framework with dynamic package and static partition for multi-core systolic arrays","authors":"Yongxiang Cao,&nbsp;Hongxu Jiang,&nbsp;Guocheng Zhao,&nbsp;Dongcheng Shi,&nbsp;Runhua Zhang,&nbsp;Wei Wang","doi":"10.1016/j.parco.2026.103186","DOIUrl":"10.1016/j.parco.2026.103186","url":null,"abstract":"<div><div>Sparse generalized matrix multiplication (SpGEMM) has been widely applied to sparse neural network models. However, the arbitrary distribution of non-zero elements in sparse matrices leads to the processing elements (PEs) in the systolic array (SA) architecture being idle and further affecting computing efficiency. Reviewing existing methods, we found three main drawbacks to calculating SpGEMM in multi-core SAs. First, the sparse matrix calculation format is unsuitable for the SA architecture. Second, when the SA calculates SpGEMM, the load is unbalanced among PEs. Third, the computational load is unevenly distributed across different SA cores during the process above. To address the above problems, we proposed a load-balancing SpGEMM accelerating framework for multi-core SAs. First, we introduced the SCSR sparse matrix compression format and the PE fast sparse matrix matching and calculation method in SA. Second, we present a runtime dynamic data flow packaging algorithm, GrePack. Third, we propose a compile-time sparse data flow multi-core static partitioning algorithm. Compared with the state-of-the-art work, our dynamic packaging algorithm accelerates SpGEMM speed by up to 2.08<span><math><mo>×</mo></math></span>, our static multi-core partitioning method improves the computing unit utilization by up to 1.54<span><math><mo>×</mo></math></span>, and our collaborative inference framework improves SpGEMM speed by up to 2.29<span><math><mo>×</mo></math></span>.</div></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"127 ","pages":"Article 103186"},"PeriodicalIF":2.1,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146173631","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Microarchitectural comparison, in-core modeling, and memory hierarchy analysis of state-of-the-art CPUs: Grace, Sapphire Rapids, and Genoa
IF 2.1 | CAS Q4 | Computer Science
Parallel Computing Pub Date: 2026-03-01, Epub Date: 2026-01-29, DOI: 10.1016/j.parco.2026.103183
Jan Laukemann, Georg Hager, Gerhard Wellein
{"title":"Microarchitectural comparison, in-core modeling, and memory hierarchy analysis of state-of-the-art CPUs: Grace, Sapphire Rapids, and Genoa","authors":"Jan Laukemann ,&nbsp;Georg Hager ,&nbsp;Gerhard Wellein","doi":"10.1016/j.parco.2026.103183","DOIUrl":"10.1016/j.parco.2026.103183","url":null,"abstract":"<div><div>Three big semiconductor companies in HPC are currently competing in the race for the best CPU: AMD, Intel, and NVIDIA. There are significant differences among their state-of-the-art CPU designs, spanning the entire range from instruction execution to cache behavior and main memory bandwidth. In this work, we analyze the performance of CPUs based on the Zen 4, Golden Cove, and Neoverse V2 microarchitectures. We create accurate in-core performance models for use with the Open Source Architecture Code Analyzer (OSACA) tool and compare its prediction accuracy with llvm-mca. Beyond the tool aspect, this reveals interesting differences in in-core design points but also some commonalities. Beyond the single core, we extend our comparison by measuring data-transfer behavior through the memory hierarchy using a variety of microbenchmarks. We thoroughly investigate the “write-allocate (WA) evasion” feature, which can automatically reduce the memory traffic caused by write misses. We show that the Grace Superchip has a next-to-optimal implementation of WA evasion while the Sapphire Rapids CPU can avoid write allocates completely only in specific scenarios. The only way to eliminate WAs on AMD Genoa is the explicit use of non-temporal stores. Finally, we study the cache hierarchy of the CPUs in view of the Execution-Cache-Memory (ECM) performance model, revealing overlapping cache hierarchies on Genoa and Grace in contrast to Sapphire Rapids.</div></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"127 ","pages":"Article 103183"},"PeriodicalIF":2.1,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146173632","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Towards analysis and refinement of auto-tuning spaces
IF 2.1 | CAS Q4 | Computer Science
Parallel Computing Pub Date: 2026-03-01, Epub Date: 2026-01-17, DOI: 10.1016/j.parco.2026.103185
Jiří Filipovič, Suren Harutyunyan Gevorgyan, Eduardo César, Anna Sikora
{"title":"Towards analysis and refinement of auto-tuning spaces","authors":"Jiří Filipovič ,&nbsp;Suren Harutyunyan Gevorgyan ,&nbsp;Eduardo César ,&nbsp;Anna Sikora","doi":"10.1016/j.parco.2026.103185","DOIUrl":"10.1016/j.parco.2026.103185","url":null,"abstract":"<div><div>Source code-level auto-tuning enables applications to adapt their implementation to maintain peak performance under varying execution environments (i.<!--> <!-->e.hardware, input, or application settings). However, the performance of the auto-tuned code is inherently tied to the design of the tuning space (the space of possible changes to the code). An ideal tuning space must include configurations diverse enough to ensure high performance across all targeted environments while simultaneously eliminating redundant or inefficient regions that slow the tuning space search process. Traditional research has focused primarily on identifying optimization opportunities in the code and on efficient tuning space search. However, there is no rigorous methodology or tool supporting analysis and refinement of the tuning spaces, allowing for the addition of configurations that perform well in an unseen environment or the removal of configurations that perform poorly in any realistic environment.</div><div>In this short communication, we argue that hardware performance counters should be used to analyze tuning spaces, and that such an analysis would allow programmers to refine the tuning spaces by adding configurations that unlock additional performance in unseen environments and removing those unlikely to produce efficient code in any realistic environment. While our primary goal is to introduce this research question and foster discussion, we also present a preliminary methodology for tuning-space analysis. We validate our approach through a case study using a GPU implementation of an N-body simulation. Our results demonstrate that the proposed analysis can detect the weaknesses of a tuning space: based on its outcomes, we refined the tuning space, improving the average configuration performance <span><math><mrow><mn>3</mn><mo>.</mo><mn>3</mn><mo>×</mo></mrow></math></span>, and the best-performing configuration by 2–18<span><math><mtext>%</mtext></math></span>.</div></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"127 ","pages":"Article 103185"},"PeriodicalIF":2.1,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146022650","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Machine learning-driven fault-tolerant core mapping in Network-on-Chip architectures for advanced computing networks
IF 2.1 | CAS Q4 | Computer Science
Parallel Computing Pub Date: 2026-03-01, Epub Date: 2025-12-01, DOI: 10.1016/j.parco.2025.103167
Challa Muralikrishna Yadav, B. Naresh Kumar Reddy
{"title":"Machine learning-driven fault-tolerant core mapping in Network-on-Chip architectures for advanced computing networks","authors":"Challa Muralikrishna Yadav,&nbsp;B. Naresh Kumar Reddy","doi":"10.1016/j.parco.2025.103167","DOIUrl":"10.1016/j.parco.2025.103167","url":null,"abstract":"<div><div>The utilization of machine learning (ML) in architectural design shows great potential, especially in addressing the challenges posed by complex design spaces where traditional approaches may fall short. Network-on-chip (NoC) architecture has emerged as an efficient solution for on-chip communication among processors. However, with the increasing device scaling and component density, the likelihood of processor failures also rises, making fault-tolerant design a critical aspect of chip development to ensure system reliability. In this paper, we present a novel ML framework for fault-tolerant core mapping that effectively overcomes issues encountered in previous methodologies, such as re-transmission and re-mapping. The proposed framework intelligently learns optimal core mapping strategies and effectively addresses fault tolerance concerns in NoCs with diverse application core graphs. The approach begins with efficient NoC mapping and scheduling as the primary step. In the event of any faults during this process, an error detection and correction mechanism is applied within the NoC itself, eliminating the need for time-consuming re-transmissions. Furthermore, if faults persist even after error correction, the tasks assigned to the failed core are seamlessly migrated to a designated spare core, ensuring continuous system operation. Comparisons with conventional methods demonstrate considerable improvements in processor speed-up, energy efficiency, as well as reductions in re-transmission, latency, and dynamic power consumption. Hardware results indicate enhanced performance, reduced area, and lower power consumption compared to related algorithms when implemented on an FPGA board. The proposed technique showcases significant advancements in fault-tolerant core mapping for NoCs, thereby enhancing overall chip reliability and performance.</div></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"127 ","pages":"Article 103167"},"PeriodicalIF":2.1,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145665693","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
HRPF: A parallel programming framework for recursive algorithms on heterogeneous CPU–GPU systems
IF 2.1 | CAS Q4 | Computer Science
Parallel Computing Pub Date: 2026-03-01, Epub Date: 2026-02-06, DOI: 10.1016/j.parco.2026.103187
Yizhuo Wang, Bowen Liu, Senhao Shao, Jianhua Gao, Weixing Ji, Hongbo Xing
{"title":"HRPF: A parallel programming framework for recursive algorithms on heterogeneous CPU–GPU systems","authors":"Yizhuo Wang ,&nbsp;Bowen Liu ,&nbsp;Senhao Shao ,&nbsp;Jianhua Gao ,&nbsp;Weixing Ji ,&nbsp;Hongbo Xing","doi":"10.1016/j.parco.2026.103187","DOIUrl":"10.1016/j.parco.2026.103187","url":null,"abstract":"<div><div>Recursion, as a common programming paradigm, is widely applied in numerous applications. By treating recursive problems as tasks, the recursion process generates many independent subtasks, which reveals the potential for parallelism. To harness this parallelism on heterogeneous CPU–GPU systems, this paper introduces HRPF (Heterogeneous Recursive Parallel Programming Framework). HRPF provides a set of programming interfaces to define recursive algorithms, shielding users from the complexities of task allocation, scheduling, and data movement. This facilitates the efficient and straightforward implementation of parallel recursive programs on CPU–GPU systems. HRPF dispatches tasks between CPU and GPU workers by combining depth-first search (DFS) and breadth-first search (BFS) strategies. It adopts a hybrid work-stealing scheduling algorithm incorporating both work-first and help-first policies to achieve dynamic load balancing. The HRPF runtime system ensures data consistency between the host and the device and overlaps computation with data transfer. Additionally, HRPF provides a set of parallel loop programming interfaces. To evaluate HRPF, we implement several benchmarks including merge sort, quicksort, Strassen–Winograd matrix multiplication, and parallel loops in four commonly used algorithms. Experimental results on a CPU–GPU platform demonstrate that HRPF achieves superior performance across a range of benchmarks compared to OpenMP, StarPU and Taskflow.</div></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"127 ","pages":"Article 103187"},"PeriodicalIF":2.1,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146173630","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Cache partitioning for sparse matrix–vector multiplication on the A64FX
IF 2.1 | CAS Q4 | Computer Science
Parallel Computing Pub Date: 2026-03-01, Epub Date: 2025-12-04, DOI: 10.1016/j.parco.2025.103169
Sergej Breiter, James D. Trotter, Karl Fürlinger
{"title":"Cache partitioning for sparse matrix–vector multiplication on the A64FX","authors":"Sergej Breiter ,&nbsp;James D. Trotter ,&nbsp;Karl Fürlinger","doi":"10.1016/j.parco.2025.103169","DOIUrl":"10.1016/j.parco.2025.103169","url":null,"abstract":"<div><div>One of the novel features of the Fujitsu A64FX CPU is the <em>sector cache</em>. This feature enables hardware-supported partitioning of the L1 and L2 caches and allows the programmer control of which partition is used to place data in. This paper performs an in-depth study of applying the sector cache to sparse matrix-vector multiplication (SpMV) in the Compressed Sparse Row (CSR) format using a collection of 490 sparse matrices. A performance model based on reuse analysis is used to better understand situations in which and how the sector cache leads to improved cache reuse and to predict cache behavior. The model predicts the number of L2 cache misses within an error of 2% without cache partitioning. With sector cache enabled, depending on the configuration, the model predicts the number of L2 cache missed within 2–3% and 4–18% for sequential and parallel SpMV with 48 threads, respectively. Further experiments show the effect of various sector cache configurations on performance. A median speedup of about 1.05<span><math><mo>×</mo></math></span> is achieved, whereas the maximum speedup is about 1.6<span><math><mo>×</mo></math></span>.</div></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"127 ","pages":"Article 103169"},"PeriodicalIF":2.1,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145791306","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Butterfly factorization for vision transformers on multi-IPU systems
IF 2.1 | CAS Q4 | Computer Science
Parallel Computing Pub Date: 2026-03-01, Epub Date: 2025-11-27, DOI: 10.1016/j.parco.2025.103165
S.-Kazem Shekofteh, Daniel Bogacz, Christian Alles, Holger Fröning
{"title":"Butterfly factorization for vision transformers on multi-IPU systems","authors":"S.-Kazem Shekofteh,&nbsp;Daniel Bogacz,&nbsp;Christian Alles,&nbsp;Holger Fröning","doi":"10.1016/j.parco.2025.103165","DOIUrl":"10.1016/j.parco.2025.103165","url":null,"abstract":"<div><div>Recent advances in machine learning have led to increasingly large and complex models, placing significant demands on computation and memory. Techniques such as Butterfly factorization have emerged to reduce model parameters and memory footprints while preserving accuracy. Specialized hardware accelerators, such as Graphcore’s Intelligence Processing Units (IPUs), are designed to address these challenges through massive parallelism and efficient on-chip memory utilization. In this paper, we extend our analysis of Butterfly structures for efficient utilization on single and multiple IPUs, comparing their performance with GPUs. These structures drastically reduce the number of parameters and memory footprint while preserving model accuracy. Experimental results on the Graphcore GC200 IPU chip, compared with an NVIDIA A30 GPU, demonstrate a 98.5% compression ratio, with speedups of 1.6<span><math><mo>×</mo></math></span> and 1.3<span><math><mo>×</mo></math></span> for Butterfly and Pixelated Butterfly structures, respectively. Extending our evaluation to Vision Transformer (ViT) models, we compare Multi-GPU and Multi-IPU systems on the M2000 machine: Multi-GPU reaches a maximum accuracy of 84.51% with a training time of 401.44 min, whereas Multi-IPU attains a higher maximum accuracy of 88.92% with a training time of 694.03 min. These results demonstrate that Butterfly factorization enables substantial compression of ViT layers (up to 97.17%) while improving model accuracy. The findings highlight the promise of IPU machines as a suitable platform for large-scale machine learning model training, especially when coupled with sparsification methods like Butterfly factorization, thanks to their efficient support for model parallelism.</div></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"127 ","pages":"Article 103165"},"PeriodicalIF":2.1,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145738516","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Analysis of the impact of NUMA node configuration on the performance of offloading computations to GPUs
IF 2.1 | CAS Q4 | Computer Science
Parallel Computing Pub Date: 2026-03-01, Epub Date: 2025-12-19, DOI: 10.1016/j.parco.2025.103182
Sergey Malkovsky, Aleksei Sorokin, Sergey Korolev
{"title":"Analysis of the impact of NUMA node configuration on the performance of offloading computations to GPUs","authors":"Sergey Malkovsky,&nbsp;Aleksei Sorokin,&nbsp;Sergey Korolev","doi":"10.1016/j.parco.2025.103182","DOIUrl":"10.1016/j.parco.2025.103182","url":null,"abstract":"<div><div>The article presents the results of several studies to assess the impact of the configuration of NUMA (Non-Uniform Memory Access) nodes on the performance of GPU-accelerated applications in hybrid computing system with shared memory. Using Crossroads/N9 DGEMM (NVBLAS library) as a model application, the performance in various NUMA modes with one or more GPUs was analyzed, and the throughput of the memory subsystem and data transfer channels between the host memory and graphics processors was also measured. The impact of coprocessor distribution across NUMA nodes on the efficiency of the model application was also examined.</div><div>Results showed that configuration of NUMA nodes can have a significant impact on the performance of applications that offload calculations to graphics coprocessors in a hybrid computing system with shared memory, and this impact could have an effect in different ways. For example, using one NUMA node for the entire computing system is the least optimal approach in terms of memory bandwidth, but it provided the highest bandwidth for communication between host memory and coprocessors during active data transfer to several accelerators. Thus, this mode achieves maximum performance when performing calculations on multiple GPUs that actively exchange data through host memory. Other modes showed advantages in different situations. Overall, to achieve maximum performance during active data transfer to coprocessors, they should be part of one NUMA node. These results will help to develop approaches to configuration of hybrid computing systems on processors with a chiplet layout, and help to improve the performance of software that offloads calculations to graphics accelerators with the Ampere architecture, such as NVIDIA A800 and NVIDIA A100, which are currently widely represented in the high-performance computing industry.</div></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"127 ","pages":"Article 103182"},"PeriodicalIF":2.1,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145841285","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
PROAD: Boosting Caffe Training via improving LevelDB I/O performance with Parallel Read, Out-of-Order Optimization, and Adaptive Design
IF 2.1 | CAS Q4 | Computer Science
Parallel Computing Pub Date: 2026-03-01, Epub Date: 2025-12-05, DOI: 10.1016/j.parco.2025.103181
Yubiao Pan, Ailing Tian, Huizhen Zhang
{"title":"PROAD: Boosting Caffe Training via improving LevelDB I/O performance with Parallel Read, Out-of-Order Optimization, and Adaptive Design","authors":"Yubiao Pan,&nbsp;Ailing Tian,&nbsp;Huizhen Zhang","doi":"10.1016/j.parco.2025.103181","DOIUrl":"10.1016/j.parco.2025.103181","url":null,"abstract":"<div><div>Caffe, one of the most popular deep learning frameworks, trains models by reading training data from the storage engine, LevelDB, and feeding it into the computation engine. This paper analyzes the challenges faced by data reading in Caffe Training: (1) Fetch, Parse, and Transform—the three steps of reading each image—are serial, and each image is read sequentially; (2) Frequent disk I/O—each image read triggers an I/O operation—significantly increases the data reading time; (3) Caffe calls LevelDB’s range query method to read training data, but this leads to unnecessary pointer comparison operations, wasting CPU resources; (4) Since LevelDB reads training data in key order during range queries, the fixed order of training data across epochs may cause overfitting and lower the model’s test accuracy.</div><div>Based on these challenges, this paper proposes Parallel Read, Out-of-Order Optimization, and Adaptive Design strategies to design a new I/O layer, PROAD, for Caffe that systematically reconstructs and optimizes LevelDB’s original data reading mechanism, thus improving LevelDB I/O performance for Caffe Training. The Parallel Read method pipelines the Fetch, Parse, and Transform steps and accelerates reading via large block reads; Out-of-Order Optimization discards the range scan feature of LevelDB, allowing Caffe to read training data in a random manner during training, avoiding the original key comparison overhead and providing a boost to model accuracy; while the Adaptive Design method supports efficient reading of training data with different resolutions. Based on these designs, this paper implements PROAD and deploys it in Caffe for performance evaluation. Experimental results show that Caffe with PROAD significantly improves data reading performance during training, especially for high-resolution datasets, where data reading time in Caffe with PROAD is reduced by 14%–42% compared to Caffe with LevelDB and 6%–34% compared to Caffe with LMDB. Furthermore, Caffe with PROAD improves model test accuracy due to the Out-of-Order Optimization strategy, while consuming relatively reasonable memory resources.</div></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"127 ","pages":"Article 103181"},"PeriodicalIF":2.1,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145738517","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Benchmark of classical disk array and software-defined storage on near-identical hardware
IF 2.1 | CAS Q4 | Computer Science
Parallel Computing Pub Date: 2026-03-01, Epub Date: 2025-12-03, DOI: 10.1016/j.parco.2025.103166
Tomas Vondra, David Sebek
{"title":"Benchmark of classical disk array and software-defined storage on near-identical hardware","authors":"Tomas Vondra,&nbsp;David Sebek","doi":"10.1016/j.parco.2025.103166","DOIUrl":"10.1016/j.parco.2025.103166","url":null,"abstract":"<div><div>This article presents a comparative analysis of two storage approaches: a SAN disk array, exemplified by an HPE 3PAR device, and a software-defined storage cluster constructed with the Ceph software. The objective of this comparison is to ascertain whether a software-defined storage cluster built with commodity servers can achieve comparable performance to a SAN disk array with a similar hardware configuration. The configuration used identical numbers of components of matching speeds, capacities, and hardware generation from the same manufacturer. By relaxing some requirements on the software-defined storage, we were able to benchmark all RAID levels with corresponding replication and erasure code settings. The results revealed that 3PAR performed 31 times better for 4 KiB data block writes than Ceph. On the contrary, the Ceph cluster surpassed 3PAR by a factor of 1.4 in 16 MiB large-block reads. The differences are explained in the text based on the theory of operation of the two types of storage. We propose criteria for choosing the correct type of technology for individual use cases.</div></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"127 ","pages":"Article 103166"},"PeriodicalIF":2.1,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145738515","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0