ACM Transactions on Mathematical Software: Latest Articles, Page 3

Cache Optimization and Performance Modeling of Batched, Small, and Rectangular Matrix Multiplication on Intel, AMD, and Fujitsu Processors

Zone 1 (Mathematics)

ACM Transactions on Mathematical Software Pub Date : 2023-09-19 DOI: 10.1145/3595178

Sameer Deshmukh, Rio Yokota, George Bosilca

Abstract: Factorization and multiplication of dense matrices and tensors are critical, yet extremely expensive pieces of the scientific toolbox. Careful use of low rank approximation can drastically reduce the computation and memory requirements of these operations. In addition to a lower arithmetic complexity, such methods can, by their structure, be designed to efficiently exploit modern hardware architectures. The majority of existing work relies on batched BLAS libraries to handle the computation of many small dense matrices. We show that through careful analysis of the cache utilization, register accumulation using SIMD registers and a redesign of the implementation, one can achieve significantly higher throughput for these types of batched low-rank matrices across a large range of block and batch sizes. We test our algorithm on three CPUs using diverse ISAs – the Fujitsu A64FX using ARM SVE, the Intel Xeon 6148 using AVX-512, and AMD EPYC 7502 using AVX-2, and show that our new batching methodology is able to obtain more than twice the throughput of vendor optimized libraries for all CPU architectures and problem sizes.

Citations: 0

New subspace method for unconstrained derivative-free optimization

IF 2.7, Zone 1 (Mathematics)

ACM Transactions on Mathematical Software Pub Date : 2023-09-02 DOI: 10.1145/3618297

M. Kimiaei, A. Neumaier, Parvaneh Faramarzi

Citations: 1

IEEE-754 precision-p base-β arithmetic implemented in binary

IF 2.7, Zone 1 (Mathematics)

ACM Transactions on Mathematical Software Pub Date : 2023-08-21 DOI: 10.1145/3596218

S. Rump

Abstract: We show how an IEEE-754 conformant precision-p base-β arithmetic can be implemented based on some binary floating-point and/or integer arithmetic. This includes the four basic operations and square root subject to the five IEEE-754 rounding modes, namely the nearest roundings with roundTiesToEven and roundTiesToAway, the directed roundings downwards and upwards, as well as rounding towards zero. Exceptional values like ∞ or NaN are covered according to the IEEE-754 arithmetic standard. The results of the precision-p base-β operations are computed using some underlying precision-q binary arithmetic. We distinguish two cases. When using a precision-q binary integer arithmetic, the base-β precision p is limited for all operations by β^(2p) ≤ 2^q, whereas using a precision-q binary floating-point arithmetic imposes stronger limits on the base-β precision, namely β^(2p) ≤ 2^q for addition and multiplication, β^(2p) ≤ 2^(q−1) for division and β^(2p) ≤ 2^(q−3) for the square root. Those limitations cannot be improved. The algorithms are implemented in a Matlab/Octave flbeta-toolbox with the choice of using uint64 or binary64 as underlying arithmetic. The former allows larger precisions, the latter is advantageous for the square root, whereas computing times are similar. The flbeta-toolbox offers precision-p base-β scalar, vector and matrix operations including sparse matrices as well as corresponding interval operations. The base β can be chosen in the range β ∈ [2, 64]. The flbeta-toolbox will be part of Version 13 of INTLAB [18], the Matlab/Octave toolbox for reliable computing.

Citations: 1
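
As a rough illustration of the precision limits quoted in the abstract of the entry above, the following Python sketch computes the largest base-β precision p that satisfies a bound of the form β^(2p) ≤ 2^(q−k). It is not taken from the flbeta-toolbox. The function name `max_precision`, the parameter `slack` (the k above), and the values q = 64 for uint64 and q = 53 for binary64 are assumptions made here, as is the reading of the flattened exponents in the listing as 2^q, 2^(q−1), and 2^(q−3).

```python
def max_precision(beta, q, slack=0):
    """Return the largest integer p with beta**(2*p) <= 2**(q - slack).

    Hypothetical helper: exact integer arithmetic is used throughout, so
    the comparison itself is free of rounding error.
    """
    if beta < 2:
        raise ValueError("base must be at least 2")
    if slack < 0 or slack > q:
        raise ValueError("slack must lie in [0, q]")
    limit = 1 << (q - slack)
    p = 0
    while beta ** (2 * (p + 1)) <= limit:
        p += 1
    return p


if __name__ == "__main__":
    # Assumed bit widths: 64 for uint64, 53 significand bits for binary64.
    for beta in (2, 10, 64):
        print(
            "beta =", beta,
            "| integer (q=64):", max_precision(beta, 64),
            "| float add/mul (q=53):", max_precision(beta, 53),
            "| float div:", max_precision(beta, 53, slack=1),
            "| float sqrt:", max_precision(beta, 53, slack=3),
        )
```

Under these assumptions, base 10 allows p = 9 with the integer backend and p = 7 with the floating-point backend, which is consistent with the abstract's remark that the integer arithmetic permits larger precisions.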

Algorithm xxxx: KCC: A MATLAB Package for K-means-based Consensus Clustering

IF 2.7, Zone 1 (Mathematics)

ACM Transactions on Mathematical Software Pub Date : 2023-08-15 DOI: 10.1145/3616011

Hao Lin, Hongfu Liu, Junjie Wu, Hong Li, Stephan Günnemann

Citations: 1

Sparse Approximate Multifrontal Factorization with Composite Compression Methods

IF 2.7, Zone 1 (Mathematics)

ACM Transactions on Mathematical Software Pub Date : 2023-08-01 DOI: 10.1145/3611662

Lisa Claus, P. Ghysels, Yang Liu, T. Nhan, R. Thirumalaisamy, A. Bhalla, Sherry Li

Abstract: This article presents a fast and approximate multifrontal solver for large sparse linear systems. In a recent work by Liu et al., we showed the efficiency of a multifrontal solver leveraging the butterfly algorithm and its hierarchical matrix extension, HODBF (hierarchical off-diagonal butterfly) compression to compress large frontal matrices. The resulting multifrontal solver can attain quasi-linear computation and memory complexity when applied to sparse linear systems arising from spatial discretization of high-frequency wave equations. To further reduce the overall number of operations and especially the factorization memory usage to scale to larger problem sizes, in this article we develop a composite multifrontal solver that employs the HODBF format for large-sized fronts, a reduced-memory version of the nonhierarchical block low-rank format for medium-sized fronts, and a lossy compression format for small-sized fronts. This allows us to solve sparse linear systems of dimension up to 2.7× larger than before and leads to a memory consumption that is reduced by 70% while ensuring the same execution time. The code is made publicly available on GitHub.

Citations: 0

emgr – EMpirical GRamian Framework Version 5.99

IF 2.7, Zone 1 (Mathematics)

ACM Transactions on Mathematical Software Pub Date : 2023-07-20 DOI: 10.1145/3609860

Christian Himpe

Citations: 0

IFISS3D: A computational laboratory for investigating finite element approximation in three dimensions

IF 2.7, Zone 1 (Mathematics)

ACM Transactions on Mathematical Software Pub Date : 2023-06-20 DOI: 10.1145/3604934

Georgios Papanikos, Catherine E. Powell, David J. Silvester

Citations: 0

Approximating inverse cumulative distribution functions to produce approximate random variables

IF 2.7, Zone 1 (Mathematics)

ACM Transactions on Mathematical Software Pub Date : 2023-06-17 DOI: 10.1145/3604935

Michael Giles, Oliver Sheridan-Methven

Abstract: For random variables produced through the inverse transform method, approximate random variables are introduced, which are produced using approximations to a distribution’s inverse cumulative distribution function. These approximations are designed to be computationally inexpensive, and much cheaper than library functions which are exact to within machine precision, and thus highly suitable for use in Monte Carlo simulations. The approximation errors they introduce can then be eliminated through use of the multilevel Monte Carlo method. Two approximations are presented for the Gaussian distribution: a piecewise constant on equally spaced intervals, and a piecewise linear using geometrically decaying intervals. The errors of the approximations are bounded and the convergence demonstrated, and the computational savings measured for C and C++ implementations. Implementations tailored for Intel and Arm hardware are inspected, alongside hardware agnostic implementations built using OpenMP. The savings are incorporated into a nested multilevel Monte Carlo framework with the Euler-Maruyama scheme to exploit the speed ups without losing accuracy, offering speed ups by a factor of 5–7. These ideas are empirically extended to the Milstein scheme, and the non-central χ² distribution for the Cox-Ingersoll-Ross process, offering speed ups of a factor of 250 or more.

Citations: 0
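
The piecewise-constant approximation on equally spaced intervals mentioned in the abstract of the entry above can be sketched in a few lines of Python. This is only an illustration of the general idea, not the authors' implementation: the names `build_table` and `approx_inv_cdf` are hypothetical, the table size of 1024 is arbitrary, and tabulating the exact inverse CDF at interval midpoints is an assumption made here (the paper may choose the constant on each interval differently).

```python
from statistics import NormalDist


def build_table(n):
    """Tabulate the Gaussian inverse CDF at the midpoints of n equal
    subintervals of (0, 1). Midpoint sampling is an assumption."""
    inv_cdf = NormalDist().inv_cdf
    return [inv_cdf((i + 0.5) / n) for i in range(n)]


def approx_inv_cdf(u, table):
    """Piecewise-constant lookup for u in [0, 1]: one multiplication,
    one truncation, and one table read."""
    n = len(table)
    index = min(int(u * n), n - 1)  # clamp so that u == 1.0 stays in range
    return table[index]


if __name__ == "__main__":
    table = build_table(1024)
    exact = NormalDist().inv_cdf
    for u in (0.1, 0.5, 0.9):
        print(u, approx_inv_cdf(u, table), exact(u))
```

The lookup replaces an expensive library call with an array read, at the cost of an approximation error that is largest in the tails, where the inverse CDF is steepest. The abstract describes removing that error with the multilevel Monte Carlo method, which this sketch does not attempt.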

CPFloat: A C Library for Simulating Low-precision Arithmetic

IF 2.7, Zone 1 (Mathematics)

ACM Transactions on Mathematical Software Pub Date : 2023-06-17 DOI: 10.1145/3585515

Massimiliano Fasi, Mantas Mikaitis

Abstract: One can simulate low-precision floating-point arithmetic via software by executing each arithmetic operation in hardware and then rounding the result to the desired number of significant bits. For IEEE-compliant formats, rounding requires only standard mathematical library functions, but handling subnormals, underflow, and overflow demands special attention, and numerical errors can cause mathematically correct formulae to behave incorrectly in finite arithmetic. Moreover, the ensuing implementations are not necessarily efficient, as the library functions these techniques build upon are typically designed to handle a broad range of cases and may not be optimized for the specific needs of rounding algorithms. CPFloat is a C library for simulating low-precision arithmetics. It offers efficient routines for rounding, performing mathematical computations, and querying properties of the simulated low-precision format. The software exploits the bit-level floating-point representation of the format in which the numbers are stored and replaces costly library calls with low-level bit manipulations and integer arithmetic. In numerical experiments, the new techniques bring a considerable speedup (typically one order of magnitude or more) over existing alternatives in C, C++, and MATLAB. To our knowledge, CPFloat is currently the most efficient and complete library for experimenting with custom low-precision floating-point arithmetic.

Citations: 0
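
To make the idea of rounding through bit manipulation and integer arithmetic more concrete, here is a minimal Python sketch that rounds a binary64 value to t significant bits with round-to-nearest, ties-to-even. It is not CPFloat's code or API: the function `round_to_precision` is hypothetical, it keeps the binary64 exponent range, and it makes no attempt to model the target format's subnormals, underflow, or overflow, which the abstract of the entry above singles out as needing special attention.

```python
import math
import struct


def round_to_precision(x, t):
    """Round a binary64 value to t significant bits (1 <= t <= 53),
    ties to even, using only integer operations on the bit pattern.

    Simplified sketch: intended for normal binary64 inputs; the
    exponent range of the target format is not modeled.
    """
    if not 1 <= t <= 53:
        raise ValueError("t must lie in [1, 53]")
    if t == 53 or not math.isfinite(x):
        return x
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    drop = 53 - t                   # trailing significand bits to discard
    half = 1 << (drop - 1)          # half a unit in the last retained place
    lsb = (bits >> drop) & 1        # parity of the last retained bit
    bits += half - 1 + lsb          # a carry may ripple into the exponent field
    bits &= ~((1 << drop) - 1)      # clear the discarded bits
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


if __name__ == "__main__":
    # 11 significant bits is the precision of IEEE binary16, 24 of binary32.
    print(round_to_precision(1.0 + 2.0 ** -11, 11))
    print(round_to_precision(0.1, 24))
```

Adding `half - 1 + lsb` before masking implements ties-to-even without branches. Because the significand sits directly below the exponent in the bit pattern, a carry out of the significand correctly increments the exponent.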

Task-based Parallel Programming for Scalable Matrix Product Algorithms

IF 2.7, Zone 1 (Mathematics)

ACM Transactions on Mathematical Software Pub Date : 2023-06-15 DOI: 10.1145/3583560

Emmanuel Agullo, Alfredo Buttari, Abdou Guermouche, Julien Herrmann, Antoine Jego

Abstract: Task-based programming models have succeeded in gaining the interest of the high-performance mathematical software community because they relieve part of the burden of developing and implementing distributed-memory parallel algorithms in an efficient and portable way. In increasingly larger, more heterogeneous clusters of computers, these models appear as a way to maintain and enhance more complex algorithms. However, task-based programming models lack the flexibility and the features that are necessary to express in an elegant and compact way scalable algorithms that rely on advanced communication patterns. We show that the Sequential Task Flow paradigm can be extended to write compact yet efficient and scalable routines for linear algebra computations. Although this work focuses on dense General Matrix Multiplication, the proposed features enable the implementation of more complex algorithms. We describe the implementation of these features and of the resulting GEMM operation. Finally, we present an experimental analysis on two homogeneous supercomputers showing that our approach is competitive up to 32,768 CPU cores with state-of-the-art libraries and may outperform them for some problem dimensions. Although our code can use GPUs straightforwardly, we do not deal with this case because it implies other issues which are out of the scope of this work.

Citations: 0