
Classical bounds on two-outcome bipartite Bell expressions and linear prepare-and-measure witnesses: Efficient computation in parallel environments such as graphics processing units

IF 3.4 · Zone 2 (Physics and Astrophysics) · JCR Q1 (COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS)

Computer Physics Communications Pub Date : 2025-08-11 DOI:10.1016/j.cpc.2025.109809

István Márton, Erika Bene, Péter Diviánszky, Gábor Drótos

Computer Physics Communications, Volume 316, Article 109809. Full text: https://www.sciencedirect.com/science/article/pii/S001046552500311X


Abstract

The presented program aims at speeding up the brute force computation of the so-called $L_d$ norm of a matrix $M$ using graphics processing units (GPUs). Alternatives for CPUs have also been implemented, and the algorithm is applicable to any parallel environment. The $n \times m$ matrix $M$ has real elements which may represent coefficients of a bipartite Bell expression or those of a linear prepare-and-measure (PM) witness. In this interpretation, the $L_1$ norm is the local bound of the given correlation-type Bell expression, and the $L_d$ norm for $d \geq 2$ is the classical $d$-dimensional bound of the given PM witness, which is associated with the communication of $d$-level classical messages in the PM scenario. The program is also capable of calculating the local bound of Bell expressions including marginals. In all scenarios, the output is assumed to be binary.
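For orientation, the two quantities can be written out explicitly. The expressions below are what the scenario described above implies for deterministic strategies with binary outputs $\pm 1$. The symbols $a_i$, $b_j$ and the message assignment $\mu$ are notation assumed here and are not taken from the abstract. The $d \geq 2$ formula in particular is a reading of the PM scenario and should be checked against the paper's own definition.

```latex
% Local bound of a correlation-type Bell expression (d = 1):
L_1(M) = \max_{a \in \{\pm 1\}^n,\; b \in \{\pm 1\}^m}
         \sum_{i=1}^{n} \sum_{j=1}^{m} M_{ij}\, a_i\, b_j
       = \max_{a \in \{\pm 1\}^n}
         \sum_{j=1}^{m} \left| \sum_{i=1}^{n} a_i M_{ij} \right| .

% Classical bound with d-level messages (d >= 2); \mu assigns a message to each row:
L_d(M) = \max_{\mu : \{1,\dots,n\} \to \{1,\dots,d\}}
         \sum_{k=1}^{d} \sum_{j=1}^{m}
         \left| \sum_{i \,:\, \mu(i) = k} M_{ij} \right| .
```

In both cases the optimization over the receiving party's outputs is carried out in closed form by the absolute value, so the exhaustive search only has to run over the rows.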

The code for GPUs is written in CUDA C and can utilize one NVIDIA GPU in a computer. To illustrate the performance of our implementation, we refer to Brierley et al. [1] who needed approximately three weeks to compute the local bound on a Bell expression defined by a $42 \times 42$ matrix on a standard desktop using a single CPU core. In contrast, our efficient implementation of the brute force algorithm allows us to reduce this to three minutes using a single NVIDIA RTX 6000 Ada graphics card on a workstation. For CPUs, the algorithm was implemented with OpenMP and MPI according to the shared and distributed memory models, respectively, and achieves a comparable speedup at a number of CPU cores around 100.
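The reason the search parallelizes so readily is that every candidate can be evaluated independently, and the results are combined with a single maximum. The following is a minimal shared-memory sketch of that idea for the $d = 1$ case. It is not the authors' `L_OpenMP.c`: the function names, the bit encoding of the row signs, and the row-major `int32_t` layout are all assumptions made for illustration, and each candidate is evaluated naively.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Value of one candidate sign vector a, encoded in `bits`
   (bit i set means a_i = -1): sum over columns of |sum_i a_i * M[i][j]|. */
static int64_t candidate_value(const int32_t *M, int n, int m, uint64_t bits)
{
    int64_t total = 0;
    for (int j = 0; j < m; ++j) {
        int64_t col = 0;
        for (int i = 0; i < n; ++i) {
            int64_t e = M[(size_t)i * m + j];
            col += ((bits >> i) & 1) ? -e : e;
        }
        total += llabs(col);
    }
    return total;
}

/* Brute-force L_1 norm of a row-major n x m matrix (hypothetical helper).
   a and -a give the same value, so a_{n-1} is fixed to +1 and only
   2^(n-1) candidates are visited. */
int64_t l1_norm_parallel(const int32_t *M, int n, int m)
{
    assert(n >= 1 && n < 63 && m >= 1);
    const int64_t count = INT64_C(1) << (n - 1);
    int64_t best = 0;

    /* Candidates are independent; each thread keeps a private maximum
       and OpenMP combines them at the end of the loop. */
    #pragma omp parallel for reduction(max : best)
    for (int64_t bits = 0; bits < count; ++bits) {
        int64_t v = candidate_value(M, n, m, (uint64_t)bits);
        if (v > best) best = v;
    }
    return best;
}
```

Compiled without OpenMP support the pragma is ignored and the loop simply runs serially, which makes the sketch easy to check on small matrices such as the CHSH matrix `{1, 1, 1, -1}`.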

Program summary

Program Title: L_CUDA.cu, L_MPI.c, L_OpenMP.c

CPC Library link to program files: https://doi.org/10.17632/scfjjt9svm.1

Developer's repository link: https://github.com/istvanmarton/L-norms_BruteForce

Licensing provisions: GPLv3

Programming language: C, CUDA, OpenMP, MPI

Nature of problem: The computational demand of determining the $L_d$ norm of a matrix of real coefficients is high; so far, exact $L_d$ norms have been computed only for relatively small matrices. Any exact algorithm appears to scale exponentially with the number of rows (or, for $d = 1$, with the smaller of the number of rows and the number of columns). On top of that, the naive brute-force computation of an $L_d$ norm carries a contribution that scales linearly with the product of the number of rows and columns, because every matrix element needs to be accessed repeatedly. Improving efficiency is thus both desirable and possibly feasible.
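To make the scaling concrete, here is a rough operation count for the $d = 1$ case. It assumes that the naive search visits each sign assignment of the rows once, up to the global sign symmetry, and recomputes every column sum from scratch. The symbols $N_{\text{cand}}$ and $C_{\text{naive}}$ are introduced here for illustration, and constant factors are ignored.

```latex
N_{\text{cand}} = 2^{\,n-1}, \qquad
C_{\text{naive}} \sim 2^{\,n-1} \cdot n \cdot m .

% For n = m = 42:
2^{41} \approx 2.2 \times 10^{12}, \qquad
C_{\text{naive}} \sim 2^{41} \cdot 42 \cdot 42 \approx 3.9 \times 10^{15} .
```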

Solution method: We attenuate this secondary contribution by accessing elements from only a single row of the matrix in each step, so that memory access scales linearly with the number of columns. We do perform a search for the relevant row, and the number of operations in that search scales linearly with the number of rows; however, the search precedes the access to the matrix elements, so the two costs add rather than multiply in the time complexity. We also identify further important mathematical shortcuts. Moreover, the problem is very well suited to parallelization, which we take advantage of: in particular, we provide an implementation for graphics processing units in addition to one for general-purpose processors.
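One standard way to touch only a single row per step is to visit the candidates in Gray-code order, so that consecutive candidates differ in the sign of exactly one row and the column sums can be updated incrementally. The summary does not state which enumeration order the program uses, so the Gray-code choice below is an assumption, as are the function name and the row-major `int32_t` layout. The sketch covers the $d = 1$ case only.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* L_1 norm of a row-major n x m integer matrix, visiting sign vectors in
   Gray-code order so that each step flips the sign of exactly one row.
   Returns -1 if memory allocation fails. */
int64_t l1_norm_gray(const int32_t *M, int n, int m)
{
    assert(n >= 1 && n < 63 && m >= 1);
    int64_t *col = calloc((size_t)m, sizeof *col);  /* running column sums */
    int *sign = malloc((size_t)n * sizeof *sign);   /* current a_i = +1 or -1 */
    if (!col || !sign) { free(col); free(sign); return -1; }

    /* Start from a = (+1, ..., +1): the only full pass over the matrix. */
    for (int i = 0; i < n; ++i) {
        sign[i] = 1;
        for (int j = 0; j < m; ++j) col[j] += M[(size_t)i * m + j];
    }
    int64_t best = 0;
    for (int j = 0; j < m; ++j) best += llabs(col[j]);

    /* a and -a give the same value, so row n-1 is never flipped. */
    const uint64_t steps = UINT64_C(1) << (n - 1);
    for (uint64_t step = 1; step < steps; ++step) {
        /* Search for the row to flip: the lowest set bit of the step counter.
           This costs at most O(n) and happens before any matrix access. */
        int row = 0;
        while (((step >> row) & 1) == 0) ++row;

        /* Flipping a_row changes each column sum by -2 * a_row * M[row][j]. */
        const int32_t *r = M + (size_t)row * m;
        const int64_t delta = -2 * sign[row];
        sign[row] = -sign[row];

        int64_t total = 0;
        for (int j = 0; j < m; ++j) {   /* O(m): one row of the matrix */
            col[j] += delta * r[j];
            total += llabs(col[j]);
        }
        if (total > best) best = total;
    }
    free(col);
    free(sign);
    return best;
}
```

Per step, the work is then of order $n + m$ instead of $n \cdot m$. For the $42 \times 42$ example from the abstract that is roughly 84 against 1764 elementary operations per candidate, before any of the further shortcuts or parallelization.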

Additional comments including restrictions and unusual features: The entries of the input matrix M must be provided as integers (i.e., conversion from floating-point entries is necessary). The program can utilize a single GPU on a computer.
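Because the norm is positively homogeneous, $L_d(cM) = c\,L_d(M)$ for $c > 0$, a floating-point matrix can be scaled by a common factor, rounded to integers, and the result divided by the same factor afterwards. The helper below is a minimal sketch of that preprocessing step. Its name, the `int32_t` target type, and the choice of scale factor are assumptions, and it is not part of the published program.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Multiply each entry by `scale` and round to the nearest integer
   (halves are rounded away from zero).
   Returns 0 on success, -1 if a scaled entry does not fit in int32_t. */
int scale_to_int32(const double *in, int32_t *out, size_t count, double scale)
{
    assert(scale > 0.0);
    for (size_t k = 0; k < count; ++k) {
        double v = in[k] * scale;
        v += (v >= 0.0) ? 0.5 : -0.5;
        /* Written as a negated range test so that NaN is rejected too. */
        if (!(v < 2147483648.0 && v > -2147483649.0)) return -1;
        out[k] = (int32_t)v;   /* truncation toward zero completes the rounding */
    }
    return 0;
}
```

The conversion is exact only when every scaled entry is already an integer, for example entries with two decimal places and `scale = 100.0`. Otherwise the computed norm belongs to the rounded matrix, and the rounding error should be assessed separately.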


Source journal

Computer Physics Communications (Physics; Computer Science, Interdisciplinary Applications)

CiteScore: 12.10

Self-citation rate: 3.20%

Articles published: 287

Review time: 5.3 months

About the journal: The focus of CPC is on contemporary computational methods and techniques and their implementation, the effectiveness of which will normally be evidenced by the author(s) within the context of a substantive problem in physics. Within this setting CPC publishes two types of paper.

Computer Programs in Physics (CPiP): These papers describe significant computer programs to be archived in the CPC Program Library, which is held in the Mendeley Data repository. The submitted software must be covered by an approved open source licence. Papers and associated computer programs that address a problem of contemporary interest in physics that cannot be solved by current software are particularly encouraged.

Computational Physics Papers (CP): These are research papers in, but not limited to, the following themes across computational physics and related disciplines: mathematical and numerical methods and algorithms; computational models, including those associated with the design, control and analysis of experiments; and algebraic computation. Each will normally include software implementation and performance details. The software implementation should, ideally, be available via GitHub, Zenodo or an institutional repository. In addition, research papers on the impact of advanced computer architecture and special purpose computers on computing in the physical sciences, and on software topics related to, and of importance in, the physical sciences, may be considered.