Classical bounds on two-outcome bipartite Bell expressions and linear prepare-and-measure witnesses: Efficient computation in parallel environments such as graphics processing units
Impact factor: 3.4 | Zone 2 (Physics and Astrophysics) | JCR Q1 (Computer Science, Interdisciplinary Applications)
István Márton, Erika Bene, Péter Diviánszky, Gábor Drótos
Computer Physics Communications, Volume 316, Article 109809. Published 2025-08-11. DOI: 10.1016/j.cpc.2025.109809
https://www.sciencedirect.com/science/article/pii/S001046552500311X
Abstract
The presented program aims to speed up the brute-force computation of the so-called $L_d$ norm of a matrix $M$ using graphics processing units (GPUs). Alternatives for CPUs have also been implemented, and the algorithm is applicable to any parallel environment. The $n \times m$ matrix $M$ has real elements, which may represent coefficients of a bipartite Bell expression or those of a linear prepare-and-measure (PM) witness. In this interpretation, the $L_1$ norm is the local bound of the given correlation-type Bell expression, and the $L_d$ norm for $d \ge 2$ is the classical $d$-dimensional bound of the given PM witness, which is associated with the communication of $d$-level classical messages in the PM scenario. The program is also capable of calculating the local bound of Bell expressions including marginals. In all scenarios, the output is assumed to be binary.
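For orientation, the local bound of a correlation-type Bell expression with coefficient matrix $M$ is conventionally written as a maximum over deterministic $\pm 1$ assignments. The display below is that standard form, not a formula quoted from the paper; the symbols $a_i$ and $b_j$ are introduced here for illustration only. Carrying out the maximisation over $b$ in closed form leaves a search over the sign vectors $a$ alone:

```latex
\begin{aligned}
L_1(M) &= \max_{a \in \{\pm 1\}^n,\; b \in \{\pm 1\}^m} \sum_{i=1}^{n} \sum_{j=1}^{m} a_i M_{ij} b_j \\
       &= \max_{a \in \{\pm 1\}^n} \sum_{j=1}^{m} \left| \sum_{i=1}^{n} a_i M_{ij} \right|
\end{aligned}
```

The second line is unchanged under $a \to -a$, so one sign can be fixed, which leaves $2^{n-1}$ candidates. Because $L_1(M) = L_1(M^{\mathsf T})$, the search can also be run over the smaller of the two dimensions.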
The code for GPUs is written in CUDA C and can utilize one NVIDIA GPU in a computer. To illustrate the performance of our implementation, we refer to Brierley et al. [1], who needed approximately three weeks to compute the local bound on a Bell expression defined by a $42 \times 42$ matrix on a standard desktop using a single CPU core. In contrast, our efficient implementation of the brute-force algorithm allows us to reduce this to three minutes using a single NVIDIA RTX 6000 Ada graphics card on a workstation. For CPUs, the algorithm was implemented with OpenMP and MPI according to the shared and distributed memory models, respectively, and achieves a comparable speedup with around 100 CPU cores.
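As a rough check of the figures quoted above, the arithmetic can be written out. The speedup line uses only the two quoted run times. The candidate count is an assumption: it takes a plain enumeration of $2^{n-1}$ sign vectors for $n = 42$ and ignores any further shortcuts the program may apply, so the resulting rate is an order-of-magnitude estimate only.

```latex
\begin{aligned}
\text{speedup} &\approx \frac{3\ \text{weeks}}{3\ \text{min}}
  = \frac{3 \times 7 \times 24 \times 60\ \text{min}}{3\ \text{min}} = 10\,080 \approx 10^{4}, \\
\text{candidates} &= 2^{42-1} = 2\,199\,023\,255\,552 \approx 2.2 \times 10^{12}, \\
\text{rate} &\approx \frac{2.2 \times 10^{12}}{180\ \text{s}} \approx 1.2 \times 10^{10}\ \text{s}^{-1}.
\end{aligned}
```

The two run times come from different hardware and different implementations, so the factor of about $10^4$ reflects both algorithmic and hardware gains.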
Program summary
Program Title: L_CUDA.cu, L_MPI.c, L_OpenMP.c
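CPC Library link to program files: https://doi.org/10.17632/scfjjt9svm.1
Developer's repository link: https://github.com/istvanmarton/L-norms_BruteForce
Licensing provisions: GPLv3
Programming language: C, CUDA, OpenMP, MPI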
Nature of problem: The computational demand of determining the $L_d$ norm of a matrix of real coefficients is high; exact $L_d$ norms have so far been computed only for relatively small matrices. Any exact algorithm appears to scale exponentially with the number of rows (or, for $d = 1$, with the minimum of the numbers of rows and columns). In addition, the naive brute-force computation of an $L_d$ norm features a contribution that scales linearly with the product of the numbers of rows and columns, because every matrix element needs to be accessed repeatedly. Improving efficiency is thus both desirable and possibly feasible.
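The cost structure described above can be seen in a minimal C sketch of the naive approach for $d = 1$. It assumes the standard definition of the local bound as a maximum over sign vectors of the summed absolute column sums. The function name `l1_norm_naive` and the row-major `int64_t` layout are hypothetical choices for illustration and are not taken from the published program.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Naive brute-force L_1 norm of an n x m integer matrix stored row-major.
 * Hypothetical helper for illustration; not the published implementation.
 * The sign of row 0 is fixed to +1 by symmetry, leaving 2^(n-1) candidates.
 * Every candidate recomputes all column sums from scratch, so each step
 * touches all n*m matrix elements. */
int64_t l1_norm_naive(const int64_t *M, int n, int m)
{
    assert(n >= 1 && n <= 62 && m >= 1);
    int64_t best = 0;
    uint64_t count = 1ULL << (n - 1);
    for (uint64_t bits = 0; bits < count; ++bits) {
        int64_t total = 0;
        for (int j = 0; j < m; ++j) {
            int64_t col = M[j];                 /* row 0 enters with sign +1 */
            for (int i = 1; i < n; ++i) {
                int64_t v = M[(size_t)i * m + j];
                /* bit (i-1) set means row i enters with sign -1 */
                col += ((bits >> (i - 1)) & 1) ? -v : v;
            }
            total += col < 0 ? -col : col;      /* best b_j gives |column sum| */
        }
        if (total > best)
            best = total;
    }
    return best;
}
```

For the CHSH coefficient matrix `{1, 1, 1, -1}` with `n = m = 2`, this returns 2, the familiar local bound. The total work is proportional to $2^{n-1} \cdot n \cdot m$, which is the product-type contribution the paragraph refers to.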
Solution method: We attenuate this secondary contribution by accessing elements from only a single row of the matrix in each step, which yields linear scaling with the number of columns in terms of memory access. We do perform a search for the relevant row, whose operation count scales linearly with the number of rows, but this search precedes the access to the matrix elements, so the two costs add rather than multiply in the time complexity. We also identify further important mathematical shortcuts. Moreover, the problem is very well suited to parallelization, which we take advantage of. In particular, we provide an implementation for graphics processing units in addition to general-purpose processors (CPUs).
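One way to realise a "single row per step" update is to enumerate sign vectors in Gray-code order, so that consecutive candidates differ in exactly one row. The C sketch below follows that idea for $d = 1$. It is an assumption about how such a scheme can look, not a description of the published code, and the names `l1_gray_chunk` and `l1_gray` are hypothetical. The search range is passed as a chunk so that independent workers can each take a slice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Sum of absolute values of the m running column sums. */
static int64_t abs_sum(const int64_t *s, int m)
{
    int64_t t = 0;
    for (int j = 0; j < m; ++j)
        t += s[j] < 0 ? -s[j] : s[j];
    return t;
}

/* Best value over Gray-code indices [start, start + count).
 * Row 0 keeps sign +1; bit (i-1) of the Gray code set means row i has sign -1.
 * Hypothetical helper for illustration; not the published implementation. */
int64_t l1_gray_chunk(const int64_t *M, int n, int m,
                      uint64_t start, uint64_t count)
{
    assert(n >= 1 && n <= 62 && m >= 1 && count >= 1);
    assert(start + count <= (1ULL << (n - 1)));
    int64_t *s = malloc((size_t)m * sizeof *s);
    assert(s != NULL);

    uint64_t g = start ^ (start >> 1);      /* Gray code of the first index */
    /* Initialise the column sums once per chunk: O(n*m). */
    for (int j = 0; j < m; ++j) {
        s[j] = M[j];
        for (int i = 1; i < n; ++i) {
            int64_t v = M[(size_t)i * m + j];
            s[j] += ((g >> (i - 1)) & 1) ? -v : v;
        }
    }
    int64_t best = abs_sum(s, m);

    for (uint64_t k = start + 1; k < start + count; ++k) {
        /* Search for the row to flip: the lowest set bit of k.
         * This costs at most O(n) and happens before any matrix access. */
        int bit = 0;
        while (((k >> bit) & 1) == 0)
            ++bit;
        g ^= 1ULL << bit;
        /* Update the column sums from that single row: O(m). */
        const int64_t *row = M + (size_t)(bit + 1) * m;
        int64_t sign = ((g >> bit) & 1) ? -1 : 1;   /* new sign of the row */
        for (int j = 0; j < m; ++j)
            s[j] += 2 * sign * row[j];
        int64_t t = abs_sum(s, m);
        if (t > best)
            best = t;
    }
    free(s);
    return best;
}

/* Whole search space in one chunk. */
int64_t l1_gray(const int64_t *M, int n, int m)
{
    return l1_gray_chunk(M, n, m, 0, 1ULL << (n - 1));
}
```

Each step costs on the order of $n + m$ operations rather than $n \cdot m$. Splitting the index range into chunks and taking the maximum of the partial results is the natural way to spread such a search over threads, processes, or GPU blocks; the only shared state is the final maximum.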
Additional comments including restrictions and unusual features: The entries of the input matrix M must be provided as integers (i.e., conversion from floating-point entries is necessary). The program can utilize a single GPU on a computer.
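The integer-input restriction can be handled by rescaling. The C sketch below multiplies every entry by a common positive factor and rounds to the nearest integer. It is a hedged illustration: the function name `scale_to_integers` and the choice of a single scale factor are assumptions, and the input format the program actually expects is not described here. Because the bound is a maximum of expressions that are linear in $M$, scaling $M$ by $c > 0$ scales the result by $c$, so the computed value can be divided by the same factor afterwards.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Multiply each entry by `scale`, round to the nearest integer, and return
 * the largest rounding error (in scaled units) so the caller can judge
 * whether the chosen scale represents the matrix well enough.
 * Hypothetical helper for illustration; not part of the published program. */
double scale_to_integers(const double *in, int64_t *out,
                         size_t count, double scale)
{
    assert(scale > 0.0);
    double worst = 0.0;
    for (size_t k = 0; k < count; ++k) {
        double scaled = in[k] * scale;
        /* Round half away from zero without needing the math library. */
        out[k] = (int64_t)(scaled < 0.0 ? scaled - 0.5 : scaled + 0.5);
        double err = scaled - (double)out[k];
        if (err < 0.0)
            err = -err;
        if (err > worst)
            worst = err;
    }
    return worst;
}
```

With entries `{0.5, -0.25, 1.0, 0.75}` and a scale of 4, the output is `{2, -1, 4, 3}` and the reported error is zero, so the conversion is exact. A non-zero return value signals that the integer matrix only approximates the original.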
About the journal:
The focus of CPC is on contemporary computational methods and techniques and their implementation, the effectiveness of which will normally be evidenced by the author(s) within the context of a substantive problem in physics. Within this setting CPC publishes two types of paper.
Computer Programs in Physics (CPiP)
These papers describe significant computer programs to be archived in the CPC Program Library which is held in the Mendeley Data repository. The submitted software must be covered by an approved open source licence. Papers and associated computer programs that address a problem of contemporary interest in physics that cannot be solved by current software are particularly encouraged.
Computational Physics Papers (CP)
These are research papers on themes that include, but are not limited to, the following across computational physics and related disciplines:
mathematical and numerical methods and algorithms;
computational models including those associated with the design, control and analysis of experiments; and
algebraic computation.
Each will normally include software implementation and performance details. The software implementation should, ideally, be available via GitHub, Zenodo or an institutional repository. In addition, research papers on the impact of advanced computer architecture and special purpose computers on computing in the physical sciences and software topics related to, and of importance in, the physical sciences may be considered.