{"title":"Comment on paper: Position: Rethinking Post-Hoc Search-Based Neural Approaches for Solving Large-Scale Traveling Salesman Problems","authors":"Yimeng Min","doi":"arxiv-2406.09441","DOIUrl":"https://doi.org/arxiv-2406.09441","url":null,"abstract":"We identify two major issues in the SoftDist paper (Xia et al.): (1) the\u0000failure to run all steps of different baselines on the same hardware\u0000environment, and (2) the use of inconsistent time measurements when comparing\u0000to other baselines. These issues lead to flawed conclusions. When all steps are\u0000executed in the same hardware environment, the primary claim made in SoftDist\u0000is no longer supported.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"25 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141516735","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Comments on \"Federated Learning with Differential Privacy: Algorithms and Performance Analysis\"","authors":"Mahtab Talaei, Iman Izadi","doi":"arxiv-2406.05858","DOIUrl":"https://doi.org/arxiv-2406.05858","url":null,"abstract":"In the paper by Wei et al. (\"Federated Learning with Differential Privacy:\u0000Algorithms and Performance Analysis\"), the convergence performance of the\u0000proposed differential privacy algorithm in federated learning (FL), known as\u0000Noising before Model Aggregation FL (NbAFL), was studied. However, the\u0000presented convergence upper bound of NbAFL (Theorem 2) is incorrect. This\u0000comment aims to present the correct form of the convergence upper bound for\u0000NbAFL.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"72 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141507215","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead","authors":"Amir Zandieh, Majid Daliri, Insu Han","doi":"arxiv-2406.03482","DOIUrl":"https://doi.org/arxiv-2406.03482","url":null,"abstract":"Serving LLMs requires substantial memory due to the storage requirements of\u0000Key-Value (KV) embeddings in the KV cache, which grows with sequence length. An\u0000effective approach to compress KV cache is quantization. However, traditional\u0000quantization methods face significant memory overhead due to the need to store\u0000quantization constants (at least a zero point and a scale) in full precision\u0000per data block. Depending on the block size, this overhead can add 1 or 2 bits\u0000per quantized number. We introduce QJL, a new quantization approach that\u0000consists of a Johnson-Lindenstrauss (JL) transform followed by sign-bit\u0000quantization. In contrast to existing methods, QJL eliminates memory overheads\u0000by removing the need for storing quantization constants. We propose an\u0000asymmetric estimator for the inner product of two vectors and demonstrate that\u0000applying QJL to one vector and a standard JL transform without quantization to\u0000the other provides an unbiased estimator with minimal distortion. We have\u0000developed an efficient implementation of the QJL sketch and its corresponding\u0000inner product estimator, incorporating a lightweight CUDA kernel for optimized\u0000computation. When applied across various LLMs and NLP tasks to quantize the KV\u0000cache to only 3 bits, QJL demonstrates a more than fivefold reduction in KV\u0000cache memory usage without compromising accuracy, all while achieving faster\u0000runtime. Codes are available at url{https://github.com/amirzandieh/QJL}.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"35 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141549938","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Impact of Generative AI (Large Language Models) on the PRA model construction and maintenance, observations","authors":"Valentin RychkovEDF R&D, Claudia PicocoEDF R&D, Emilie CalecaEDF R&D","doi":"arxiv-2406.01133","DOIUrl":"https://doi.org/arxiv-2406.01133","url":null,"abstract":"The rapid development of Large Language Models (LLMs) and Generative\u0000Pre-Trained Transformers(GPTs) in the field of Generative Artificial\u0000Intelligence (AI) can significantly impact task automation in themodern\u0000economy. We anticipate that the PRA field will inevitably be affected by this\u0000technology1. Thus, themain goal of this paper is to engage the risk assessment\u0000community into a discussion of benefits anddrawbacks of this technology for\u0000PRA. We make a preliminary analysis of possible application of LLM\u0000inProbabilistic Risk Assessment (PRA) modeling context referring to the ongoing\u0000experience in softwareengineering field. We explore potential application\u0000scenarios and the necessary conditions for controlledLLM usage in PRA modeling\u0000(whether static or dynamic). Additionally, we consider the potential impact\u0000ofthis technology on PRA modeling tools.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"8 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141258839","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Ranking with Ties based on Noisy Performance Data","authors":"Aravind Sankaran, Lars Karlsson, Paolo Bientinesi","doi":"arxiv-2405.18259","DOIUrl":"https://doi.org/arxiv-2405.18259","url":null,"abstract":"We consider the problem of ranking a set of objects based on their\u0000performance when the measurement of said performance is subject to noise. In\u0000this scenario, the performance is measured repeatedly, resulting in a range of\u0000measurements for each object. If the ranges of two objects do not overlap, then\u0000we consider one object as 'better' than the other, and we expect it to receive\u0000a higher rank; if, however, the ranges overlap, then the objects are\u0000incomparable, and we wish them to be assigned the same rank. Unfortunately, the\u0000incomparability relation of ranges is in general not transitive; as a\u0000consequence, in general the two requirements cannot be satisfied\u0000simultaneously, i.e., it is not possible to guarantee both distinct ranks for\u0000objects with separated ranges, and same rank for objects with overlapping\u0000ranges. This conflict leads to more than one reasonable way to rank a set of\u0000objects. In this paper, we explore the ambiguities that arise when ranking with\u0000ties, and define a set of reasonable rankings, which we call partial rankings.\u0000We develop and analyse three different methodologies to compute a partial\u0000ranking. Finally, we show how performance differences among objects can be\u0000investigated with the help of partial ranking.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"62 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141165951","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Analysis of Performance Bottlenecks in MRI Pre-Processing","authors":"Mathieu Dugré, Yohan Chatelain, Tristan Glatard","doi":"arxiv-2405.17650","DOIUrl":"https://doi.org/arxiv-2405.17650","url":null,"abstract":"Magnetic Resonance Image (MRI) pre-processing is a critical step for\u0000neuroimaging analysis. However, the computational cost of MRI pre-processing\u0000pipelines is a major bottleneck for large cohort studies and some clinical\u0000applications. While High-Performance Computing (HPC) and, more recently, Deep\u0000Learning have been adopted to accelerate the computations, these techniques\u0000require costly hardware and are not accessible to all researchers. Therefore,\u0000it is important to understand the performance bottlenecks of MRI pre-processing\u0000pipelines to improve their performance. Using Intel VTune profiler, we\u0000characterized the bottlenecks of several commonly used MRI-preprocessing\u0000pipelines from the ANTs, FSL, and FreeSurfer toolboxes. We found that few\u0000functions contributed to most of the CPU time, and that linear interpolation\u0000was the largest contributor. Data access was also a substantial bottleneck. We\u0000identified a bug in the ITK library that impacts the performance of ANTs\u0000pipeline in single-precision and a potential issue with the OpenMP scaling in\u0000FreeSurfer recon-all. Our results provide a reference for future efforts to\u0000optimize MRI pre-processing pipelines.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"129 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141166526","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evaluation of Resource-Efficient Crater Detectors on Embedded Systems","authors":"Simon Vellas, Bill Psomas, Kalliopi Karadima, Dimitrios Danopoulos, Alexandros Paterakis, George Lentaris, Dimitrios Soudris, Konstantinos Karantzalos","doi":"arxiv-2405.16953","DOIUrl":"https://doi.org/arxiv-2405.16953","url":null,"abstract":"Real-time analysis of Martian craters is crucial for mission-critical\u0000operations, including safe landings and geological exploration. This work\u0000leverages the latest breakthroughs for on-the-edge crater detection aboard\u0000spacecraft. We rigorously benchmark several YOLO networks using a Mars craters\u0000dataset, analyzing their performance on embedded systems with a focus on\u0000optimization for low-power devices. We optimize this process for a new wave of\u0000cost-effective, commercial-off-the-shelf-based smaller satellites.\u0000Implementations on diverse platforms, including Google Coral Edge TPU, AMD\u0000Versal SoC VCK190, Nvidia Jetson Nano and Jetson AGX Orin, undergo a detailed\u0000trade-off analysis. Our findings identify optimal network-device pairings,\u0000enhancing the feasibility of crater detection on resource-constrained hardware\u0000and setting a new precedent for efficient and resilient extraterrestrial\u0000imaging. Code at: https://github.com/billpsomas/mars_crater_detection.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"21 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141166235","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AmBC-NOMA-Aided Short-Packet Communication for High Mobility V2X Transmissions","authors":"Xinyue Pei, Xingwei Wang, Yingyang Chen, Tingrui Pei, Miaowen Wen","doi":"arxiv-2405.16502","DOIUrl":"https://doi.org/arxiv-2405.16502","url":null,"abstract":"In this paper, we investigate the performance of ambient backscatter\u0000communication non-orthogonal multiple access (AmBC-NOMA)-assisted short packet\u0000communication for high-mobility vehicle-to-everything transmissions. In the\u0000proposed system, a roadside unit (RSU) transmits a superimposed signal to a\u0000typical NOMA user pair. Simultaneously, the backscatter device (BD) transmits\u0000its own signal towards the user pair by reflecting and modulating the RSU's\u0000superimposed signals. Due to vehicles' mobility, we consider realistic\u0000assumptions of time-selective fading and channel estimation errors. Theoretical\u0000expressions for the average block error rates (BLERs) of both users are\u0000derived. Furthermore, analysis and insights on transmit signal-to-noise ratio,\u0000vehicles' mobility, imperfect channel estimation, the reflection efficiency at\u0000the BD, and blocklength are provided. Numerical results validate the\u0000theoretical findings and reveal that the AmBC-NOMA system outperforms its\u0000orthogonal multiple access counterpart in terms of BLER performance.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"45 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141166027","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Graph neural networks with configuration cross-attention for tensor compilers","authors":"Dmitrii Khizbullin, Eduardo Rocha de Andrade, Thanh Hau Nguyen, Matheus Pedroza Ferreira, David R. Pugh","doi":"arxiv-2405.16623","DOIUrl":"https://doi.org/arxiv-2405.16623","url":null,"abstract":"With the recent popularity of neural networks comes the need for efficient\u0000serving of inference workloads. A neural network inference workload can be\u0000represented as a computational graph with nodes as operators transforming\u0000multidimensional tensors. The tensors can be transposed and/or tiled in a\u0000combinatorially large number of ways, some configurations leading to\u0000accelerated inference. We propose TGraph, a neural graph architecture that\u0000allows screening for fast configurations of the target computational graph,\u0000thus representing an artificial intelligence (AI) tensor compiler in contrast\u0000to the traditional heuristics-based compilers. The proposed solution improves\u0000mean Kendall's $tau$ across layout collections of TpuGraphs from 29.8% of the\u0000reliable baseline to 67.4% of TGraph. We estimate the potential CO$_2$ emission\u0000reduction associated with our work to be equivalent to over 50% of the total\u0000household emissions in the areas hosting AI-oriented data centers.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"66 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141165943","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Experimental Study of Different Aggregation Schemes in Semi-Asynchronous Federated Learning","authors":"Yunbo Li, Jiaping Gui, Yue Wu","doi":"arxiv-2405.16086","DOIUrl":"https://doi.org/arxiv-2405.16086","url":null,"abstract":"Federated learning is highly valued due to its high-performance computing in\u0000distributed environments while safeguarding data privacy. To address resource\u0000heterogeneity, researchers have proposed a semi-asynchronous federated learning\u0000(SAFL) architecture. However, the performance gap between different aggregation\u0000targets in SAFL remain unexplored. In this paper, we systematically compare the performance between two\u0000algorithm modes, FedSGD and FedAvg that correspond to aggregating gradients and\u0000models, respectively. Our results across various task scenarios indicate these\u0000two modes exhibit a substantial performance gap. Specifically, FedSGD achieves\u0000higher accuracy and faster convergence but experiences more severe fluctuates\u0000in accuracy, whereas FedAvg excels in handling straggler issues but converges\u0000slower with reduced accuracy.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"2016 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141166243","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}