{"title":"Toward Smart Scheduling in Tapis","authors":"Joe Stubbs, Smruti Padhy, Richard Cardone","doi":"arxiv-2408.03349","DOIUrl":"https://doi.org/arxiv-2408.03349","url":null,"abstract":"The Tapis framework provides APIs for automating job execution on remote\u0000resources, including HPC clusters and servers running in the cloud. Tapis can\u0000simplify the interaction with remote cyberinfrastructure (CI), but the current\u0000services require users to specify the exact configuration of a job to run,\u0000including the system, queue, node count, and maximum run time, among other\u0000attributes. Moreover, the remote resources must be defined and configured in\u0000Tapis before a job can be submitted. In this paper, we present our efforts to\u0000develop an intelligent job scheduling capability in Tapis, where various\u0000attributes about a job configuration can be automatically determined for the\u0000user, and computational resources can be dynamically provisioned by Tapis for\u0000specific jobs. We develop an overall architecture for such a feature, which\u0000suggests a set of core challenges to be solved. Then, we focus on one such\u0000specific challenge: predicting queue times for a job on different HPC systems\u0000and queues, and we present two sets of results based on machine learning\u0000methods. Our first set of results cast the problem as a regression, which can\u0000be used to select the best system from a list of existing options. 
Our second\u0000set of results frames the problem as a classification, allowing us to compare\u0000the use of an existing system with a dynamically provisioned resource.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"18 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141940543","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
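The queue-time regression idea in the Tapis abstract can be sketched in miniature: fit a per-queue model of historical wait times, then pick the queue with the lowest prediction. The queue names, the single node-count feature, and all data below are hypothetical; the paper's actual models are ML-based rather than this closed-form one-feature fit.

```python
def fit_ols(xs, ys):
    """Closed-form simple linear regression: returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def predict(model, x):
    slope, intercept = model
    return slope * x + intercept

# Synthetic history: (node counts, observed queue waits in minutes) per queue.
history = {
    "cluster_a/normal": ([1, 4, 8, 16], [10, 35, 70, 150]),
    "cluster_b/skx":    ([1, 4, 8, 16], [5, 12, 20, 40]),
}

models = {q: fit_ols(xs, ys) for q, (xs, ys) in history.items()}

def best_queue(node_count):
    """Select the queue with the lowest predicted wait for this job size."""
    return min(models, key=lambda q: predict(models[q], node_count))
```

A classification variant, as in the paper's second set of results, would instead predict whether the wait exceeds the provisioning time of a dynamic resource.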
{"title":"Solving Large Rank-Deficient Linear Least-Squares Problems on Shared-Memory CPU Architectures and GPU Architectures","authors":"Mónica Chillarón, Gregorio Quintana-Ortí, Vicente Vidal, Per-Gunnar Martinsson","doi":"arxiv-2408.05238","DOIUrl":"https://doi.org/arxiv-2408.05238","url":null,"abstract":"Solving very large linear systems of equations is a key computational task in\u0000science and technology. In many cases, the coefficient matrix of the linear\u0000system is rank-deficient, leading to systems that may be underdetermined,\u0000inconsistent, or both. In such cases, one generally seeks to compute the least\u0000squares solution that minimizes the residual of the problem, which can be\u0000further defined as the solution with smallest norm in cases where the\u0000coefficient matrix has a nontrivial nullspace. This work presents several new\u0000techniques for solving least squares problems involving coefficient matrices\u0000that are so large that they do not fit in main memory. The implementations\u0000include both CPU and GPU variants. All techniques rely on complete orthogonal\u0000decompositions that guarantee that both conditions of a least squares solution\u0000are met, regardless of the rank properties of the matrix. Specifically, they\u0000rely on the recently proposed \"randUTV\" algorithm that is particularly\u0000effective in strongly communication-constrained environments. 
A detailed\u0000precision and performance study reveals that the new methods, that operate on\u0000data stored on disk, are competitive with state-of-the-art methods that store\u0000all data in main memory.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"8 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142225119","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
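The minimum-norm condition discussed in this abstract can be illustrated on a toy case: for a full-row-rank underdetermined system, the minimum-norm solution is x = A^T (A A^T)^{-1} b, written out here for a 2x3 system so the 2x2 inverse is explicit. This is only a sketch of the property; the paper's randUTV-based complete orthogonal decompositions handle general rank deficiency (including inconsistent systems) at out-of-core scale.

```python
def minnorm_underdetermined(A, b):
    """Minimum-norm solution x = A^T (A A^T)^{-1} b for a full-row-rank
    underdetermined 2x3 system (the 2x2 Gram inverse is written explicitly)."""
    (a11, a12, a13), (a21, a22, a23) = A
    # Gram matrix G = A A^T (2x2) and its explicit inverse.
    g11 = a11 * a11 + a12 * a12 + a13 * a13
    g12 = a11 * a21 + a12 * a22 + a13 * a23
    g22 = a21 * a21 + a22 * a22 + a23 * a23
    det = g11 * g22 - g12 * g12
    y1 = ( g22 * b[0] - g12 * b[1]) / det
    y2 = (-g12 * b[0] + g11 * b[1]) / det
    # x = A^T y lies in the row space of A, hence has minimal norm.
    return [a11 * y1 + a21 * y2, a12 * y1 + a22 * y2, a13 * y1 + a23 * y2]

# Underdetermined example: 2 equations, 3 unknowns.
x = minnorm_underdetermined([[1, 0, 1], [0, 1, 1]], [2, 3])
```

Any other solution differs from this x by a nullspace vector orthogonal to it, so it cannot have smaller norm.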
{"title":"Billion-files File Systems (BfFS): A Comparison","authors":"Sohail Shaikh","doi":"arxiv-2408.01805","DOIUrl":"https://doi.org/arxiv-2408.01805","url":null,"abstract":"As the volume of data being produced is increasing at an exponential rate\u0000that needs to be processed quickly, it is reasonable that the data needs to be\u0000available very close to the compute devices to reduce transfer latency. Due to\u0000this need, local filesystems are getting close attention to understand their\u0000inner workings, performance, and more importantly their limitations. This study\u0000analyzes few popular Linux filesystems: EXT4, XFS, BtrFS, ZFS, and F2FS by\u0000creating, storing, and then reading back one billion files from the local\u0000filesystem. The study also captured and analyzed read/write throughput, storage\u0000blocks usage, disk space utilization and overheads, and other metrics useful\u0000for system designers and integrators. Furthermore, the study explored other\u0000side effects such as filesystem performance degradation during and after these\u0000large numbers of files and folders are created.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"78 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141940483","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GPUDrive: Data-driven, multi-agent driving simulation at 1 million FPS","authors":"Saman Kazemkhani, Aarav Pandya, Daphne Cornelisse, Brennan Shacklett, Eugene Vinitsky","doi":"arxiv-2408.01584","DOIUrl":"https://doi.org/arxiv-2408.01584","url":null,"abstract":"Multi-agent learning algorithms have been successful at generating superhuman\u0000planning in a wide variety of games but have had little impact on the design of\u0000deployed multi-agent planners. A key bottleneck in applying these techniques to\u0000multi-agent planning is that they require billions of steps of experience. To\u0000enable the study of multi-agent planning at this scale, we present GPUDrive, a\u0000GPU-accelerated, multi-agent simulator built on top of the Madrona Game Engine\u0000that can generate over a million steps of experience per second. Observation,\u0000reward, and dynamics functions are written directly in C++, allowing users to\u0000define complex, heterogeneous agent behaviors that are lowered to\u0000high-performance CUDA. We show that using GPUDrive we are able to effectively\u0000train reinforcement learning agents over many scenes in the Waymo Motion\u0000dataset, yielding highly effective goal-reaching agents in minutes for\u0000individual scenes and generally capable agents in a few hours. 
We ship these\u0000trained agents as part of the code base at\u0000https://github.com/Emerge-Lab/gpudrive.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"173 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141940482","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Understanding and Enhancing Linux Kernel-based Packet Switching on WiFi Access Points","authors":"Shiqi Zhang, Mridul Gupta, Behnam Dezfouli","doi":"arxiv-2408.01013","DOIUrl":"https://doi.org/arxiv-2408.01013","url":null,"abstract":"As the number of WiFi devices and their traffic demands continue to rise, the\u0000need for a scalable and high-performance wireless infrastructure becomes\u0000increasingly essential. Central to this infrastructure are WiFi Access Points\u0000(APs), which facilitate packet switching between Ethernet and WiFi interfaces.\u0000Despite APs' reliance on the Linux kernel's data plane for packet switching,\u0000the detailed operations and complexities of switching packets between Ethernet\u0000and WiFi interfaces have not been investigated in existing works. This paper\u0000makes the following contributions towards filling this research gap. Through\u0000macro and micro-analysis of empirical experiments, our study reveals insights\u0000in two distinct categories. Firstly, while the kernel's statistics offer\u0000valuable insights into system operations, we identify and discuss potential\u0000pitfalls that can severely affect system analysis. For instance, we reveal the\u0000implications of device drivers on the meaning and accuracy of the statistics\u0000related to packet-switching tasks and processor utilization. Secondly, we\u0000analyze the impact of the packet switching path and core configuration on\u0000performance and power consumption. Specifically, we identify the differences in\u0000Ethernet-to-WiFi and WiFi-to-Ethernet data paths regarding processing\u0000components, multi-core utilization, and energy efficiency. 
We show that the\u0000WiFi-to-Ethernet data path leverages better multi-core processing and exhibits\u0000lower power consumption.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"96 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141940484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Age of Information Analysis for Multi-Priority Queue and NOMA Enabled C-V2X in IoV","authors":"Zheng Zhang, Qiong Wu, Pingyi Fan, Ke Xiong","doi":"arxiv-2408.00223","DOIUrl":"https://doi.org/arxiv-2408.00223","url":null,"abstract":"As development Internet-of-Vehicles (IoV) technology and demand for\u0000Intelligent Transportation Systems (ITS) increase, there is a growing need for\u0000real-time data and communication by vehicle users. Traditional request-based\u0000methods face challenges such as latency and bandwidth limitations. Mode 4 in\u0000Connected Vehicle-to-Everything (C-V2X) addresses latency and overhead issues\u0000through autonomous resource selection. However, Semi-Persistent Scheduling\u0000(SPS) based on distributed sensing may lead to increased collision.\u0000Non-Orthogonal Multiple Access (NOMA) can alleviate the problem of reduced\u0000packet reception probability due to collisions. Moreover, the concept of Age of\u0000Information (AoI) is introduced as a comprehensive metric reflecting\u0000reliability and latency performance, analyzing the impact of NOMA on C-V2X\u0000communication system. AoI indicates the time a message spends in both local\u0000waiting and transmission processes. In C-V2X, waiting process can be extended\u0000to queuing process, influenced by packet generation rate and Resource\u0000Reservation Interval (RRI). The transmission process is mainly affected by\u0000transmission delay and success rate. In C-V2X, a smaller selection window (SW)\u0000limits the number of available resources for vehicles, resulting in higher\u0000collision rates with increased number of vehicles. SW is generally equal to\u0000RRI, which not only affects AoI in queuing process but also AoI in the\u0000transmission process. 
Therefore, this paper proposes an AoI estimation method\u0000based on multi-priority data type queues and considers the influence of NOMA on\u0000the AoI generated in both processes in C-V2X system under different RRI\u0000conditions. This work aims to gain a better performance of C-V2X system\u0000comparing with some known algorithms.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"173 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141880961","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
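The AoI metric this abstract builds on is the time-average of a sawtooth curve: the receiver's age grows linearly and resets to the packet's system time at each delivery. A minimal single-queue sketch of that integration follows; it is not the paper's multi-priority NOMA analysis, and the event trace is hypothetical.

```python
def average_aoi(events):
    """events: time-ordered list of (t_generated, t_delivered) pairs for
    delivered packets. Returns the time-average Age of Information at the
    receiver, integrating the sawtooth age curve between deliveries."""
    area = 0.0
    prev_deliver, prev_age = None, None
    for t_gen, t_del in events:
        age_at_delivery = t_del - t_gen  # age resets to this packet's delay
        if prev_deliver is not None:
            dt = t_del - prev_deliver
            # Age rises linearly from prev_age to prev_age + dt, then resets:
            # area under that trapezoid is prev_age*dt + dt^2/2.
            area += prev_age * dt + 0.5 * dt * dt
        prev_deliver, prev_age = t_del, age_at_delivery
    span = events[-1][1] - events[0][1]
    return area / span
```

Queuing delay enters through t_generated-to-service gaps, and RRI/collision effects would enter through the delivery times, which is where the paper's per-priority queues and NOMA model come in.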
{"title":"Accelerating Transfer Function Update for Distance Map based Volume Rendering","authors":"Michael Rauter, Lukas Zimmermann, Markus Zeilinger","doi":"arxiv-2407.21552","DOIUrl":"https://doi.org/arxiv-2407.21552","url":null,"abstract":"Direct volume rendering using ray-casting is widely used in practice. By\u0000using GPUs and applying acceleration techniques as empty space skipping, high\u0000frame rates are possible on modern hardware. This enables performance-critical\u0000use-cases such as virtual reality volume rendering. The currently fastest known\u0000technique uses volumetric distance maps to skip empty sections of the volume\u0000during ray-casting but requires the distance map to be updated per transfer\u0000function change. In this paper, we demonstrate a technique for subdividing the\u0000volume intensity range into partitions and deriving what we call partitioned\u0000distance maps. These can be used to accelerate the distance map computation for\u0000a newly changed transfer function by a factor up to 30. This allows the\u0000currently fastest known empty space skipping approach to be used while\u0000maintaining high frame rates even when the transfer function is changed\u0000frequently.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"41 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141870319","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"In-Situ Techniques on GPU-Accelerated Data-Intensive Applications","authors":"Yi Ju, Mingshuai Li, Adalberto Perez, Laura Bellentani, Niclas Jansson, Stefano Markidis, Philipp Schlatter, Erwin Laure","doi":"arxiv-2407.20731","DOIUrl":"https://doi.org/arxiv-2407.20731","url":null,"abstract":"The computational power of High-Performance Computing (HPC) systems is\u0000constantly increasing, however, their input/output (IO) performance grows\u0000relatively slowly, and their storage capacity is also limited. This unbalance\u0000presents significant challenges for applications such as Molecular Dynamics\u0000(MD) and Computational Fluid Dynamics (CFD), which generate massive amounts of\u0000data for further visualization or analysis. At the same time, checkpointing is\u0000crucial for long runs on HPC clusters, due to limited walltimes and/or failures\u0000of system components, and typically requires the storage of large amount of\u0000data. Thus, restricted IO performance and storage capacity can lead to\u0000bottlenecks for the performance of full application workflows (as compared to\u0000computational kernels without IO). In-situ techniques, where data is further\u0000processed while still in memory rather to write it out over the I/O subsystem,\u0000can help to tackle these problems. In contrast to traditional post-processing\u0000methods, in-situ techniques can reduce or avoid the need to write or read data\u0000via the IO subsystem. They offer a promising approach for applications aiming\u0000to leverage the full power of large scale HPC systems. In-situ techniques can\u0000also be applied to hybrid computational nodes on HPC systems consisting of\u0000graphics processing units (GPUs) and central processing units (CPUs). On one\u0000node, the GPUs would have significant performance advantages over the CPUs.\u0000Therefore, current approaches for GPU-accelerated applications often focus on\u0000maximizing GPU usage, leaving CPUs underutilized. 
In-situ tasks using CPUs to\u0000perform data analysis or preprocess data concurrently to the running\u0000simulation, offer a possibility to improve this underutilization.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"8 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141873316","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards an Integrated Performance Framework for Fire Science and Management Workflows","authors":"H. Ahmed, R. Shende, I. Perez, D. Crawl, S. Purawat, I. Altintas","doi":"arxiv-2407.21231","DOIUrl":"https://doi.org/arxiv-2407.21231","url":null,"abstract":"Reliable performance metrics are necessary prerequisites to building\u0000large-scale end-to-end integrated workflows for collaborative scientific\u0000research, particularly within context of use-inspired decision making platforms\u0000with many concurrent users and when computing real-time and urgent results\u0000using large data. This work is a building block for the National Data Platform,\u0000which leverages multiple use-cases including the WIFIRE Data and Model Commons\u0000for wildfire behavior modeling and the EarthScope Consortium for collaborative\u0000geophysical research. This paper presents an artificial intelligence and\u0000machine learning (AI/ML) approach to performance assessment and optimization of\u0000scientific workflows. An associated early AI/ML framework spanning performance\u0000data collection, prediction and optimization is applied to wildfire science\u0000applications within the WIFIRE BurnPro3D (BP3D) platform for proactive fire\u0000management and mitigation.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"221 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141870322","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Understanding the Impact of Synchronous, Asynchronous, and Hybrid In-Situ Techniques in Computational Fluid Dynamics Applications","authors":"Yi Ju, Adalberto Perez, Stefano Markidis, Philipp Schlatter, Erwin Laure","doi":"arxiv-2407.20717","DOIUrl":"https://doi.org/arxiv-2407.20717","url":null,"abstract":"High-Performance Computing (HPC) systems provide input/output (IO)\u0000performance growing relatively slowly compared to peak computational\u0000performance and have limited storage capacity. Computational Fluid Dynamics\u0000(CFD) applications aiming to leverage the full power of Exascale HPC systems,\u0000such as the solver Nek5000, will generate massive data for further processing.\u0000These data need to be efficiently stored via the IO subsystem. However, limited\u0000IO performance and storage capacity may result in performance, and thus\u0000scientific discovery, bottlenecks. In comparison to traditional post-processing\u0000methods, in-situ techniques can reduce or avoid writing and reading the data\u0000through the IO subsystem, promising to be a solution to these problems. In this\u0000paper, we study the performance and resource usage of three in-situ use cases:\u0000data compression, image generation, and uncertainty quantification. We\u0000furthermore analyze three approaches when these in-situ tasks and the\u0000simulation are executed synchronously, asynchronously, or in a hybrid manner.\u0000In-situ compression can be used to reduce the IO time and storage requirements\u0000while maintaining data accuracy. Furthermore, in-situ visualization and\u0000analysis can save Terabytes of data from being routed through the IO subsystem\u0000to storage. However, the overall efficiency is crucially dependent on the\u0000characteristics of both, the in-situ task and the simulation. In some cases,\u0000the overhead introduced by the in-situ tasks can be substantial. 
Therefore, it\u0000is essential to choose the proper in-situ approach, synchronous, asynchronous,\u0000or hybrid, to minimize overhead and maximize the benefits of concurrent\u0000execution.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"77 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141870321","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
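The asynchronous approach compared in this abstract can be sketched with a worker thread: the "simulation" hands each step's data to a queue and continues immediately, while a helper compresses it in memory instead of the raw data going through the IO subsystem. The data layout, step count, and choice of zlib compression here are illustrative only; the synchronous variant would simply compress inline between steps.

```python
import queue
import threading
import zlib

def run_async_insitu(steps):
    """Run a fake simulation for `steps` iterations with an asynchronous
    in-situ compression task; returns the compressed size per step."""
    work = queue.Queue()
    compressed_sizes = []

    def insitu_worker():
        while True:
            item = work.get()
            if item is None:          # sentinel: simulation finished
                break
            compressed_sizes.append(len(zlib.compress(item)))

    t = threading.Thread(target=insitu_worker)
    t.start()
    for step in range(steps):
        # Fake field data: a smooth ramp, standing in for a CFD snapshot.
        field = bytes((step + i) % 251 for i in range(4096))
        work.put(field)               # hand off; simulation continues at once
    work.put(None)
    t.join()
    return compressed_sizes
```

In CPython the GIL limits true concurrency for pure-Python workers, so a real setup would use processes or native in-situ libraries; the hand-off pattern is the point of the sketch.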