{"title":"FedFT: Improving Communication Performance for Federated Learning with Frequency Space Transformation","authors":"Chamath Palihawadana, Nirmalie Wiratunga, Anjana Wijekoon, Harsha Kalutarage","doi":"arxiv-2409.05242","DOIUrl":"https://doi.org/arxiv-2409.05242","url":null,"abstract":"Communication efficiency is a widely recognised research problem in Federated\u0000Learning (FL), with recent work focused on developing techniques for efficient\u0000compression, distribution and aggregation of model parameters between clients\u0000and the server. Particularly within distributed systems, it is important to\u0000balance the need for computational cost and communication efficiency. However,\u0000existing methods are often constrained to specific applications and are less\u0000generalisable. In this paper, we introduce FedFT (federated frequency-space\u0000transformation), a simple yet effective methodology for communicating model\u0000parameters in a FL setting. FedFT uses Discrete Cosine Transform (DCT) to\u0000represent model parameters in frequency space, enabling efficient compression\u0000and reducing communication overhead. FedFT is compatible with various existing\u0000FL methodologies and neural architectures, and its linear property eliminates\u0000the need for multiple transformations during federated aggregation. This\u0000methodology is vital for distributed solutions, tackling essential challenges\u0000like data privacy, interoperability, and energy efficiency inherent to these\u0000environments. We demonstrate the generalisability of the FedFT methodology on\u0000four datasets using comparative studies with three state-of-the-art FL\u0000baselines (FedAvg, FedProx, FedSim). Our results demonstrate that using FedFT\u0000to represent the differences in model parameters between communication rounds\u0000in frequency space results in a more compact representation compared to\u0000representing the entire model in frequency space. This leads to a reduction in\u0000communication overhead, while keeping accuracy levels comparable and in some\u0000cases even improving it. Our results suggest that this reduction can range from\u00005% to 30% per client, depending on dataset.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"106 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211095","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ARIM-mdx Data System: Towards a Nationwide Data Platform for Materials Science","authors":"Masatoshi Hanai, Ryo Ishikawa, Mitsuaki Kawamura, Masato Ohnishi, Norio Takenaka, Kou Nakamura, Daiju Matsumura, Seiji Fujikawa, Hiroki Sakamoto, Yukinori Ochiai, Tetsuo Okane, Shin-Ichiro Kuroki, Atsuo Yamada, Toyotaro Suzumura, Junichiro Shiomi, Kenjiro Taura, Yoshio Mita, Naoya Shibata, Yuichi Ikuhara","doi":"arxiv-2409.06734","DOIUrl":"https://doi.org/arxiv-2409.06734","url":null,"abstract":"In modern materials science, effective and high-volume data management across\u0000leading-edge experimental facilities and world-class supercomputers is\u0000indispensable for cutting-edge research. Such facilities and supercomputers are\u0000typically utilized by a wide range of researchers across different fields and\u0000organizations in academia and industry. However, existing integrated systems\u0000that handle data from these resources have primarily focused just on\u0000smaller-scale cross-institutional or single-domain operations. As a result,\u0000they often lack the scalability, efficiency, agility, and interdisciplinarity,\u0000needed for handling substantial volumes of data from various researchers. In this paper, we introduce ARIM-mdx data system, a nationwide data platform\u0000for materials science in Japan. The platform involves 8 universities and\u0000institutes all over Japan through the governmental materials science project.\u0000Currently in its trial phase, the ARIM-mdx data system is utilized by over 800\u0000researchers from around 140 organizations in academia and industry, being\u0000intended to gradually expand its reach. The system employs a hybrid\u0000architecture, combining a peta-scale dedicated storage system for security and\u0000stability with a high-performance academic cloud for efficiency and\u0000scalability. Through direct network connections between them, the system\u0000achieves 4.7x latency reduction compared to a conventional approach, resulting\u0000in near real-time interactive data analysis. It also utilizes specialized IoT\u0000devices for secure data transfer from equipment computers and connects to\u0000multiple supercomputers via an academic ultra-fast network, achieving 4x faster\u0000data transfer compared to the public Internet. The ARIM-mdx data system, as a\u0000pioneering nationwide data platform, has the potential to contribute to the\u0000creation of new research communities and accelerates innovations.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"69 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211086","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Distributed Agreement in the Arrovian Framework","authors":"Kenan Wood, Hammurabi Mendes, Jonad Pulaj","doi":"arxiv-2409.04685","DOIUrl":"https://doi.org/arxiv-2409.04685","url":null,"abstract":"Preference aggregation is a fundamental problem in voting theory, in which\u0000public input rankings of a set of alternatives (called preferences) must be\u0000aggregated into a single preference that satisfies certain soundness\u0000properties. The celebrated Arrow Impossibility Theorem is equivalent to a\u0000distributed task in a synchronous fault-free system that satisfies properties\u0000such as respecting unanimous preferences, maintaining independence of\u0000irrelevant alternatives (IIA), and non-dictatorship, along with consensus since\u0000only one preference can be decided. In this work, we study a weaker distributed task in which crash faults are\u0000introduced, IIA is not required, and the consensus property is relaxed to\u0000either $k$-set agreement or $epsilon$-approximate agreement using any metric\u0000on the set of preferences. In particular, we prove several novel impossibility\u0000results for both of these tasks in both synchronous and asynchronous\u0000distributed systems. We additionally show that the impossibility for our\u0000$epsilon$-approximate agreement task using the Kendall tau or Spearman\u0000footrule metrics holds under extremely weak assumptions.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"268 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211097","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Revisiting the Time Cost Model of AllReduce","authors":"Dian Xiong, Li Chen, Youhe Jiang, Dan Li, Shuai Wang, Songtao Wang","doi":"arxiv-2409.04202","DOIUrl":"https://doi.org/arxiv-2409.04202","url":null,"abstract":"AllReduce is an important and popular collective communication primitive,\u0000which has been widely used in areas such as distributed machine learning and\u0000high performance computing. To design, analyze, and choose from various\u0000algorithms and implementations of AllReduce, the time cost model plays a\u0000crucial role, and the predominant one is the $(alpha,beta,gamma)$ model. In\u0000this paper, we revisit this model, and reveal that it cannot well characterize\u0000the time cost of AllReduce on modern clusters; thus must be updated. We perform\u0000extensive measurements to identify two additional terms contributing to the\u0000time cost: the incast term and the memory access term. We augment the\u0000$(alpha,beta,gamma)$ model with these two terms, and present GenModel as a\u0000result. Using GenModel, we discover two new optimalities for AllReduce\u0000algorithms, and prove that they cannot be achieved simultaneously. Finally,\u0000striking the balance between the two new optimalities, we design GenTree, an\u0000AllReduce plan generation algorithm specialized for tree-like topologies.\u0000Experiments on a real testbed with 64 GPUs show that GenTree can achieve\u00001.22$times$ to 1.65$times$ speed-up against NCCL. Large-scale simulations\u0000also confirm that GenTree can improve the state-of-the-art AllReduce algorithm\u0000by a factor of $1.2$ to $7.4$ in scenarios where the two new terms dominate.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"62 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211116","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Heterogeneity-Aware Cooperative Federated Edge Learning with Adaptive Computation and Communication Compression","authors":"Zhenxiao Zhang, Zhidong Gao, Yuanxiong Guo, Yanmin Gong","doi":"arxiv-2409.04022","DOIUrl":"https://doi.org/arxiv-2409.04022","url":null,"abstract":"Motivated by the drawbacks of cloud-based federated learning (FL),\u0000cooperative federated edge learning (CFEL) has been proposed to improve\u0000efficiency for FL over mobile edge networks, where multiple edge servers\u0000collaboratively coordinate the distributed model training across a large number\u0000of edge devices. However, CFEL faces critical challenges arising from dynamic\u0000and heterogeneous device properties, which slow down the convergence and\u0000increase resource consumption. This paper proposes a heterogeneity-aware CFEL\u0000scheme called textit{Heterogeneity-Aware Cooperative Edge-based Federated\u0000Averaging} (HCEF) that aims to maximize the model accuracy while minimizing the\u0000training time and energy consumption via adaptive computation and communication\u0000compression in CFEL. By theoretically analyzing how local update frequency and\u0000gradient compression affect the convergence error bound in CFEL, we develop an\u0000efficient online control algorithm for HCEF to dynamically determine local\u0000update frequencies and compression ratios for heterogeneous devices.\u0000Experimental results show that compared with prior schemes, the proposed HCEF\u0000scheme can maintain higher model accuracy while reducing training latency and\u0000improving energy efficiency simultaneously.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"19 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211117","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Hybrid Vectorized Merge Sort on ARM NEON","authors":"Jincheng Zhou, Jin Zhang, Xiang Zhang, Tiaojie Xiao, Di Ma, Chunye Gong","doi":"arxiv-2409.03970","DOIUrl":"https://doi.org/arxiv-2409.03970","url":null,"abstract":"Sorting algorithms are the most extensively researched topics in computer\u0000science and serve for numerous practical applications. Although various sorts\u0000have been proposed for efficiency, different architectures offer distinct\u0000flavors to the implementation of parallel sorting. In this paper, we propose a\u0000hybrid vectorized merge sort on ARM NEON, named NEON Merge Sort for short\u0000(NEON-MS). In detail, according to the granted register functions, we first\u0000identify the optimal register number to avoid the register-to-memory access,\u0000due to the write-back of intermediate outcomes. More importantly, following the\u0000generic merge sort framework that primarily uses sorting network for column\u0000sort and merging networks for three types of vectorized merge, we further\u0000improve their structures for high efficiency in an unified asymmetry way: 1) it\u0000makes the optimal sorting networks with few comparators become possible; 2)\u0000hybrid implementation of both serial and vectorized merges incurs the pipeline\u0000with merge instructions highly interleaved. Experiments on a single FT2000+\u0000core show that NEON-MS is 3.8 and 2.1 times faster than std::sort and\u0000boost::block_sort, respectively, on average. Additionally, as compared to the\u0000parallel version of the latter, NEON-MS gains an average speedup of 1.25.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"71 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211119","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CubicML: Automated ML for Distributed ML Systems Co-design with ML Prediction of Performance","authors":"Wei Wen, Quanyu Zhu, Weiwei Chu, Wen-Yen Chen, Jiyan Yang","doi":"arxiv-2409.04585","DOIUrl":"https://doi.org/arxiv-2409.04585","url":null,"abstract":"Scaling up deep learning models has been proven effective to improve\u0000intelligence of machine learning (ML) models, especially for industry\u0000recommendation models and large language models. The co-design of distributed\u0000ML systems and algorithms (to maximize training performance) plays a pivotal\u0000role for its success. As it scales, the number of co-design hyper-parameters\u0000grows rapidly which brings challenges to feasibly find the optimal setup for\u0000system performance maximization. In this paper, we propose CubicML which uses\u0000ML to automatically optimize training performance of distributed ML systems. In\u0000CubicML, we use a ML model as a proxy to predict the training performance for\u0000search efficiency and performance modeling flexibility. We proved that CubicML\u0000can effectively optimize training speed of in-house ads recommendation models\u0000and large language models at Meta.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"75 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211120","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Multi-Task Large Model Training via Data Heterogeneity-aware Model Management","authors":"Yujie Wang, Shenhan Zhu, Fangcheng Fu, Xupeng Miao, Jie Zhang, Juan Zhu, Fan Hong, Yong Li, Bin Cui","doi":"arxiv-2409.03365","DOIUrl":"https://doi.org/arxiv-2409.03365","url":null,"abstract":"Recent foundation models are capable of handling multiple machine learning\u0000(ML) tasks and multiple data modalities with the unified base model structure\u0000and several specialized model components. However, the development of such\u0000multi-task (MT) multi-modal (MM) models poses significant model management\u0000challenges to existing training systems. Due to the sophisticated model\u0000architecture and the heterogeneous workloads of different ML tasks and data\u0000modalities, training these models usually requires massive GPU resources and\u0000suffers from sub-optimal system efficiency. In this paper, we investigate how to achieve high-performance training of\u0000large-scale MT MM models through data heterogeneity-aware model management\u0000optimization. The key idea is to decompose the model execution into stages and\u0000address the joint optimization problem sequentially, including both\u0000heterogeneity-aware workload parallelization and dependency-driven execution\u0000scheduling. Based on this, we build a prototype system and evaluate it on\u0000various large MT MM models. Experiments demonstrate the superior performance\u0000and efficiency of our system, with speedup ratio up to 71% compared to\u0000state-of-the-art training systems.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211152","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Red-Blue Pebbling with Multiple Processors: Time, Communication and Memory Trade-offs","authors":"Toni Böhnlein, Pál András Papp, A. N. Yzelman","doi":"arxiv-2409.03898","DOIUrl":"https://doi.org/arxiv-2409.03898","url":null,"abstract":"The well-studied red-blue pebble game models the execution of an arbitrary\u0000computational DAG by a single processor over a two-level memory hierarchy. We\u0000present a natural generalization to a multiprocessor setting where each\u0000processor has its own limited fast memory, and all processors share unlimited\u0000slow memory. To our knowledge, this is the first thorough study that combines\u0000pebbling and DAG scheduling problems, capturing the computation of general\u0000workloads on multiple processors with memory constraints and communication\u0000costs. Our pebbling model enables us to analyze trade-offs between workload\u0000balancing, communication and memory limitations, and it captures real-world\u0000factors such as superlinear speedups due to parallelization. Our results include upper and lower bounds on the pebbling cost, an analysis\u0000of a greedy pebbling strategy, and an extension of NP-hardness results for\u0000specific DAG classes from simpler models. For our main technical contribution,\u0000we show two inapproximability results that already hold for the long-standing\u0000problem of standard red-blue pebbling: (i) the optimal I/O cost cannot be\u0000approximated to any finite factor, and (ii) the optimal total cost\u0000(I/O+computation) can only be approximated to a limited constant factor, i.e.,\u0000it does not allow for a polynomial-time approximation scheme. These results\u0000also carry over naturally to our multiprocessor pebbling model.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"80 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211121","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GreenWhisk: Emission-Aware Computing for Serverless Platform","authors":"Jayden Serenari, Sreekanth Sreekumar, Kaiwen Zhao, Saurabh Sarkar, Stephen Lee","doi":"arxiv-2409.03029","DOIUrl":"https://doi.org/arxiv-2409.03029","url":null,"abstract":"Serverless computing is an emerging cloud computing abstraction wherein the\u0000cloud platform transparently manages all resources, including explicitly\u0000provisioning resources and geographical load balancing when the demand for\u0000service spikes. Users provide code as functions, and the cloud platform runs\u0000these functions handling all aspects of function execution. While prior work\u0000has primarily focused on optimizing performance, this paper focuses on reducing\u0000the carbon footprint of these systems making variations in grid carbon\u0000intensity and intermittency from renewables transparent to the user. We\u0000introduce GreenWhisk, a carbon-aware serverless computing platform built upon\u0000Apache OpenWhisk, operating in two modes - grid-connected and grid-isolated -\u0000addressing intermittency challenges arising from renewables and the grid's\u0000carbon footprint. Moreover, we develop carbon-aware load balancing algorithms\u0000that leverage energy and carbon information to reduce the carbon footprint. Our\u0000evaluation results show that GreenWhisk can easily incorporate carbon-aware\u0000algorithms, thereby reducing the carbon footprint of functions without\u0000significantly impacting the performance of function execution. In doing so, our\u0000system design enables the integration of new carbon-aware strategies into a\u0000serverless computing platform.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"58 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211122","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}