Workload Allocation for Distributed Coded Machine Learning: From Offline Model-Based to Online Model-Free

Yuxuan Jiang, Qiang Ye, E. T. Fapi, Wenting Sun, Fudong Li
{"title":"Workload Allocation for Distributed Coded Machine Learning: From Offline Model-Based to Online Model-Free","authors":"Yuxuan Jiang, Qiang Ye, E. T. Fapi, Wenting Sun, Fudong Li","doi":"10.1109/IOTM.001.2300247","DOIUrl":null,"url":null,"abstract":"Distributed machine learning (ML) is an important Internet-of-Things (IoT) application. In traditional partitioned learning (PL) paradigm, a coordinator divides a high-dimensional dataset into subsets, which are processed on IoT devices. The execution time of PL can be seriously bottlenecked by slow devices named stragglers. To mitigate the negative impact of stragglers, distributed coded machine learning (DCML) was recently proposed to inject redundancy into the subsets using coding techniques. With this redundancy, the coordinator no longer requires the processing results from all devices, but only from a subgroup, where stragglers can be eliminated. This article aims to bring the burgeoning field of DCML to the wider community. After outlining the principles of DCML, we focus on its workload allocation, which addresses the appropriate level of injected redundancy to minimize the overall execution time. We highlight the fundamental trade-off and point out two critical design choices in workload allocation: model-based versus model-free, and offline versus online. Despite the predominance of offline model-based approaches in the literature, online model-based approaches also have a wide array of use case scenarios, but remain largely unexplored. At the end of the article, we propose the first online model-free workload allocation scheme for DCML, and identify future paths and opportunities along this direction.","PeriodicalId":235472,"journal":{"name":"IEEE Internet of Things Magazine","volume":"44 1","pages":"100-106"},"PeriodicalIF":0.0000,"publicationDate":"2024-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Internet of Things Magazine","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IOTM.001.2300247","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Distributed machine learning (ML) is an important Internet-of-Things (IoT) application. In the traditional partitioned learning (PL) paradigm, a coordinator divides a high-dimensional dataset into subsets, which are processed on IoT devices. The execution time of PL can be severely bottlenecked by slow devices known as stragglers. To mitigate the negative impact of stragglers, distributed coded machine learning (DCML) was recently proposed to inject redundancy into the subsets using coding techniques. With this redundancy, the coordinator no longer requires the processing results from all devices, but only from a subgroup, so stragglers can be excluded. This article aims to bring the burgeoning field of DCML to the wider community. After outlining the principles of DCML, we focus on its workload allocation, which determines the appropriate level of injected redundancy to minimize the overall execution time. We highlight the fundamental trade-off and point out two critical design choices in workload allocation: model-based versus model-free, and offline versus online. Although offline model-based approaches predominate in the literature, online model-free approaches also have a wide array of use-case scenarios, yet remain largely unexplored. At the end of the article, we propose the first online model-free workload allocation scheme for DCML and identify future paths and opportunities in this direction.
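To make the coding idea concrete, here is a minimal sketch (not taken from the article) of (n, k) MDS-coded matrix-vector multiplication, a standard DCML building block: the coordinator encodes k sub-tasks into n coded tasks so that any k returned results suffice to recover the full product, and the n - k slowest workers can be ignored. All names and sizes below are illustrative.

```python
import numpy as np

# (n, k) MDS-coded matrix-vector multiplication: n workers, any k
# results suffice to decode. All parameter values are illustrative.
n, k = 5, 3
m, d = 6, 4                                  # A is m x d, m divisible by k
rng = np.random.default_rng(0)
A = rng.standard_normal((m, d))
x = rng.standard_normal(d)

blocks = np.split(A, k)                      # uncoded sub-tasks A_1..A_k
G = np.vander(np.arange(1, n + 1), k)        # n x k Vandermonde generator:
                                             # any k of its rows are invertible
coded = [sum(G[i, j] * blocks[j] for j in range(k)) for i in range(n)]

# Worker i computes coded[i] @ x. Suppose workers {0, 2, 4} respond
# first; the remaining two are stragglers and are simply ignored.
fastest = [0, 2, 4]
partial = np.stack([coded[i] @ x for i in fastest])

# Decoding reduces to a k x k linear system picked out by the fast rows.
decoded = np.linalg.solve(G[fastest], partial)
y = decoded.reshape(-1)
assert np.allclose(y, A @ x)                 # matches the uncoded product
```

The workload-allocation question is then how much redundancy to inject: a larger n relative to k shortens the wait for the slowest needed worker, but inflates every worker's load. An online model-free allocator can learn this trade-off purely from observed round times, with no straggler model. The epsilon-greedy sketch below is a generic bandit formulation given for illustration only, not the scheme proposed in the article; the toy round simulator and all constants are assumptions.

```python
import random

random.seed(0)
levels = [0.0, 0.5, 1.0]                 # candidate redundancy ratios (hypothetical)
stats = {r: [0, 0.0] for r in levels}    # per-level [round count, mean time]
EPS = 0.1                                # exploration probability

def run_round(r, n=10):
    """Toy stand-in for one DCML round under redundancy ratio r."""
    k = max(1, round(n / (1.0 + r)))     # fewer results needed as r grows
    finish = sorted((1.0 + r) * random.expovariate(1.0) for _ in range(n))
    return finish[k - 1]                 # wait only for the fastest k workers

for t in range(500):
    if t < len(levels):                  # sample every level once
        r = levels[t]
    elif random.random() < EPS:          # explore
        r = random.choice(levels)
    else:                                # exploit the fastest level so far
        r = min(levels, key=lambda l: stats[l][1])
    elapsed = run_round(r)
    count, mean = stats[r]
    stats[r] = [count + 1, mean + (elapsed - mean) / (count + 1)]

print("estimated best redundancy ratio:", min(levels, key=lambda l: stats[l][1]))
```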