{"title":"Hardware-middleware system co-design for flexible training of foundation models in the cloud","authors":"Seetharami R. Seelam","doi":"10.1145/3568161.3568317","DOIUrl":null,"url":null,"abstract":"Foundation models are a new class of AI models that are trained on broad data (typically via self-supervision) and that can be used in different downstream tasks. Due to self-supervision and the ability to train on massive amounts of unlabeled data, these models grew to have hundreds of billions of parameters, and they take many months on hundreds of GPU to train and generate a foundation model. So, AI Systems and middleware are critical to train these foundation models in scalable, cost-effective manner. In this talk, I will discuss the architecture of a new cloud-based AI System to train large scale foundation models. The system is built entirely out of open source software stack from hypervisor to guest operating systems, from container platforms to AI frameworks and libraries. It is natively built into IBM Cloud platform and the hardware and software stack is optimized for training of foundation models on hundreds of GPUs. We trained various foundation models with state-of-the-art accuracy in the shortest time on this platform. I will discuss the architecture, operational experience, and thoughts on the directions for the co-design of hardware and middleware for future AI Systems.","PeriodicalId":436911,"journal":{"name":"Proceedings of the 23rd International Middleware Conference Extended Abstracts","volume":"98 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 23rd International Middleware Conference Extended Abstracts","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3568161.3568317","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1
Abstract
Foundation models are a new class of AI models that are trained on broad data (typically via self-supervision) and that can be applied to many different downstream tasks. Because self-supervision makes it possible to train on massive amounts of unlabeled data, these models have grown to hundreds of billions of parameters, and training a single foundation model can take many months on hundreds of GPUs. AI systems and middleware are therefore critical to training these foundation models in a scalable, cost-effective manner. In this talk, I will discuss the architecture of a new cloud-based AI system for training large-scale foundation models. The system is built entirely from an open-source software stack, from the hypervisor to guest operating systems and from container platforms to AI frameworks and libraries. It is natively built into the IBM Cloud platform, and the hardware and software stack is optimized for training foundation models on hundreds of GPUs. We have trained a variety of foundation models to state-of-the-art accuracy in the shortest time on this platform. I will discuss the architecture, our operational experience, and thoughts on directions for the co-design of hardware and middleware for future AI systems.
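To make the kind of multi-GPU training workload described above concrete, the sketch below shows a minimal data-parallel training loop. The abstract does not name a specific AI framework, so PyTorch with DistributedDataParallel over NCCL, the toy model, and all hyperparameters here are assumptions for illustration only, not the system's actual stack. At the scale described (hundreds of GPUs), the same per-process pattern would be launched across many nodes by the container platform.

```python
# Illustrative sketch only: framework choice (PyTorch/DDP/NCCL), model, and
# hyperparameters are assumptions, not details from the talk.
# Launch on one node with, e.g.: torchrun --nproc_per_node=8 train.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE in the environment.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model; a real foundation model would be a large transformer,
    # typically sharded with techniques such as FSDP or pipeline parallelism.
    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10):
        # Synthetic batch; real training streams sharded, unlabeled corpora.
        x = torch.randn(32, 1024, device=local_rank)
        loss = model(x).pow(2).mean()  # stand-in for a self-supervised loss
        optimizer.zero_grad()
        loss.backward()  # DDP all-reduces gradients across GPUs here
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

In a multi-node cloud deployment, a launcher supplies the rendezvous endpoint and per-node ranks, and the GPU-to-GPU gradient traffic in the backward pass is exactly where the hardware-middleware co-design (interconnect, topology-aware placement, collective libraries) pays off.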