{"title":"Hardware-middleware system co-design for flexible training of foundation models in the cloud","authors":"Seetharami R. Seelam","doi":"10.1145/3568161.3568317","DOIUrl":null,"url":null,"abstract":"Foundation models are a new class of AI models that are trained on broad data (typically via self-supervision) and that can be used in different downstream tasks. Due to self-supervision and the ability to train on massive amounts of unlabeled data, these models grew to have hundreds of billions of parameters, and they take many months on hundreds of GPU to train and generate a foundation model. So, AI Systems and middleware are critical to train these foundation models in scalable, cost-effective manner. In this talk, I will discuss the architecture of a new cloud-based AI System to train large scale foundation models. The system is built entirely out of open source software stack from hypervisor to guest operating systems, from container platforms to AI frameworks and libraries. It is natively built into IBM Cloud platform and the hardware and software stack is optimized for training of foundation models on hundreds of GPUs. We trained various foundation models with state-of-the-art accuracy in the shortest time on this platform. I will discuss the architecture, operational experience, and thoughts on the directions for the co-design of hardware and middleware for future AI Systems.","PeriodicalId":436911,"journal":{"name":"Proceedings of the 23rd International Middleware Conference Extended Abstracts","volume":"98 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 23rd International Middleware Conference Extended Abstracts","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3568161.3568317","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1
Abstract
Foundation models are a new class of AI models that are trained on broad data (typically via self-supervision) and that can be applied to many different downstream tasks. Because self-supervision makes it possible to train on massive amounts of unlabeled data, these models have grown to hundreds of billions of parameters, and training a single foundation model can take many months on hundreds of GPUs. AI systems and middleware are therefore critical to training these foundation models in a scalable, cost-effective manner. In this talk, I will discuss the architecture of a new cloud-based AI system for training large-scale foundation models. The system is built entirely from an open-source software stack, from the hypervisor to guest operating systems and from container platforms to AI frameworks and libraries. It is natively built into the IBM Cloud platform, and the hardware and software stack is optimized for training foundation models on hundreds of GPUs. We have trained a variety of foundation models to state-of-the-art accuracy in the shortest time on this platform. I will discuss the architecture, our operational experience, and thoughts on directions for the co-design of hardware and middleware for future AI systems.
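To make the kind of multi-GPU training workload described above concrete, the sketch below shows a minimal data-parallel training loop. The abstract does not name a specific AI framework, so PyTorch with DistributedDataParallel over NCCL, the toy model, and all hyperparameters here are assumptions for illustration only, not the system's actual stack. At the scale described (hundreds of GPUs), the same per-process pattern would be launched across many nodes by the container platform.

```python
# Illustrative sketch only: framework choice (PyTorch/DDP/NCCL), model, and
# hyperparameters are assumptions, not details from the talk.
# Launch on one node with, e.g.: torchrun --nproc_per_node=8 train.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE in the environment.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model; a real foundation model would be a large transformer,
    # typically sharded with techniques such as FSDP or pipeline parallelism.
    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10):
        # Synthetic batch; real training streams sharded, unlabeled corpora.
        x = torch.randn(32, 1024, device=local_rank)
        loss = model(x).pow(2).mean()  # stand-in for a self-supervised loss
        optimizer.zero_grad()
        loss.backward()  # DDP all-reduces gradients across GPUs here
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

In a multi-node cloud deployment, a launcher supplies the rendezvous endpoint and per-node ranks, and the GPU-to-GPU gradient traffic in the backward pass is exactly where the hardware-middleware co-design (interconnect, topology-aware placement, collective libraries) pays off.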