Cody Hao Yu, Xingjian Shi, Haichen Shen, Zhi Chen, Mu Li, Yida Wang
{"title":"Lorien: Efficient Deep Learning Workloads Delivery","authors":"Cody Hao Yu, Xingjian Shi, Haichen Shen, Zhi Chen, Mu Li, Yida Wang","doi":"10.1145/3472883.3486973","DOIUrl":null,"url":null,"abstract":"Modern deep learning systems embrace the compilation idea to self generate code of a deep learning model to catch up the rapidly changed deep learning operators and newly emerged hardware platforms. The performance of the self-generated code is guaranteed via auto-tuning frameworks which normally take a long time to find proper execution schedules for the given operators, which hurts both user experiences and time-to-the-market in terms of model developments and deployments. To efficiently deliver a high-performance schedule upon requests, in this paper, we present Lorien, an open source infrastructure, to tune the operators and orchestrate the tuned schedules in a systematic way. Lorien is designed to be extensible to state-of-the-art auto-tuning frameworks, and scalable to coordinate a number of compute resources for its tuning tasks with fault tolerance. We leveraged Lorien to extract thousands of operator-level tuning tasks from 29 widely-used models in Gluon CV model zoo [22], and tune them on x86 CPU, ARM CPU, and NVIDIA GPU to construct a database for queries. In addition, to deliver reasonably high performance schedules for unseen workloads in seconds or minutes, Lorien integrates an AutoML solution to train a performance cost model with collected large-scale datasets. Our evaluation shows that the AutoML-based solution is accurate enough to enable zero-shot tuning, which does not fine-tune the cost model during tuning nor perform on-device measurements, and is able to find decent schedules with at least 10x less time than existing auto-tuning frameworks.","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","volume":"17 3 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"7","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3472883.3486973","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 7
Abstract
Modern deep learning systems embrace the compilation idea to self generate code of a deep learning model to catch up the rapidly changed deep learning operators and newly emerged hardware platforms. The performance of the self-generated code is guaranteed via auto-tuning frameworks which normally take a long time to find proper execution schedules for the given operators, which hurts both user experiences and time-to-the-market in terms of model developments and deployments. To efficiently deliver a high-performance schedule upon requests, in this paper, we present Lorien, an open source infrastructure, to tune the operators and orchestrate the tuned schedules in a systematic way. Lorien is designed to be extensible to state-of-the-art auto-tuning frameworks, and scalable to coordinate a number of compute resources for its tuning tasks with fault tolerance. We leveraged Lorien to extract thousands of operator-level tuning tasks from 29 widely-used models in Gluon CV model zoo [22], and tune them on x86 CPU, ARM CPU, and NVIDIA GPU to construct a database for queries. In addition, to deliver reasonably high performance schedules for unseen workloads in seconds or minutes, Lorien integrates an AutoML solution to train a performance cost model with collected large-scale datasets. Our evaluation shows that the AutoML-based solution is accurate enough to enable zero-shot tuning, which does not fine-tune the cost model during tuning nor perform on-device measurements, and is able to find decent schedules with at least 10x less time than existing auto-tuning frameworks.