{"title":"TCB:使用请求连接加速变压器推理服务","authors":"Boqian Fu, Fahao Chen, Peng Li, Deze Zeng","doi":"10.1145/3545008.3545052","DOIUrl":null,"url":null,"abstract":"Transformer has dominated the field of natural language processing because of its strong capability in learning from sequential input data. In recent years, various computing and networking optimizations have been proposed for improving transformer training efficiency. However, transformer inference, as the core of many AI services, has been seldom studied. A key challenge of transformer inference is variable-length input. In order to align these input, existing work has proposed batching schemes by padding zeros, which unfortunately introduces significant computational redundancy. Moreover, existing transformer inference studies are separated from the whole serving system, where both request batching and request scheduling are critical and they have complex interaction. To fill the research gap, we propose TCB, a Transformer inference system with a novel ConcatBatching scheme as well as a jointly designed online scheduling algorithm. ConcatBatching minimizes computational redundancy by concatenating multiple requests, so that batch rows can be aligned with reduced padded zeros. Moreover, we conduct a systemic study by designing an online request scheduling algorithm aware of ConcatBatching. This scheduling algorithm needs no future request information and has provable theoretical guarantee. Experimental results show that TCB can significantly outperform state-of-the-art.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"24 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"TCB: Accelerating Transformer Inference Services with Request Concatenation\",\"authors\":\"Boqian Fu, Fahao Chen, Peng Li, Deze Zeng\",\"doi\":\"10.1145/3545008.3545052\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Transformer has dominated the field of natural language processing because of its strong capability in learning from sequential input data. In recent years, various computing and networking optimizations have been proposed for improving transformer training efficiency. However, transformer inference, as the core of many AI services, has been seldom studied. A key challenge of transformer inference is variable-length input. In order to align these input, existing work has proposed batching schemes by padding zeros, which unfortunately introduces significant computational redundancy. Moreover, existing transformer inference studies are separated from the whole serving system, where both request batching and request scheduling are critical and they have complex interaction. To fill the research gap, we propose TCB, a Transformer inference system with a novel ConcatBatching scheme as well as a jointly designed online scheduling algorithm. ConcatBatching minimizes computational redundancy by concatenating multiple requests, so that batch rows can be aligned with reduced padded zeros. Moreover, we conduct a systemic study by designing an online request scheduling algorithm aware of ConcatBatching. This scheduling algorithm needs no future request information and has provable theoretical guarantee. 
Experimental results show that TCB can significantly outperform state-of-the-art.\",\"PeriodicalId\":360504,\"journal\":{\"name\":\"Proceedings of the 51st International Conference on Parallel Processing\",\"volume\":\"24 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-08-29\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 51st International Conference on Parallel Processing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3545008.3545052\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 51st International Conference on Parallel Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3545008.3545052","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
TCB: Accelerating Transformer Inference Services with Request Concatenation
Transformers have dominated the field of natural language processing because of their strong capability to learn from sequential input data. In recent years, various computing and networking optimizations have been proposed to improve transformer training efficiency. However, transformer inference, the core of many AI services, has seldom been studied. A key challenge of transformer inference is handling variable-length inputs. To align these inputs, existing work batches requests by padding them with zeros, which unfortunately introduces significant computational redundancy. Moreover, existing transformer inference studies treat the model in isolation from the serving system as a whole, in which request batching and request scheduling are both critical and interact in complex ways. To fill this research gap, we propose TCB, a transformer inference system with a novel ConcatBatching scheme and a jointly designed online scheduling algorithm. ConcatBatching minimizes computational redundancy by concatenating multiple requests, so that batch rows can be aligned with fewer padded zeros. Moreover, we conduct a systematic study by designing an online request scheduling algorithm that is aware of ConcatBatching. This scheduling algorithm requires no future request information and has a provable theoretical guarantee. Experimental results show that TCB significantly outperforms state-of-the-art approaches.
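As a rough illustration of the padding reduction that ConcatBatching targets, the sketch below packs variable-length requests into fixed-length batch rows and compares the zero-padding cost against conventional one-request-per-row batching. This is a minimal sketch only: the first-fit packing heuristic, the row capacity, and the example request lengths are illustrative assumptions, not the paper's actual algorithm.

# Minimal sketch of the padding-reduction idea behind ConcatBatching.
# NOT the paper's algorithm: the first-fit heuristic and the fixed row
# capacity below are illustrative assumptions.

from typing import List


def padded_tokens(rows: List[List[int]], row_len: int) -> int:
    """Total zero-padding tokens when every row is padded to row_len."""
    return sum(row_len - sum(row) for row in rows)


def naive_batching(lengths: List[int]) -> List[List[int]]:
    """Conventional batching: one request per row, padded to the max length."""
    return [[n] for n in lengths]


def concat_batching(lengths: List[int], row_len: int) -> List[List[int]]:
    """Concatenate several requests into one row (first-fit decreasing),
    so rows are filled with real tokens instead of padded zeros."""
    rows: List[List[int]] = []
    for n in sorted(lengths, reverse=True):
        for row in rows:
            if sum(row) + n <= row_len:  # request still fits in this row
                row.append(n)
                break
        else:
            rows.append([n])  # no existing row fits; open a new one
    return rows


if __name__ == "__main__":
    lengths = [120, 45, 30, 90, 15, 60, 25]  # hypothetical request lengths (tokens)
    row_len = max(lengths)                   # pad/concat target per batch row

    naive = naive_batching(lengths)
    concat = concat_batching(lengths, row_len)

    print("naive  rows:", len(naive), " padding:", padded_tokens(naive, row_len))
    print("concat rows:", len(concat), " padding:", padded_tokens(concat, row_len))

On the hypothetical lengths above, concatenation shrinks the batch from 7 rows with 455 padded tokens to 4 rows with 95 padded tokens. The paper's actual scheme goes further by co-designing this packing with an online request scheduler.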