nnWeb: Towards efficient WebGPU-based DNN inference via automatic collaborative offloading
Jiawei Liu, Bing Dong, Weilong Wang, Muhan Yuan, Borui Li, Zhao-Dong Xu, Shuai Wang
Computer Networks, Volume 270, Article 111489, DOI: 10.1016/j.comnet.2025.111489, published 2025-07-01.
In-browser neural network inference offers the promise of cross-platform AI applications, but it faces severe latency and energy challenges on resource-constrained devices. In this paper, we present nnWeb, a WebGPU-based in-browser neural network inference framework with optimized latency and energy efficiency. nnWeb dynamically partitions neural networks and facilitates collaborative offloading between the client browser and the server. nnWeb operates in two phases: (1) layer-wise isolation-based profiling, which predicts per-layer execution latency and energy on heterogeneous hardware; and (2) asynchronous execution-based DNN partitioning, which continuously monitors network bandwidth and device load to select the optimal partition point using WebGPU's native pipeline parallelism, minimizing total latency or energy consumption by solving a closed-form optimization at runtime. Extensive evaluation on various in-browser AI models and networking conditions shows that nnWeb achieves an average reduction of 30% to 52% in total inference latency compared with static partitioning. Moreover, nnWeb realizes energy savings ranging from 11.3% to 44.0% compared with standalone browser inference.
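To make the partition-point selection described in the abstract concrete, the sketch below illustrates the general idea behind collaborative DNN offloading: given per-layer latency profiles for the browser and the server, the size of each layer's output activation, and the currently measured bandwidth, enumerate candidate split points and pick the one that minimizes client compute + transfer + server compute. This is a minimal, hypothetical illustration; the `LayerProfile` interface, `selectPartitionPoint` function, and the example numbers are assumptions for clarity and do not reflect nnWeb's actual cost model, its energy objective, or its use of WebGPU pipeline parallelism.

```typescript
// Hypothetical sketch of latency-driven partition-point selection for
// browser-server collaborative DNN offloading. Not nnWeb's implementation.

interface LayerProfile {
  clientLatencyMs: number; // profiled per-layer latency in the browser
  serverLatencyMs: number; // profiled per-layer latency on the server
  outputBytes: number;     // size of the layer's output activation
}

// Return k, the number of layers to run in the browser before handing off
// to the server. k = 0 means "offload everything"; k = layers.length means
// "run fully local". Transfer of the final result back is omitted here.
function selectPartitionPoint(
  layers: LayerProfile[],
  inputBytes: number,
  bandwidthBps: number,
): number {
  let bestK = layers.length; // default: run everything in the browser
  let bestLatencyMs = Number.POSITIVE_INFINITY;

  for (let k = 0; k <= layers.length; k++) {
    const clientMs = layers
      .slice(0, k)
      .reduce((sum, l) => sum + l.clientLatencyMs, 0);
    const serverMs = layers
      .slice(k)
      .reduce((sum, l) => sum + l.serverLatencyMs, 0);

    // Bytes crossing the network at the partition boundary: the model input
    // when k = 0, the k-th layer's activation otherwise, nothing when fully local.
    const boundaryBytes =
      k === layers.length ? 0 : k === 0 ? inputBytes : layers[k - 1].outputBytes;
    const transferMs = ((boundaryBytes * 8) / bandwidthBps) * 1000;

    const totalMs = clientMs + transferMs + serverMs;
    if (totalMs < bestLatencyMs) {
      bestLatencyMs = totalMs;
      bestK = k;
    }
  }
  return bestK;
}

// Example with made-up profiles: re-evaluate the split whenever the
// monitored bandwidth or device load changes.
const profiles: LayerProfile[] = [
  { clientLatencyMs: 4.2, serverLatencyMs: 0.9, outputBytes: 1_200_000 },
  { clientLatencyMs: 6.8, serverLatencyMs: 1.1, outputBytes: 300_000 },
  { clientLatencyMs: 3.1, serverLatencyMs: 0.7, outputBytes: 40_000 },
];
console.log(selectPartitionPoint(profiles, 600_000, 20_000_000)); // 20 Mbps link
```

In a runtime system of this kind, the enumeration (or a closed-form equivalent over the same cost terms) would be repeated as bandwidth and device load drift, which is what allows the partition point to adapt during execution.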
Journal introduction:
Computer Networks is an international, archival journal providing complete coverage of all topics of interest to those involved in the computer communications networking area. The audience includes researchers, managers, and operators of networks, as well as designers and implementers. The Editorial Board will consider any material for publication that is of interest to those groups.