{"title":"TCB: Accelerating Transformer Inference Services with Request Concatenation","authors":"Boqian Fu, Fahao Chen, Peng Li, Deze Zeng","doi":"10.1145/3545008.3545052","DOIUrl":"https://doi.org/10.1145/3545008.3545052","url":null,"abstract":"The Transformer has come to dominate natural language processing thanks to its strong capability to learn from sequential input data. In recent years, various computing and networking optimizations have been proposed to improve Transformer training efficiency. Transformer inference, however, despite being the core of many AI services, has seldom been studied. A key challenge of Transformer inference is variable-length input. To align these inputs, existing work proposes batching schemes that pad zeros, which unfortunately introduces significant computational redundancy. Moreover, existing Transformer inference studies are separated from the serving system as a whole, where both request batching and request scheduling are critical and interact in complex ways. To fill this research gap, we propose TCB, a Transformer inference system with a novel ConcatBatching scheme and a jointly designed online scheduling algorithm. ConcatBatching minimizes computational redundancy by concatenating multiple requests so that batch rows can be aligned with fewer padded zeros. Moreover, we conduct a systemic study by designing an online request scheduling algorithm aware of ConcatBatching. This scheduling algorithm needs no future request information and has provable theoretical guarantees. Experimental results show that TCB significantly outperforms the state of the art.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126565626","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
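The padding redundancy that ConcatBatching targets can be illustrated with a toy cost comparison. This is a hypothetical sketch: the function names and the first-fit packing rule are illustrative assumptions, not TCB's actual algorithm.

```python
# Toy comparison of zero-padded batching vs. request concatenation.
# Illustrative sketch only; not TCB's actual ConcatBatching scheme.

def padded_batch_cost(lengths, batch_size):
    """Token slots processed when each batch is padded to its longest request."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        total += len(batch) * max(batch)  # every row padded to the batch max
    return total

def concat_batch_cost(lengths, row_capacity):
    """Token slots processed when requests are packed into fixed-capacity rows."""
    rows = []  # used capacity per batch row
    for length in sorted(lengths, reverse=True):  # first-fit decreasing
        for i, used in enumerate(rows):
            if used + length <= row_capacity:
                rows[i] += length
                break
        else:
            rows.append(length)  # open a new row
    return len(rows) * row_capacity

requests = [5, 60, 12, 48, 7, 33, 20, 64]  # request lengths in tokens
print(padded_batch_cost(requests, 4))   # 496 slots for 249 real tokens
print(concat_batch_cost(requests, 64))  # 320 slots: far less padding waste
```

In this example, padded batching processes 496 token slots to serve 249 real tokens, while concatenating the same requests into 64-token rows processes only 320 — the kind of redundancy reduction ConcatBatching aims for.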
{"title":"Characterizing Job Microarchitectural Profiles at Scale: Dataset and Analysis","authors":"Kangjin Wang, Ying Li, Cheng Wang, Tong Jia, K. Chow, Yang Wen, Yaoyong Dou, Guoyao Xu, Chuanjia Hou, Jie Yao, Liping Zhang","doi":"10.1145/3545008.3545026","DOIUrl":"https://doi.org/10.1145/3545008.3545026","url":null,"abstract":"Understanding the microarchitectural resource characteristics of datacenter jobs has become increasingly critical for guaranteeing job performance while improving resource utilization. Prior work studied the resource characteristics of datacenter jobs at the OS level, but little reveals their deep, detailed characteristics at the microarchitecture level, due to the lack of related open traces. In this paper, we provide a new open trace, AMTrace (Alibaba Microarchitecture Trace), which is profiled from 8,577 high-end physical hosts in Alibaba’s datacenter by a hardware/software co-designed monitoring method. AMTrace provides the microarchitectural metrics of 9.8 × 10^5 Linux containers at “Per-Container-Per-Logic CPU” granularity. Different from existing open traces, AMTrace provides a new perspective for analyzing the microarchitectural resource characteristics of datacenter jobs. Based on AMTrace, we first reveal the uneven resource usage of jobs across multiple logic CPUs. Then, we analyze the impact of CPU and memory-bandwidth contention on job performance. Finally, we analyze job performance under different CPU provisioning modes from a microarchitecture perspective. These analyses lead to constructive insights for datacenter resource management and optimization. Furthermore, we discuss possible research opportunities on AMTrace, and we believe that AMTrace will inspire more exciting research on microarchitecture and resource management.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"106 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127659883","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic Strategies for High Performance Training of Knowledge Graph Embeddings","authors":"Anwesh Panda, Sathish S. Vadhiyar","doi":"10.1145/3545008.3545075","DOIUrl":"https://doi.org/10.1145/3545008.3545075","url":null,"abstract":"Knowledge graph embeddings (KGEs) are low-dimensional representations of entities and the relations between them. They can be used for various downstream tasks such as triple classification, link prediction, and knowledge base completion. Training these embeddings on a large dataset takes a huge amount of time. This work proposes strategies to make KGE training faster in a distributed-memory parallel environment. The first strategy chooses between an all-gather and an all-reduce operation based on the sparsity of the gradient matrix. The second focuses on selecting the gradient vectors that contribute most to reducing the loss. The third employs gradient quantization to reduce the number of bits to be communicated. The fourth splits the knowledge graph triples by relation, so that inter-node communication for the gradient matrix corresponding to the relation embedding matrix is eliminated. The fifth and last strategy selects the negative triples that the model finds difficult to classify. Combining all of these strategies allows us to train the ComplEx Knowledge Graph Embedding (KGE) model on the FB250K dataset in 6 hours on 16 nodes, compared to the 11.5 hours taken to train on the same number of nodes without any of the above optimizations. This reduction in training time is also accompanied by a significant improvement in Mean Reciprocal Rank (MRR) and Triple Classification Accuracy (TCA).","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132779897","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
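The first strategy above — switching between collectives based on gradient sparsity — can be sketched with a simple communication-volume model. The cost formulas and the function name here are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical cost-model sketch of the sparsity-based collective choice:
# a sparse gradient is cheaper to exchange as (row index, row) pairs via
# all-gather; a dense one is cheaper via all-reduce. Formulas and names
# are illustrative assumptions, not the paper's formulation.

def choose_collective(nonzero_rows, total_rows, embed_dim, workers):
    # ring all-reduce moves roughly 2 * |matrix| values per worker
    allreduce_cost = 2 * total_rows * embed_dim
    # all-gather replicates every worker's nonzero slice (+1 for the row index)
    allgather_cost = workers * nonzero_rows * (embed_dim + 1)
    return "all-gather" if allgather_cost < allreduce_cost else "all-reduce"

# a sparse batch touches few entity embeddings -> gathering indices wins
print(choose_collective(100, 100_000, 200, 16))     # all-gather
# a dense gradient matrix -> reducing the whole matrix wins
print(choose_collective(50_000, 100_000, 200, 16))  # all-reduce
```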
{"title":"Penelope: Peer-to-peer Power Management","authors":"Tapan Srivastava, Huazhe Zhang, H. Hoffmann","doi":"10.1145/3545008.3545047","DOIUrl":"https://doi.org/10.1145/3545008.3545047","url":null,"abstract":"Large-scale distributed computing setups rely on power management systems to enforce tight power budgets. Existing systems use a central authority that redistributes excess power to power-hungry nodes. This central authority, however, is both a single point of failure and a critical bottleneck, especially at large scale. To address these limitations we propose Penelope, a distributed power management system that shifts power through peer-to-peer transactions, ensuring that it remains robust in faulty environments and at large scale. We implement Penelope and compare its achieved performance to SLURM, a centralized power manager, under a variety of power budgets. We find that under normal conditions SLURM and Penelope achieve almost equivalent performance; however, in faulty environments, Penelope achieves 8–15% mean application performance gains over SLURM. At large scale and with increasing message frequency, Penelope maintains its performance, in contrast to centralized approaches, which degrade and become unusable.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"120 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123122667","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HARL: Hierarchical Adaptive Reinforcement Learning Based Auto Scheduler for Neural Networks","authors":"Zining Zhang, Bingsheng He, Zhenjie Zhang","doi":"10.1145/3545008.3545020","DOIUrl":"https://doi.org/10.1145/3545008.3545020","url":null,"abstract":"To perform inference efficiently with neural networks, the underlying tensor programs require sufficient tuning before being deployed into production environments. Usually, an enormous number of tensor program candidates must be explored to find the best-performing one. This is necessary for neural network products to meet the high demands of real-world applications such as natural language processing and autonomous driving. Auto-schedulers are being developed to avoid the need for human intervention. However, due to the gigantic search space and the lack of intelligent search guidance, current auto-schedulers require hours to days of tuning time to find the best-performing tensor program for an entire neural network. In this paper, we propose HARL, a reinforcement learning (RL) based auto-scheduler specifically designed for efficient tensor program exploration. HARL uses a hierarchical RL architecture in which learning-based decisions are made at all levels of search granularity. It also automatically adjusts exploration configurations in real time for faster performance convergence. As a result, HARL improves tensor operator performance by 22% and search speed by 4.3× compared to the state-of-the-art auto-scheduler. Inference performance and search speed are also significantly improved on end-to-end neural networks.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122815920","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploiting Parallelism of Disk Failure Recovery via Partial Stripe Repair for an Erasure-Coded High-Density Storage Server","authors":"Lin Wang, Yuchong Hu, Qian Du, D. Feng, R. Wu, Ingo He, Kevin Zhang","doi":"10.1145/3545008.3545014","DOIUrl":"https://doi.org/10.1145/3545008.3545014","url":null,"abstract":"High-density storage servers (HDSSes), which pack many disks into a single server, are currently used in data centers to save costs (power, cooling, etc.). Erasure coding, which stripes data and provides high availability guarantees, is also commonly deployed in data centers because it costs less than replication. However, when applying erasure coding to a single HDSS, we find that state-of-the-art studies that improve erasure coding’s repair performance in parallel mainly rely on the ample memory footprint of multiple servers, which is quite limited in a single HDSS, leading to memory contention during disk failure recovery. In this paper, for a single HDSS, we analyze the parallelism of disk failure recovery both within each stripe (intra-stripe) and between stripes (inter-stripe), observe that intra-stripe and inter-stripe parallelism are mutually restrictive, and explore how they affect disk failure recovery time. Based on these observations, we propose partial stripe repair (HD-PSR) schemes for the HDSS, which exploit parallelism in both active and passive ways for single-disk recovery. We further propose a cooperative repair strategy to improve multi-disk recovery performance. We prototype HD-PSR and show via Amazon EC2 experiments that the recovery times of a single-disk failure and a multi-disk failure can be reduced by up to 71.7% and 52.5%, respectively, over existing erasure-coded repair schemes in high-density storage.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129974735","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Transparent load balancing of MPI programs using OmpSs-2@Cluster and DLB","authors":"J. Mena, O. Shaaban, Víctor López, Marta Garcia, P. Carpenter, E. Ayguadé, Jesús Labarta","doi":"10.1145/3545008.3545045","DOIUrl":"https://doi.org/10.1145/3545008.3545045","url":null,"abstract":"Load imbalance is a long-standing source of inefficiency in high-performance computing. The situation has only grown worse as applications and systems increase in complexity, e.g., adaptive mesh refinement, DVFS, memory hierarchies, power and thermal management, and manufacturing processes. Load balancing is often implemented in the application, but this obscures application logic and may require extensive code refactoring. This paper presents an automated and transparent dynamic load balancing approach for MPI applications with OmpSs-2 tasks, which relieves applications of this burden. Only local and trivial changes are required in the application. Our approach exploits the ability of OmpSs-2@Cluster to offload tasks for execution on other nodes, and it reallocates compute resources among ranks using the Dynamic Load Balancing (DLB) library. It employs LeWI to react to fine-grained load imbalances and DROM to address coarse-grained load imbalances by reserving cores on other nodes that can be reclaimed on demand. We use an expander graph to limit the amount of point-to-point communication and state. The results show a 46% reduction in time-to-solution for micro-scale solid mechanics on 32 nodes and a 20% reduction beyond DLB for n-body on 16 nodes when one node is running slow. A synthetic benchmark shows that performance is within 10% of optimal for an imbalance of up to 2.0 on 8 nodes. All software is released open source.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121271581","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Data-aware Learned Index Scheme for Efficient Writes","authors":"Li Liu, Chun-hua Li, Zhou Zhang, Yuhan Liu, Ke Zhou, Ji Zhang","doi":"10.1145/3545008.3545077","DOIUrl":"https://doi.org/10.1145/3545008.3545077","url":null,"abstract":"Index structures are essential for efficient data access and overall system performance in storage systems. A learned index uses recursive index models to replace a range index structure (such as a B+-tree) by predicting the position of a lookup key in a dataset. This new paradigm greatly reduces query time and index size; however, it supports only read-only workloads. Although some studies reserve gaps between keys for new data to support updates, they incur high memory overhead and shifting costs when large amounts of data are inserted. In this paper, we propose EWALI, a highly scalable data-aware learned index scheme that constructs index models based on a lightweight data-aware partition algorithm. When the data distribution changes, EWALI automatically splits the affected leaf nodes and retrains the corresponding models to accommodate different workloads. In addition, EWALI adopts alternating dual buffers to handle new data and a delayed-update mechanism to merge data, greatly reducing write locking and improving write performance. We evaluate EWALI with real-world and synthetic datasets. Extensive experimental results show that EWALI reduces write latency by 60.9% and 33.7% compared with the state-of-the-art FITing-Tree and XIndex, respectively, and achieves up to a 3.1× range-query performance improvement over XIndex.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121245804","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
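The alternating dual-buffer, delayed-merge write path described above can be sketched as follows. This is a minimal single-threaded illustration, with a plain sorted list standing in for the learned models; the class and method names are assumptions, not EWALI's implementation.

```python
# Sketch of a dual-buffer write path with delayed merge: new keys go to
# an active in-memory buffer; when it fills, buffers swap and the frozen
# buffer is merged into the sorted base. Illustrative only; a real system
# would merge in a background thread and retrain models on the new data.
import bisect

class DualBufferIndex:
    def __init__(self, buffer_limit=4):
        self.base = []       # sorted (key, value) list standing in for the index
        self.active = {}     # buffer receiving new writes
        self.frozen = {}     # buffer being merged; writes never block on it
        self.buffer_limit = buffer_limit

    def put(self, key, value):
        self.active[key] = value
        if len(self.active) >= self.buffer_limit:
            self._swap_and_merge()

    def _swap_and_merge(self):
        # swap so writes continue into a fresh active buffer,
        # then merge the frozen one into the sorted base
        self.frozen, self.active = self.active, {}
        for k, v in sorted(self.frozen.items()):
            i = bisect.bisect_left(self.base, (k,))
            if i < len(self.base) and self.base[i][0] == k:
                self.base[i] = (k, v)       # overwrite existing key
            else:
                self.base.insert(i, (k, v))
        self.frozen = {}

    def get(self, key):
        for buf in (self.active, self.frozen):  # newest data first
            if key in buf:
                return buf[key]
        i = bisect.bisect_left(self.base, (key,))
        if i < len(self.base) and self.base[i][0] == key:
            return self.base[i][1]
        return None
```

A usage sketch: writes always land in a small hash buffer (no shifting in the sorted base on the critical path), and the merge cost is paid in batches when a buffer fills.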
{"title":"Aperiodic Local SGD: Beyond Local SGD","authors":"Hao Zhang, Tingting Wu, Siyao Cheng, Jie Liu","doi":"10.1145/3545008.3545013","DOIUrl":"https://doi.org/10.1145/3545008.3545013","url":null,"abstract":"Variations of stochastic gradient descent (SGD) are at the core of training deep neural network models. However, in distributed deep learning, where multiple computing devices and data segments are employed in training, the performance of SGD can be significantly limited by the overhead of gradient communication. Local SGD methods are designed to overcome this bottleneck by averaging individual gradients trained on parallel workers after multiple local iterations. Currently, in both theoretical analyses and practical applications, most studies employ a periodic synchronization scheme by default, while few focus on aperiodic schemes that could yield better-performing models with limited computation and communication overhead. In this paper, we investigate local SGD with an arbitrary synchronization scheme to answer two questions: (1) Is the periodic synchronization scheme the best? (2) If not, what is the optimal one? First, for any synchronization scheme, we derive the performance bound under fixed overhead and formulate the performance optimization under given computation and communication constraints. Then we find a succinct property of the optimal scheme: the number of local iterations decreases as training continues, which indicates that the periodic scheme is suboptimal. Furthermore, with some reasonable approximations, we obtain an explicit form of the optimal scheme and propose Aperiodic Local SGD (ALSGD) as an improved substitute for local SGD without any overhead increase. Our experiments also confirm that, with the same computation and communication overhead, ALSGD outperforms local SGD, especially on heterogeneous data.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121496690","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
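The optimal-scheme property described above — local iteration counts that shrink as training proceeds — can be illustrated with a toy schedule generator. The geometric decay rule and parameter names are illustrative assumptions, not ALSGD's derived optimal form.

```python
# Toy aperiodic synchronization schedule: the number of local iterations
# between averaging steps shrinks as training proceeds. The geometric
# decay rule is an illustrative assumption, not ALSGD's derived scheme.

def aperiodic_schedule(total_iters, first_interval, decay=0.8, min_interval=1):
    """Return local-iteration counts between successive synchronizations."""
    schedule, done, interval = [], 0, float(first_interval)
    while done < total_iters:
        step = max(min_interval, round(interval))
        step = min(step, total_iters - done)  # don't overshoot the budget
        schedule.append(step)
        done += step
        interval *= decay  # synchronize more and more often
    return schedule

sched = aperiodic_schedule(100, 20)
print(sched)  # starts at 20 local steps per round, shrinks toward 1
```

The generated counts are non-increasing and sum exactly to the iteration budget, matching the qualitative shape the analysis predicts for the optimal scheme.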
{"title":"Formulating Interference-aware Data Delivery Strategies in Edge Storage Systems","authors":"Xiaoyu Xia, Feifei Chen, Qiang He, Guangming Cui, John C. Grundy, Mohamed Abdelrazek, Fang Dong","doi":"10.1145/3545008.3545078","DOIUrl":"https://doi.org/10.1145/3545008.3545078","url":null,"abstract":"Networked edge servers constitute an edge storage system in edge computing (EC). Upon users’ requests, data must be delivered to users from edge servers in the system or from the cloud. Existing studies of edge storage systems have unfortunately neglected the fact that an excessive number of users accessing the same edge server for data may seriously impact users’ data rates due to wireless interference. Thus, users must first be properly allocated to edge servers to ensure their data rates. After that, requested data can be delivered to users so as to minimize their average data delivery latency. In this paper, we formulate this Interference-aware Data Delivery at the network Edge (IDDE) problem and demonstrate its NP-hardness. To tackle it effectively and efficiently, we propose IDDE-G, a novel approach that first finds a Nash equilibrium as the strategy for allocating users, and then finds an approximate strategy for delivering requested data to the allocated users. We analyze the performance of IDDE-G theoretically and evaluate it experimentally to demonstrate its effectiveness and efficiency in solving the IDDE problem.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132629953","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}