Distributed Machine Learning

International Conference of Distributed Computing and Networking
Pub Date: 2024-01-04
DOI: 10.1145/3631461.3632516

Bapi Chatterjee

Volume: 27 s79, Pages: 4-7, Publication type: Journal Article

Citations: 0

Abstract


Web search ranking has become increasingly important as the Internet has grown. As the Web and the number of Web search users have grown, so has the amount of training data available for learning Web ranking models. We investigate learning to rank on a cluster using Web search data composed of 140,000 queries and approximately fourteen million URLs. For datasets much larger than this, distributed computing becomes essential because of both speed and memory constraints. To evaluate the loss or gain incurred by the distributed algorithms we consider, we compare them against a baseline algorithm that has been carefully engineered to train on the full dataset using a single machine. The underlying algorithm is LambdaMART, a boosted tree ranking algorithm in which the split at a given vertex of each decision tree is determined by the split criterion for a particular feature. Our contributions are two-fold. First, we implement a method that speeds up training when the training data fits in the main memory of a single machine by distributing the vertex split computations of the decision trees. The resulting model is equivalent to the one produced by centralized training but is obtained in less time. Second, we develop a training method for the case where the training data exceeds the main memory of a single machine. This second approach is based on data distribution and scales easily to far larger datasets, i.e., billions of examples. Results on a real-world Web dataset indicate significant improvements in training speed.

4 Large-scale Learning to Rank using Boosted Decision Trees
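The idea of distributing vertex split computations can be illustrated with a small sketch. This is not the authors' implementation: it assumes a plain squared-error gain as a stand-in for the LambdaMART split criterion, simulates the workers sequentially in one process, and uses hypothetical function names (`best_split_for_features`, `distributed_best_split`). Each simulated worker scans only its own subset of features, and a coordinator keeps the best candidate, which is why the chosen split matches the centralized one.

```python
from typing import List, Optional, Sequence, Tuple

Split = Tuple[float, int, float]  # (gain, feature index, threshold)


def best_split_for_features(
    X: Sequence[Sequence[float]],
    y: Sequence[float],
    features: Sequence[int],
) -> Optional[Split]:
    """Scan the given features and return the split with the largest
    squared-error reduction, or None if no valid split exists."""
    n = len(y)
    total = sum(y)
    best: Optional[Split] = None
    for f in features:
        order = sorted(range(n), key=lambda i: X[i][f])
        left_sum = 0.0
        for pos in range(n - 1):
            i, j = order[pos], order[pos + 1]
            left_sum += y[i]
            if X[i][f] == X[j][f]:
                continue  # cannot split between equal feature values
            n_left = pos + 1
            n_right = n - n_left
            right_sum = total - left_sum
            # Reduction in squared error obtained by splitting here.
            gain = (left_sum ** 2 / n_left
                    + right_sum ** 2 / n_right
                    - total ** 2 / n)
            threshold = (X[i][f] + X[j][f]) / 2.0
            if best is None or gain > best[0]:
                best = (gain, f, threshold)
    return best


def distributed_best_split(
    X: Sequence[Sequence[float]],
    y: Sequence[float],
    n_workers: int,
) -> Optional[Split]:
    """Assign features to workers round-robin, let each worker find its
    local best split, then pick the global best on the coordinator."""
    n_features = len(X[0])
    partitions: List[List[int]] = [
        list(range(w, n_features, n_workers)) for w in range(n_workers)
    ]
    # Assumption: on a real cluster each call below would run on a
    # separate machine; here the workers are simulated one after another.
    local = [best_split_for_features(X, y, part) for part in partitions]
    candidates = [s for s in local if s is not None]
    if not candidates:
        return None
    # Highest gain wins; ties go to the lower feature index, matching
    # the order in which a single machine would scan the features.
    return max(candidates, key=lambda s: (s[0], -s[1]))
```

Only one small candidate tuple per worker reaches the coordinator, so in this sketch the communication cost per vertex does not grow with the number of training examples.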
