{"title":"技术视角:通过压缩线性代数扩展机器学习","authors":"Z. Ives","doi":"10.1145/3093754.3093764","DOIUrl":null,"url":null,"abstract":"Demand for more powerful “big data analytics” solutions has spurred a great deal of interest in the core programming models, abstractions, and platforms for next-generation systems. For these problems, a complete solution would address data wrangling and processing, and support analytics over data of any modality or scale. It would support a wide array of machine learning algorithms, but also provide primitives for building new ones. It should be customizable, scale to vast volumes of data, and map to modern multicore, GPU, co-processor, and compute cluster hardware. In pursuit of these goals, novel techniques and solutions are being developed by machine learning researchers (e.g., high-performance libraries like Theano [6], runtime systems like GraphLab [5]), in the database and distributed systems research communities (e.g., distributed data analytics engines like Spark [7] and Flink [3]), and in industry by major technology players (e.g., Google’s TensorFlow [1] and IBM/Apache’s SystemML [4]). These libraries and platforms support multiple development languages, provide abstract datatypes for machine learning over data, and include compilers and runtime systems optimized for distributed execution on modern hardware. The database community excels in developing techniques for cost-estimating and optimizing declarative programs, and in exploiting data independence to optimize data placement and layout for performance. Elgohary et al’s work on “Scaling Machine Learning via Compressed Linear Algebra,”which appeared in the Proceedings of the VLDB Endowment [2], was conducted within IBM and Apache’s SystemML declarative machine learning project. It shows just how e↵ective such database techniques can be in a machine learning setting. The authors observe that the core data objects in machine learning – feature matrices, weight vectors – tend to have repeated values as well as regular structure, and may be quite large. Machine learning tasks over such data are composed from lower-level linear algebra operations. Such operations generally involve repeated floating-point computation that today are bandwidth-limited, by the ability of the CPU to traverse large matrices in RAM. The authors’ solution is to develop a compressed representation for matrices, as well as compressed linear algebra operations that work directly over the compressed matrix data. Together, these reduce the bandwidth required to perform the same computations, thus speeding performance dramatically. The paper cleverly adapts ideas first developed in relational database systems — column-oriented compression, sampling-based cost estimation, trading between compression speed and compression rate — to build an elegant implementation. The paper makes a number of key contributions. First, the authors identify a set of linear algebra primitives shared by multiple distributed machine learning platforms and algorithms. Second, they develop compression techniques not only for single columns in a matrix, but also “column grouping” techniques that capitalize on correlations between columns. They show that o↵set lists and run-length encoding o↵er a good set of trade-o↵s between e ciency and performance. Third, the paper develops cache-conscious algorithms for matrix multiplication and other operations. Finally, the paper shows how to estimate the sizes of compressed matrices and to choose an e↵ective compression strategy. 
Together, these techniques illustrate how database systems concepts can be adapted to great benefit in the machine learning space.","PeriodicalId":21740,"journal":{"name":"SIGMOD Rec.","volume":"69 1","pages":"41"},"PeriodicalIF":0.0000,"publicationDate":"2017-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Technical Perspective: Scaling Machine Learning via Compressed Linear Algebra\",\"authors\":\"Z. Ives\",\"doi\":\"10.1145/3093754.3093764\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Demand for more powerful “big data analytics” solutions has spurred a great deal of interest in the core programming models, abstractions, and platforms for next-generation systems. For these problems, a complete solution would address data wrangling and processing, and support analytics over data of any modality or scale. It would support a wide array of machine learning algorithms, but also provide primitives for building new ones. It should be customizable, scale to vast volumes of data, and map to modern multicore, GPU, co-processor, and compute cluster hardware. In pursuit of these goals, novel techniques and solutions are being developed by machine learning researchers (e.g., high-performance libraries like Theano [6], runtime systems like GraphLab [5]), in the database and distributed systems research communities (e.g., distributed data analytics engines like Spark [7] and Flink [3]), and in industry by major technology players (e.g., Google’s TensorFlow [1] and IBM/Apache’s SystemML [4]). These libraries and platforms support multiple development languages, provide abstract datatypes for machine learning over data, and include compilers and runtime systems optimized for distributed execution on modern hardware. The database community excels in developing techniques for cost-estimating and optimizing declarative programs, and in exploiting data independence to optimize data placement and layout for performance. Elgohary et al’s work on “Scaling Machine Learning via Compressed Linear Algebra,”which appeared in the Proceedings of the VLDB Endowment [2], was conducted within IBM and Apache’s SystemML declarative machine learning project. It shows just how e↵ective such database techniques can be in a machine learning setting. The authors observe that the core data objects in machine learning – feature matrices, weight vectors – tend to have repeated values as well as regular structure, and may be quite large. Machine learning tasks over such data are composed from lower-level linear algebra operations. Such operations generally involve repeated floating-point computation that today are bandwidth-limited, by the ability of the CPU to traverse large matrices in RAM. The authors’ solution is to develop a compressed representation for matrices, as well as compressed linear algebra operations that work directly over the compressed matrix data. Together, these reduce the bandwidth required to perform the same computations, thus speeding performance dramatically. The paper cleverly adapts ideas first developed in relational database systems — column-oriented compression, sampling-based cost estimation, trading between compression speed and compression rate — to build an elegant implementation. The paper makes a number of key contributions. First, the authors identify a set of linear algebra primitives shared by multiple distributed machine learning platforms and algorithms. 
Second, they develop compression techniques not only for single columns in a matrix, but also “column grouping” techniques that capitalize on correlations between columns. They show that o↵set lists and run-length encoding o↵er a good set of trade-o↵s between e ciency and performance. Third, the paper develops cache-conscious algorithms for matrix multiplication and other operations. Finally, the paper shows how to estimate the sizes of compressed matrices and to choose an e↵ective compression strategy. Together, these techniques illustrate how database systems concepts can be adapted to great benefit in the machine learning space.\",\"PeriodicalId\":21740,\"journal\":{\"name\":\"SIGMOD Rec.\",\"volume\":\"69 1\",\"pages\":\"41\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-05-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"SIGMOD Rec.\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3093754.3093764\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"SIGMOD Rec.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3093754.3093764","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Technical Perspective: Scaling Machine Learning via Compressed Linear Algebra
Demand for more powerful “big data analytics” solutions has spurred a great deal of interest in the core programming models, abstractions, and platforms for next-generation systems. For these problems, a complete solution would address data wrangling and processing, and support analytics over data of any modality or scale. It would support a wide array of machine learning algorithms, but also provide primitives for building new ones. It should be customizable, scale to vast volumes of data, and map to modern multicore, GPU, co-processor, and compute cluster hardware. In pursuit of these goals, novel techniques and solutions are being developed by machine learning researchers (e.g., high-performance libraries like Theano [6], runtime systems like GraphLab [5]), in the database and distributed systems research communities (e.g., distributed data analytics engines like Spark [7] and Flink [3]), and in industry by major technology players (e.g., Google’s TensorFlow [1] and IBM/Apache’s SystemML [4]). These libraries and platforms support multiple development languages, provide abstract datatypes for machine learning over data, and include compilers and runtime systems optimized for distributed execution on modern hardware.

The database community excels in developing techniques for cost-estimating and optimizing declarative programs, and in exploiting data independence to optimize data placement and layout for performance. Elgohary et al.’s work on “Scaling Machine Learning via Compressed Linear Algebra,” which appeared in the Proceedings of the VLDB Endowment [2], was conducted within IBM and Apache’s SystemML declarative machine learning project. It shows just how effective such database techniques can be in a machine learning setting. The authors observe that the core data objects in machine learning – feature matrices, weight vectors – tend to have repeated values as well as regular structure, and may be quite large. Machine learning tasks over such data are composed of lower-level linear algebra operations. Such operations generally involve repeated floating-point computation that today is bandwidth-limited: performance is bounded by how quickly the CPU can traverse large matrices in RAM. The authors’ solution is to develop a compressed representation for matrices, as well as compressed linear algebra operations that work directly over the compressed matrix data. Together, these reduce the bandwidth required to perform the same computations, thus speeding performance dramatically. The paper cleverly adapts ideas first developed in relational database systems — column-oriented compression, sampling-based cost estimation, trading off compression speed against compression ratio — to build an elegant implementation.

The paper makes a number of key contributions. First, the authors identify a set of linear algebra primitives shared by multiple distributed machine learning platforms and algorithms. Second, they develop compression techniques not only for single columns in a matrix, but also “column grouping” techniques that capitalize on correlations between columns. They show that offset lists and run-length encoding offer a good set of trade-offs between efficiency and performance. Third, the paper develops cache-conscious algorithms for matrix multiplication and other operations. Finally, the paper shows how to estimate the sizes of compressed matrices and how to choose an effective compression strategy.
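To make the core idea concrete, the following is a minimal sketch (in Python/NumPy, not SystemML’s actual code) of a matrix–vector product computed directly over run-length-encoded columns: each distinct value in a column is multiplied by its vector entry once and scattered to the rows its run covers, so long runs of repeated values cost a single multiplication rather than one per row. The names (rle_encode, rle_matvec) and the toy matrix are purely illustrative; the paper’s actual scheme additionally uses column grouping over correlated columns, offset lists, and cache-conscious blocking.

```python
import numpy as np

def rle_encode(col):
    """Run-length encode one column as a list of (value, run_length) pairs."""
    runs = []
    i, n = 0, len(col)
    while i < n:
        j = i
        while j + 1 < n and col[j + 1] == col[i]:
            j += 1
        runs.append((col[i], j - i + 1))
        i = j + 1
    return runs

def rle_matvec(rle_cols, x):
    """Compute y = A @ x where A is stored column-wise as RLE runs.

    Each run contributes value * x[j] to a contiguous block of rows,
    so repeated values are multiplied once, and zero runs are skipped.
    """
    n_rows = sum(length for _, length in rle_cols[0])
    y = np.zeros(n_rows)
    for j, runs in enumerate(rle_cols):
        row = 0
        for value, length in runs:
            if value != 0.0:                 # skip runs of zeros entirely
                y[row:row + length] += value * x[j]
            row += length
    return y

# A small matrix whose columns have long runs of repeated values.
A = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [1.0, 2.0],
              [3.0, 2.0]])
x = np.array([0.5, 4.0])

rle_cols = [rle_encode(A[:, j]) for j in range(A.shape[1])]
assert np.allclose(rle_matvec(rle_cols, x), A @ x)
```

Even in this toy form, the compressed columns are traversed sequentially and only the distinct runs touch memory, which is the bandwidth saving the authors exploit; the paper’s column-grouping techniques go further by letting one encoded structure cover several correlated columns at once.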
Together, these techniques illustrate how database systems concepts can be adapted to great benefit in the machine learning space.