{"title":"技术视角:通过压缩线性代数扩展机器学习","authors":"Z. Ives","doi":"10.1145/3093754.3093764","DOIUrl":null,"url":null,"abstract":"Demand for more powerful “big data analytics” solutions has spurred a great deal of interest in the core programming models, abstractions, and platforms for next-generation systems. For these problems, a complete solution would address data wrangling and processing, and support analytics over data of any modality or scale. It would support a wide array of machine learning algorithms, but also provide primitives for building new ones. It should be customizable, scale to vast volumes of data, and map to modern multicore, GPU, co-processor, and compute cluster hardware. In pursuit of these goals, novel techniques and solutions are being developed by machine learning researchers (e.g., high-performance libraries like Theano [6], runtime systems like GraphLab [5]), in the database and distributed systems research communities (e.g., distributed data analytics engines like Spark [7] and Flink [3]), and in industry by major technology players (e.g., Google’s TensorFlow [1] and IBM/Apache’s SystemML [4]). These libraries and platforms support multiple development languages, provide abstract datatypes for machine learning over data, and include compilers and runtime systems optimized for distributed execution on modern hardware. The database community excels in developing techniques for cost-estimating and optimizing declarative programs, and in exploiting data independence to optimize data placement and layout for performance. Elgohary et al’s work on “Scaling Machine Learning via Compressed Linear Algebra,”which appeared in the Proceedings of the VLDB Endowment [2], was conducted within IBM and Apache’s SystemML declarative machine learning project. It shows just how e↵ective such database techniques can be in a machine learning setting. The authors observe that the core data objects in machine learning – feature matrices, weight vectors – tend to have repeated values as well as regular structure, and may be quite large. Machine learning tasks over such data are composed from lower-level linear algebra operations. Such operations generally involve repeated floating-point computation that today are bandwidth-limited, by the ability of the CPU to traverse large matrices in RAM. The authors’ solution is to develop a compressed representation for matrices, as well as compressed linear algebra operations that work directly over the compressed matrix data. Together, these reduce the bandwidth required to perform the same computations, thus speeding performance dramatically. The paper cleverly adapts ideas first developed in relational database systems — column-oriented compression, sampling-based cost estimation, trading between compression speed and compression rate — to build an elegant implementation. The paper makes a number of key contributions. First, the authors identify a set of linear algebra primitives shared by multiple distributed machine learning platforms and algorithms. Second, they develop compression techniques not only for single columns in a matrix, but also “column grouping” techniques that capitalize on correlations between columns. They show that o↵set lists and run-length encoding o↵er a good set of trade-o↵s between e ciency and performance. Third, the paper develops cache-conscious algorithms for matrix multiplication and other operations. Finally, the paper shows how to estimate the sizes of compressed matrices and to choose an e↵ective compression strategy. 
Together, these techniques illustrate how database systems concepts can be adapted to great benefit in the machine learning space.","PeriodicalId":21740,"journal":{"name":"SIGMOD Rec.","volume":"69 1","pages":"41"},"PeriodicalIF":0.0000,"publicationDate":"2017-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Technical Perspective: Scaling Machine Learning via Compressed Linear Algebra\",\"authors\":\"Z. Ives\",\"doi\":\"10.1145/3093754.3093764\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Demand for more powerful “big data analytics” solutions has spurred a great deal of interest in the core programming models, abstractions, and platforms for next-generation systems. For these problems, a complete solution would address data wrangling and processing, and support analytics over data of any modality or scale. It would support a wide array of machine learning algorithms, but also provide primitives for building new ones. It should be customizable, scale to vast volumes of data, and map to modern multicore, GPU, co-processor, and compute cluster hardware. In pursuit of these goals, novel techniques and solutions are being developed by machine learning researchers (e.g., high-performance libraries like Theano [6], runtime systems like GraphLab [5]), in the database and distributed systems research communities (e.g., distributed data analytics engines like Spark [7] and Flink [3]), and in industry by major technology players (e.g., Google’s TensorFlow [1] and IBM/Apache’s SystemML [4]). These libraries and platforms support multiple development languages, provide abstract datatypes for machine learning over data, and include compilers and runtime systems optimized for distributed execution on modern hardware. The database community excels in developing techniques for cost-estimating and optimizing declarative programs, and in exploiting data independence to optimize data placement and layout for performance. Elgohary et al’s work on “Scaling Machine Learning via Compressed Linear Algebra,”which appeared in the Proceedings of the VLDB Endowment [2], was conducted within IBM and Apache’s SystemML declarative machine learning project. It shows just how e↵ective such database techniques can be in a machine learning setting. The authors observe that the core data objects in machine learning – feature matrices, weight vectors – tend to have repeated values as well as regular structure, and may be quite large. Machine learning tasks over such data are composed from lower-level linear algebra operations. Such operations generally involve repeated floating-point computation that today are bandwidth-limited, by the ability of the CPU to traverse large matrices in RAM. The authors’ solution is to develop a compressed representation for matrices, as well as compressed linear algebra operations that work directly over the compressed matrix data. Together, these reduce the bandwidth required to perform the same computations, thus speeding performance dramatically. The paper cleverly adapts ideas first developed in relational database systems — column-oriented compression, sampling-based cost estimation, trading between compression speed and compression rate — to build an elegant implementation. The paper makes a number of key contributions. First, the authors identify a set of linear algebra primitives shared by multiple distributed machine learning platforms and algorithms. 
Second, they develop compression techniques not only for single columns in a matrix, but also “column grouping” techniques that capitalize on correlations between columns. They show that o↵set lists and run-length encoding o↵er a good set of trade-o↵s between e ciency and performance. Third, the paper develops cache-conscious algorithms for matrix multiplication and other operations. Finally, the paper shows how to estimate the sizes of compressed matrices and to choose an e↵ective compression strategy. Together, these techniques illustrate how database systems concepts can be adapted to great benefit in the machine learning space.\",\"PeriodicalId\":21740,\"journal\":{\"name\":\"SIGMOD Rec.\",\"volume\":\"69 1\",\"pages\":\"41\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-05-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"SIGMOD Rec.\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3093754.3093764\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"SIGMOD Rec.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3093754.3093764","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Technical Perspective: Scaling Machine Learning via Compressed Linear Algebra
Demand for more powerful “big data analytics” solutions has spurred a great deal of interest in the core programming models, abstractions, and platforms for next-generation systems. For these problems, a complete solution would address data wrangling and processing, and support analytics over data of any modality or scale. It would support a wide array of machine learning algorithms, but also provide primitives for building new ones. It should be customizable, scale to vast volumes of data, and map to modern multicore, GPU, co-processor, and compute cluster hardware. In pursuit of these goals, novel techniques and solutions are being developed by machine learning researchers (e.g., high-performance libraries like Theano [6], runtime systems like GraphLab [5]), in the database and distributed systems research communities (e.g., distributed data analytics engines like Spark [7] and Flink [3]), and in industry by major technology players (e.g., Google’s TensorFlow [1] and IBM/Apache’s SystemML [4]). These libraries and platforms support multiple development languages, provide abstract datatypes for machine learning over data, and include compilers and runtime systems optimized for distributed execution on modern hardware.

The database community excels in developing techniques for cost-estimating and optimizing declarative programs, and in exploiting data independence to optimize data placement and layout for performance. Elgohary et al.’s work on “Scaling Machine Learning via Compressed Linear Algebra,” which appeared in the Proceedings of the VLDB Endowment [2], was conducted within IBM and Apache’s SystemML declarative machine learning project. It shows just how effective such database techniques can be in a machine learning setting. The authors observe that the core data objects in machine learning – feature matrices, weight vectors – tend to have repeated values as well as regular structure, and may be quite large. Machine learning tasks over such data are composed of lower-level linear algebra operations. Such operations generally involve repeated floating-point computation that today is bandwidth-limited: performance is bounded by how quickly the CPU can traverse large matrices in RAM. The authors’ solution is to develop a compressed representation for matrices, as well as compressed linear algebra operations that work directly over the compressed matrix data. Together, these reduce the bandwidth required to perform the same computations, thus speeding performance dramatically. The paper cleverly adapts ideas first developed in relational database systems — column-oriented compression, sampling-based cost estimation, trading off compression speed against compression ratio — to build an elegant implementation.

The paper makes a number of key contributions. First, the authors identify a set of linear algebra primitives shared by multiple distributed machine learning platforms and algorithms. Second, they develop compression techniques not only for single columns in a matrix, but also “column grouping” techniques that capitalize on correlations between columns. They show that offset lists and run-length encoding offer a good set of trade-offs between efficiency and performance. Third, the paper develops cache-conscious algorithms for matrix multiplication and other operations. Finally, the paper shows how to estimate the sizes of compressed matrices and how to choose an effective compression strategy.
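To make the core idea concrete, the following is a minimal sketch (in Python/NumPy, not SystemML’s actual code) of a matrix–vector product computed directly over run-length-encoded columns: each distinct value in a column is multiplied by its vector entry once and scattered to the rows its run covers, so long runs of repeated values cost a single multiplication rather than one per row. The names (rle_encode, rle_matvec) and the toy matrix are purely illustrative; the paper’s actual scheme additionally uses column grouping over correlated columns, offset lists, and cache-conscious blocking.

```python
import numpy as np

def rle_encode(col):
    """Run-length encode one column as a list of (value, run_length) pairs."""
    runs = []
    i, n = 0, len(col)
    while i < n:
        j = i
        while j + 1 < n and col[j + 1] == col[i]:
            j += 1
        runs.append((col[i], j - i + 1))
        i = j + 1
    return runs

def rle_matvec(rle_cols, x):
    """Compute y = A @ x where A is stored column-wise as RLE runs.

    Each run contributes value * x[j] to a contiguous block of rows,
    so repeated values are multiplied once, and zero runs are skipped.
    """
    n_rows = sum(length for _, length in rle_cols[0])
    y = np.zeros(n_rows)
    for j, runs in enumerate(rle_cols):
        row = 0
        for value, length in runs:
            if value != 0.0:                 # skip runs of zeros entirely
                y[row:row + length] += value * x[j]
            row += length
    return y

# A small matrix whose columns have long runs of repeated values.
A = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [1.0, 2.0],
              [3.0, 2.0]])
x = np.array([0.5, 4.0])

rle_cols = [rle_encode(A[:, j]) for j in range(A.shape[1])]
assert np.allclose(rle_matvec(rle_cols, x), A @ x)
```

Even in this toy form, the compressed columns are traversed sequentially and only the distinct runs touch memory, which is the bandwidth saving the authors exploit; the paper’s column-grouping techniques go further by letting one encoded structure cover several correlated columns at once.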
Together, these techniques illustrate how database systems concepts can be adapted to great benefit in the machine learning space.