Distributed High-Dimension Matrix Operation Optimization on Spark

Qi She, Jingwei Zhang, Ya Zhou, Qing Yang, Mingfei Qin

2019 Eleventh International Conference on Advanced Computational Intelligence (ICACI), June 2019. DOI: 10.1109/ICACI.2019.8778546
In the era of big data, extracting valuable information from massive data sets has drawn increasing attention from industry, academia, and government. Mining such data relies on algorithms such as principal component analysis, regression, and clustering, which in turn depend on large-scale matrix operations. When the matrix dimension is very large, these operations become difficult to perform on a single machine, but distributed methods can effectively address the scalability and computational-complexity problems posed by high-dimensional matrices. On the distributed platform Spark, we propose a distributed matrix-operation execution strategy, RPMM, which improves matrix computing concurrency while reducing the overhead of data shuffling. In addition, a locality-sensitive hashing algorithm is introduced to speed up row-vector similarity computation. Compared with matrix operations on a single machine, these distributed matrix operations effectively solve the scalability problem of large matrix operations.
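The record does not include the paper's RPMM implementation, so the sketch below is not the authors' method; it only illustrates the general pattern the abstract refers to, namely distributed matrix multiplication on Spark, using MLlib's stock BlockMatrix API. The matrices, block size, and local master setting are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.linalg.distributed.{BlockMatrix, CoordinateMatrix, MatrixEntry}

object BlockMultiplySketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; a real job would run on a cluster.
    val spark = SparkSession.builder()
      .appName("block-matrix-multiply")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Two tiny matrices given as (row, col, value) entries; in practice
    // these would be loaded from distributed storage.
    val entriesA = sc.parallelize(Seq(
      MatrixEntry(0, 0, 1.0), MatrixEntry(0, 1, 2.0),
      MatrixEntry(1, 0, 3.0), MatrixEntry(1, 1, 4.0)))
    val entriesB = sc.parallelize(Seq(
      MatrixEntry(0, 0, 5.0), MatrixEntry(0, 1, 6.0),
      MatrixEntry(1, 0, 7.0), MatrixEntry(1, 1, 8.0)))

    // Partition each matrix into blocks; the block size (an assumption here)
    // controls both per-task work and how much data the multiply shuffles.
    val blockSize = 1024
    val a: BlockMatrix = new CoordinateMatrix(entriesA).toBlockMatrix(blockSize, blockSize).cache()
    val b: BlockMatrix = new CoordinateMatrix(entriesB).toBlockMatrix(blockSize, blockSize).cache()

    // Distributed block-wise multiplication: matching block pairs are
    // co-located by a shuffle, multiplied locally, and partial products summed.
    val c: BlockMatrix = a.multiply(b)
    println(c.toLocalMatrix())

    spark.stop()
  }
}
```

For the row-vector similarity part, Spark ML ships a locality-sensitive hashing estimator. The following sketch uses BucketedRandomProjectionLSH (Euclidean-distance LSH) as a stand-in for the LSH scheme mentioned in the abstract; the sample rows, bucket length, and distance threshold are assumptions made for illustration.

```scala
import org.apache.spark.ml.feature.BucketedRandomProjectionLSH
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object RowSimilarityLshSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("lsh-row-similarity")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Matrix rows represented as feature vectors (illustrative data).
    val rows = Seq(
      (0L, Vectors.dense(1.0, 1.0, 0.0)),
      (1L, Vectors.dense(1.0, 0.9, 0.1)),
      (2L, Vectors.dense(0.0, 1.0, 1.0))
    ).toDF("id", "features")

    // Random-projection LSH: similar rows tend to land in the same bucket,
    // so candidate pairs are generated without an all-pairs comparison.
    val lsh = new BucketedRandomProjectionLSH()
      .setInputCol("features")
      .setOutputCol("hashes")
      .setBucketLength(2.0)
      .setNumHashTables(3)
    val model = lsh.fit(rows)

    // Approximate self-join: keep row pairs whose Euclidean distance < 1.0,
    // dropping duplicate and self pairs.
    model.approxSimilarityJoin(rows, rows, 1.0, "euclideanDistance")
      .filter($"datasetA.id" < $"datasetB.id")
      .show()

    spark.stop()
  }
}
```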