dislib: Large Scale High Performance Machine Learning in Python

J. Á. Cid-Fuentes, S. Solà, Pol Álvarez, A. Castro-Ginard, Rosa M. Badia
2019 15th International Conference on eScience (eScience), published 2019-09-01
DOI: 10.1109/eScience.2019.00018 (https://doi.org/10.1109/eScience.2019.00018)
Citations: 17

Abstract

In recent years, machine learning has proven to be an extremely useful tool for extracting knowledge from data. This can be leveraged in numerous research areas, such as genomics, earth sciences, and astrophysics, to gain valuable insight. At the same time, Python has become one of the most popular programming languages among researchers due to its high productivity and rich ecosystem. Unfortunately, existing machine learning libraries for Python do not scale to large data sets, are hard for non-experts to use, and are difficult to set up in high performance computing clusters. These limitations have prevented scientists from exploiting the full potential of machine learning in their research. In this paper, we present and evaluate dislib, a distributed machine learning library built on top of the PyCOMPSs programming model that addresses the issues of other existing libraries. In our evaluation, we show that dislib can be up to 9 times faster, and can process data sets up to 16 times larger, than other popular distributed machine learning libraries, such as MLlib. We also show how dislib can be used to reduce the computation time of a real scientific application from 18 hours to 17 minutes.