Parallelization of Data Science Tasks, an Experimental Overview

Oscar Castro, P. Bruneau, Jean-Sébastien Sottet, Dario Torregrossa
{"title":"Parallelization of Data Science Tasks, an Experimental Overview","authors":"Oscar Castro, P. Bruneau, Jean-Sébastien Sottet, Dario Torregrossa","doi":"10.1145/3581807.3581878","DOIUrl":null,"url":null,"abstract":"The practice of data science and machine learning often involves training many kinds of models, for inferring some target variable, or extracting structured knowledge from data. Training procedures generally require lengthy and intensive computations, so a natural step for data scientists is to try to accelerate these procedures, typically through parallelization as supported by multiple CPU cores and GPU devices. In this paper, we focus on Python libraries commonly used by machine learning practitioners, and propose a case-based experimental approach to overview mainstream tools for software acceleration. For each use case, we highlight and quantify the optimizations from the baseline implementations to the optimized versions. Finally, we draw a taxonomy of the tools and techniques involved in our experiments, and identify common pitfalls, in view to provide actionable guidelines to data scientists and code optimization tools developers.","PeriodicalId":292813,"journal":{"name":"Proceedings of the 2022 11th International Conference on Computing and Pattern Recognition","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2022 11th International Conference on Computing and Pattern Recognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3581807.3581878","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

The practice of data science and machine learning often involves training many kinds of models, for inferring some target variable or extracting structured knowledge from data. Training procedures generally require lengthy and intensive computations, so a natural step for data scientists is to try to accelerate these procedures, typically through parallelization as supported by multiple CPU cores and GPU devices. In this paper, we focus on Python libraries commonly used by machine learning practitioners, and propose a case-based experimental approach to survey mainstream tools for software acceleration. For each use case, we highlight and quantify the optimizations from the baseline implementations to the optimized versions. Finally, we draw a taxonomy of the tools and techniques involved in our experiments, and identify common pitfalls, with a view to providing actionable guidelines to data scientists and developers of code optimization tools.
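
The paper's experiments cover several Python use cases; as a minimal sketch of the kind of baseline-versus-parallel comparison it describes (not code from the paper), the snippet below times a scikit-learn random forest fit on a single core and then on all available CPU cores. The dataset size, model settings, and function names are illustrative assumptions.

```python
# Minimal sketch (not from the paper): comparing a single-core baseline
# with a multi-core variant of the same training task, using scikit-learn's
# built-in n_jobs parallelism. Dataset size and model settings are arbitrary.
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic classification data standing in for a real workload.
X, y = make_classification(n_samples=20_000, n_features=50, random_state=0)

def timed_fit(n_jobs):
    """Train a random forest with the given CPU parallelism; return elapsed seconds."""
    model = RandomForestClassifier(n_estimators=200, n_jobs=n_jobs, random_state=0)
    start = time.perf_counter()
    model.fit(X, y)
    return time.perf_counter() - start

baseline = timed_fit(n_jobs=1)   # single-core baseline implementation
parallel = timed_fit(n_jobs=-1)  # optimized version: use all available cores
print(f"baseline: {baseline:.2f}s, parallel: {parallel:.2f}s, "
      f"speedup: {baseline / parallel:.1f}x")
```

Reporting the ratio of the two timings mirrors how the paper quantifies the gain from a baseline implementation to its optimized version.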