Input selection for fast feature engineering

2016 IEEE 32nd International Conference on Data Engineering (ICDE) Pub Date : 2016-05-16 DOI:10.1109/ICDE.2016.7498272

Michael R. Anderson, Michael J. Cafarella

Pages: 577-588

Citations: 39

Abstract

The application of machine learning to large datasets has become a vital component of many important and sophisticated software systems built today. Such trained systems are often based on supervised learning tasks that require features: signals extracted from the data that distill complicated raw data objects into a small number of salient values. A trained system's success depends substantially on the quality of its features. Unfortunately, feature engineering (the process of writing code that takes raw data objects as input and outputs feature vectors suitable for a machine learning algorithm) is a tedious, time-consuming experience. Because “big data” inputs are so diverse, feature engineering is often a trial-and-error process requiring many small, iterative code changes. Because the inputs are so large, each code change can involve a time-consuming data processing task (over each page in a Web crawl, for example). We introduce Zombie, a data-centric system that accelerates feature engineering through intelligent input selection, optimizing the “inner loop” of the feature engineering process. Our system yields feature evaluation speedups of up to 8× in some cases and reduces engineer wait times from 8 to 5 hours in others.
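
The abstract's definition of feature engineering, code that maps raw data objects to feature vectors, can be illustrated with a minimal sketch. Everything named here (`extract_features`, `featurize`, the keyword, the toy pages) is hypothetical and chosen only for illustration. The abstract does not say how Zombie selects inputs, so the subset below is drawn by plain uniform random sampling as a stand-in, not as Zombie's method.

```python
import random
import re


def extract_features(page_text):
    """Hypothetical feature function: one raw page in, one small feature vector out."""
    words = re.findall(r"[a-z']+", page_text.lower())
    return [
        len(words),                        # number of alphabetic words
        sum(w == "price" for w in words),  # occurrences of an illustrative keyword
        float("$" in page_text),           # 1.0 if a dollar sign appears, else 0.0
    ]


def featurize(pages, sample_size=None, seed=0):
    """Apply the feature function to every page, or to a uniform random subset.

    Uniform sampling is an assumption made for this sketch only.
    """
    if sample_size is not None and sample_size < len(pages):
        pages = random.Random(seed).sample(pages, sample_size)
    return [extract_features(p) for p in pages]


# Toy stand-in for a large input collection such as a Web crawl.
pages = [
    "Price: $10 for the blue widget",
    "Contact us for more information",
    "The price of the red widget is $12",
    "About our company",
]

full = featurize(pages)                   # one pass over every input
subset = featurize(pages, sample_size=2)  # cheaper pass over fewer inputs
```

Each edit to `extract_features` forces another pass over the inputs, which is the "inner loop" cost the abstract describes. Running on fewer inputs shortens that pass, and which inputs to keep is the question the paper addresses.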
