Input selection for fast feature engineering

Michael R. Anderson, Michael J. Cafarella
{"title":"Input selection for fast feature engineering","authors":"Michael R. Anderson, Michael J. Cafarella","doi":"10.1109/ICDE.2016.7498272","DOIUrl":null,"url":null,"abstract":"The application of machine learning to large datasets has become a vital component of many important and sophisticated software systems built today. Such trained systems are often based on supervised learning tasks that require features, signals extracted from the data that distill complicated raw data objects into a small number of salient values. A trained system's success depends substantially on the quality of its features. Unfortunately, feature engineering-the process of writing code that takes raw data objects as input and outputs feature vectors suitable for a machine learning algorithm-is a tedious, time-consuming experience. Because “big data” inputs are so diverse, feature engineering is often a trial-and-error process requiring many small, iterative code changes. Because the inputs are so large, each code change can involve a time-consuming data processing task (over each page in a Web crawl, for example). We introduce Zombie, a data-centric system that accelerates feature engineering through intelligent input selection, optimizing the “inner loop” of the feature engineering process. Our system yields feature evaluation speedups of up to 8× in some cases and reduces engineer wait times from 8 to 5 hours in others.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"16 1","pages":"577-588"},"PeriodicalIF":0.0000,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"39","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDE.2016.7498272","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 39

Abstract

The application of machine learning to large datasets has become a vital component of many important and sophisticated software systems built today. Such trained systems are often based on supervised learning tasks that require features, signals extracted from the data that distill complicated raw data objects into a small number of salient values. A trained system's success depends substantially on the quality of its features. Unfortunately, feature engineering-the process of writing code that takes raw data objects as input and outputs feature vectors suitable for a machine learning algorithm-is a tedious, time-consuming experience. Because “big data” inputs are so diverse, feature engineering is often a trial-and-error process requiring many small, iterative code changes. Because the inputs are so large, each code change can involve a time-consuming data processing task (over each page in a Web crawl, for example). We introduce Zombie, a data-centric system that accelerates feature engineering through intelligent input selection, optimizing the “inner loop” of the feature engineering process. Our system yields feature evaluation speedups of up to 8× in some cases and reduces engineer wait times from 8 to 5 hours in others.
快速特征工程的输入选择
机器学习在大数据集上的应用已经成为当今许多重要而复杂的软件系统的重要组成部分。这样的训练系统通常基于监督学习任务,这些任务需要从数据中提取特征和信号,将复杂的原始数据对象提炼成少量显著值。经过训练的系统的成功很大程度上取决于其特征的质量。不幸的是,特征工程——编写将原始数据对象作为输入和输出适合机器学习算法的特征向量的代码的过程——是一种冗长而耗时的体验。因为“大数据”的输入是如此多样化,特征工程通常是一个试错过程,需要许多小的、迭代的代码更改。由于输入非常大,每次代码更改都可能涉及耗时的数据处理任务(例如,对Web抓取中的每个页面进行处理)。我们介绍了Zombie,一个以数据为中心的系统,通过智能输入选择加速特征工程,优化特征工程过程的“内循环”。我们的系统在某些情况下可以产生高达8倍的特征评估速度,并在其他情况下将工程师的等待时间从8小时减少到5小时。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信