用于深度学习和大数据应用的自动数据处理和特征工程：一项调查

Journal of Information and Intelligence Pub Date : 2024-01-08 DOI:10.1016/j.jiixd.2024.01.002

Alhassan Mumuni , Fuseini Mumuni

{"title":"用于深度学习和大数据应用的自动数据处理和特征工程：一项调查","authors":"Alhassan Mumuni , Fuseini Mumuni","doi":"10.1016/j.jiixd.2024.01.002","DOIUrl":null,"url":null,"abstract":"<div><div>Modern approach to artificial intelligence (AI) aims to design algorithms that learn directly from data. This approach has achieved impressive results and has contributed significantly to the progress of AI, particularly in the sphere of supervised deep learning. It has also simplified the design of machine learning systems as the learning process is highly automated. However, not all data processing tasks in conventional deep learning pipelines have been automated. In most cases data has to be manually collected, preprocessed and further extended through data augmentation before they can be effective for training. Recently, special techniques for automating these tasks have emerged. The automation of data processing tasks is driven by the need to utilize large volumes of complex, heterogeneous data for machine learning and big data applications. Today, end-to-end automated data processing systems based on automated machine learning (AutoML) techniques are capable of taking raw data and transforming them into useful features for big data tasks by automating all intermediate processing stages. In this work, we present a thorough review of approaches for automating data processing tasks in deep learning pipelines, including automated data preprocessing – e.g., data cleaning, labeling, missing data imputation, and categorical data encoding – as well as data augmentation (including synthetic data generation using generative AI methods) and feature engineering – specifically, automated feature extraction, feature construction and feature selection. In addition to automating specific data processing tasks, we discuss the use of AutoML methods and tools to simultaneously optimize all stages of the machine learning pipeline.</div></div>","PeriodicalId":100790,"journal":{"name":"Journal of Information and Intelligence","volume":"3 2","pages":"Pages 113-153"},"PeriodicalIF":0.0000,"publicationDate":"2024-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Automated data processing and feature engineering for deep learning and big data applications: A survey\",\"authors\":\"Alhassan Mumuni , Fuseini Mumuni\",\"doi\":\"10.1016/j.jiixd.2024.01.002\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Modern approach to artificial intelligence (AI) aims to design algorithms that learn directly from data. This approach has achieved impressive results and has contributed significantly to the progress of AI, particularly in the sphere of supervised deep learning. It has also simplified the design of machine learning systems as the learning process is highly automated. However, not all data processing tasks in conventional deep learning pipelines have been automated. In most cases data has to be manually collected, preprocessed and further extended through data augmentation before they can be effective for training. Recently, special techniques for automating these tasks have emerged. The automation of data processing tasks is driven by the need to utilize large volumes of complex, heterogeneous data for machine learning and big data applications. Today, end-to-end automated data processing systems based on automated machine learning (AutoML) techniques are capable of taking raw data and transforming them into useful features for big data tasks by automating all intermediate processing stages. In this work, we present a thorough review of approaches for automating data processing tasks in deep learning pipelines, including automated data preprocessing – e.g., data cleaning, labeling, missing data imputation, and categorical data encoding – as well as data augmentation (including synthetic data generation using generative AI methods) and feature engineering – specifically, automated feature extraction, feature construction and feature selection. In addition to automating specific data processing tasks, we discuss the use of AutoML methods and tools to simultaneously optimize all stages of the machine learning pipeline.</div></div>\",\"PeriodicalId\":100790,\"journal\":{\"name\":\"Journal of Information and Intelligence\",\"volume\":\"3 2\",\"pages\":\"Pages 113-153\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-01-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Information and Intelligence\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S2949715924000027\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Information and Intelligence","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2949715924000027","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

人工智能（AI）的现代方法旨在设计直接从数据中学习的算法。这种方法取得了令人印象深刻的结果，并对人工智能的进步做出了重大贡献，特别是在监督深度学习领域。它还简化了机器学习系统的设计，因为学习过程高度自动化。然而，并非传统深度学习管道中的所有数据处理任务都已实现自动化。在大多数情况下，数据必须手动收集、预处理，并通过数据增强进一步扩展，才能有效地用于训练。最近，出现了自动化这些任务的特殊技术。数据处理任务的自动化是由机器学习和大数据应用需要利用大量复杂的异构数据驱动的。如今，基于自动机器学习（AutoML）技术的端到端自动化数据处理系统能够通过自动化所有中间处理阶段，将原始数据转化为大数据任务的有用特征。在这项工作中，我们全面回顾了深度学习管道中自动化数据处理任务的方法，包括自动数据预处理-例如，数据清洗，标记，缺失数据输入和分类数据编码-以及数据增强（包括使用生成式人工智能方法生成合成数据）和特征工程-特别是自动特征提取，特征构建和特征选择。除了自动化特定的数据处理任务外，我们还讨论了使用AutoML方法和工具来同时优化机器学习管道的所有阶段。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Automated data processing and feature engineering for deep learning and big data applications: A survey

Modern approach to artificial intelligence (AI) aims to design algorithms that learn directly from data. This approach has achieved impressive results and has contributed significantly to the progress of AI, particularly in the sphere of supervised deep learning. It has also simplified the design of machine learning systems as the learning process is highly automated. However, not all data processing tasks in conventional deep learning pipelines have been automated. In most cases data has to be manually collected, preprocessed and further extended through data augmentation before they can be effective for training. Recently, special techniques for automating these tasks have emerged. The automation of data processing tasks is driven by the need to utilize large volumes of complex, heterogeneous data for machine learning and big data applications. Today, end-to-end automated data processing systems based on automated machine learning (AutoML) techniques are capable of taking raw data and transforming them into useful features for big data tasks by automating all intermediate processing stages. In this work, we present a thorough review of approaches for automating data processing tasks in deep learning pipelines, including automated data preprocessing – e.g., data cleaning, labeling, missing data imputation, and categorical data encoding – as well as data augmentation (including synthetic data generation using generative AI methods) and feature engineering – specifically, automated feature extraction, feature construction and feature selection. In addition to automating specific data processing tasks, we discuss the use of AutoML methods and tools to simultaneously optimize all stages of the machine learning pipeline.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Journal of Information and Intelligence

自引率

0.00%

发文量