An Auto-ML Approach Applied to Text Classification
Douglas Nunes de Oliveira, L. Merschmann
Proceedings of the Brazilian Symposium on Multimedia and the Web, published 2022-11-07
DOI: https://doi.org/10.1145/3539637.3557054
Citation count: 1
Abstract
Automated Machine Learning (AutoML) is a research area that aims to help humans solve Machine Learning (ML) problems by automatically discovering good model pipelines (the algorithms and hyperparameters for every stage of a machine learning process) for a given dataset. Because this is a combinatorial optimization problem in which evaluating every possible pipeline is infeasible, most AutoML systems use an Evolutionary Algorithm (EA) or Bayesian Optimization (BO) to find a good solution. These systems usually estimate pipeline performance with k-fold cross-validation, so the chance of selecting an overfitted solution grows with the number of pipelines evaluated. To mitigate this issue, we propose an Auto-ML system, the Auto-ML System for Text Classification (ASTeC), that uses Bootstrap Bias Corrected CV (BBC-CV) to evaluate pipeline performance. More specifically, the proposed system combines EA, BO, and BBC-CV to find a good model pipeline for the text classification task. We evaluate our proposal against two state-of-the-art systems: the Tree-based Pipeline Optimization Tool (TPOT) and the Google Cloud AutoML service. The evaluation uses seven public sentiment analysis datasets of written Brazilian Portuguese text. Statistical tests show that our system is equivalent to or better than both on all evaluated datasets.
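The selection bias described in the abstract can be illustrated with a minimal sketch of the general BBC-CV idea: bootstrap the pooled out-of-sample predictions, pick the best configuration on the in-bag samples, and score it on the out-of-bag samples. This is not ASTeC's implementation; the function names, the accuracy metric, and the input layout (one column of cross-validated predictions per candidate pipeline) are assumptions made for illustration.

```python
import numpy as np


def naive_best_accuracy(oos_pred, y):
    """Accuracy of the best configuration, measured on the same data used to pick it."""
    correct = np.asarray(oos_pred) == np.asarray(y)[:, None]
    return float(correct.mean(axis=0).max())


def bbc_cv_accuracy(oos_pred, y, n_boot=500, seed=0):
    """Bootstrap bias corrected estimate of the selected configuration's accuracy.

    oos_pred: (n_samples, n_configs) pooled out-of-sample CV predictions (assumed layout).
    y:        (n_samples,) true labels.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    n = y.shape[0]
    # 1.0 where a configuration predicted a sample correctly, else 0.0
    correct = (np.asarray(oos_pred) == y[:, None]).astype(float)

    scores = []
    for _ in range(n_boot):
        in_bag = rng.integers(0, n, size=n)      # sample rows with replacement
        out_mask = np.ones(n, dtype=bool)
        out_mask[in_bag] = False                 # rows never drawn are out-of-bag
        if not out_mask.any():
            continue                             # no held-out rows in this draw
        best = int(np.argmax(correct[in_bag].mean(axis=0)))  # select on in-bag rows
        scores.append(correct[out_mask, best].mean())        # score on out-of-bag rows
    if not scores:
        raise ValueError("no bootstrap draw left any out-of-bag samples")
    return float(np.mean(scores))


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    y = rng.integers(0, 2, size=200)
    preds = rng.integers(0, 2, size=(200, 100))  # 100 configurations that only guess
    print("naive :", naive_best_accuracy(preds, y))
    print("BBC-CV:", bbc_cv_accuracy(preds, y))
```

In the demo every configuration guesses at random, so any accuracy above chance reported by the naive estimate comes purely from having picked the luckiest of many candidates; the bootstrap-corrected estimate is intended to remove that optimism.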