{"title":"星号:生成具有自动主动监督的大型训练数据集","authors":"Mona Nashaat, Aindrila Ghosh, James Miller, Shaikh Quader","doi":"10.1145/3385188","DOIUrl":null,"url":null,"abstract":"ing with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. © 2020 Association for Computing Machinery. 2577-3224/2020/05-ART13 $15.00 https://doi.org/10.1145/3385188 ACM/IMS Transactions on Data Science, Vol. 1, No. 2, Article 13. Publication date: May 2020. 13:2 M. Nashaat et al. large training datasets [1], the cost of labeling these datasets has become a significant expense for businesses and large organizations. In real-world settings, domain experience is usually required to accomplish or at least supervise such labeling processes; this makes the process of obtaining large-scale hand-labeled training data prohibitively expensive. For these reasons, several researchers [2–7] have proposed techniques to generate training data with minimal annotation effort. One approach that aims at generating labeled datasets at scale is weak supervision [2]. In weak supervision, practitioners turn to noisy labels [3], which are programmatically generated using cheaper annotation sources such as crowdsourcing [4], external knowledge bases [5], and user-defined heuristics [6]. Previous research [6–9] has shown that weak supervision can produce less-than-ideal training datasets at a large scale for a wide range of applications. These labels can then be used to train many complex machine learning models, such as deep learning. Alternatively, other well-studied techniques rely on semisupervised learning [10, 11]. Semisupervised techniques exploit a small labeled set to derive assumptions about the data structure and leverage a larger unlabeled dataset. For this purpose, some techniques [11] employ the concept of generative models to utilize the unlabeled data and learn the data representation. Generative models produce samples after learning the underlying data distribution; these samples can then be used as training labels for discriminative models. On the other hand, active learning (AL) [7] is a special kind of semisupervised learning that has been used for decades to achieve a high level of classification accuracy while optimizing the annotation cost. In AL settings, instead of manually labeling an entire dataset, an algorithm iteratively selects the most valuable points to classify and asks the user to only label these points. Although AL does not aim at producing labeled datasets, it helps in reducing the annotation cost while building machine learning models that generalize beyond the training data. A closer look at these labeling techniques, however, reveals several gaps and shortcomings [12–16]. On the one hand, since cheaper annotation methods are used in weak supervision, these sources are expected to overlap and conflict, which affects the quality of the resulting labels [12]. To estimate the level of noise in the generated labels, previous studies introduce the data programming (DP) paradigm [2, 12], which uses generative models to integrate the outcome of multiple weak supervision. Nevertheless, the uncertainty levels originating from these weak sources can complicate the process of learning the structure of these generative models [12]. 
Moreover, these approaches require users to design a set of user-defined heuristics [6] to encode their domain experience, which can be an expensive and time-consuming process [13]. On the other hand, active learning can be expensive when applied to high-dimensional datasets [14]. In pool-based settings [17], the active learner performs an iterative process to choose one or more points from an unlabeled pool to query the user in each iteration. This iterative process involves ranking all the points in the unlabeled pool, selecting the points for which true labels should be provided, training a model, and evaluating its performance using a held-out test set. Therefore, any imbalance between the sizes of the unlabeled pool and the labeled dataset can affect the time complexity of the process and increase the annotation cost [14]. Moreover, other studies [15, 16] show that in situations where the unlabeled data points cannot be entirely separated, active learning does not provide much superiority over passive learning. To overcome some of these challenges, we propose Asterisk, a framework to generate highquality training datasets at scale. An overview of the system is presented in Figure 1. As shown in the figure, instead of relying on the end-users to write user-defined heuristics, the proposed approach exploits a small set of labeled data and automatically produces a set of heuristics (weak supervision sources) to assign initial labels. In this phase, the system applies an iterative process of creating, testing, and ranking heuristics in each and every iteration to only accommodate high-quality heuristics. Then, Asterisk examines the disagreements between these heuristics to ACM/IMS Transactions on Data Science, Vol. 1, No. 2, Article 13. Publication date: May 2020. Asterisk: Generating Large Training Datasets with Automatic Active Supervision 13:3 Fig. 1. An overview of the proposed system. model their accuracies. To enhance the quality of the generated labels, the framework improves the accuracies of the heuristics by applying a novel data-driven AL process. During the process, the system examines the generated weak labels along with the modeled accuracies of the heuristics to help the learner decide on the points for which the user should provide true labels. The process aims at enhancing the accuracy and the coverage of the training data while engaging the user in the loop to execute the enhancement process. Therefore, by incorporating the underlying data representation, the user is only queried about the points that are expected to enhance the overall labeling quality. Then, the true labels provided by the users are used to refine the initial labels generated by the heuristics. As the figure shows, the refinement process can be repeated to further enhance the quality of the generated labels. Finally, the framework examines the refined labels and outputs a set of probabilistic labels that can be used to train any downstream classifier. A prototype implementation of the proposed framework is available at https://github.ibm.com/Mona-Nashaat-Ali-Elmowafy/Asterisk. To evaluate the proposed method, we compare its performance with the performances of four state-of-the-art techniques including data programming [2], automated weak supervision [13], and traditional active learning strategies [17]. During the experiments, we report the labeling accuracy, annotation cost, and performance of the end model trained with the generated labels. 
The primary contributions of this research can be summarized as follows: • An end-to-end labeling framework is proposed to create high-quality, large-scale training datasets. We describe the architecture of the proposed system, which includes a novel process of automatic generation of labeling heuristics instead of relying on the end-user to manually define the weak sources. • We propose a data-driven active learning process to enhance the accuracy of the generated weak labels. The process learns the selection policy while taking the distribution of the underlying data and the labeling confidence to optimize user engagement. • We applied a comprehensive set of experiments to evaluate the proposed method against state-of-the-art techniques. The experimental evaluation explores a wide range of domains with 10 datasets that vary in size and dimensionality with a maximum size of 11M records. We also use a real-world business dataset of 1.5M records provided by our industrial partner, IBM. The experiments also include a micro-benchmarking to evaluate the individual components of the proposed approach. The remainder of the article is structured as follows: Section 2 presents the background related to this research. Section 3 states, in detail, the design of the proposed solution. Section 4 presents ACM/IMS Transactions on Data Science, Vol. 1, No. 2, Article 13. Publication date: May 2020. 13:4 M. Nashaat et al. the performed experiments and reports the obtained results. Section 5 discusses related work, and Section 6 concludes the article.","PeriodicalId":93404,"journal":{"name":"ACM/IMS transactions on data science","volume":"3 1","pages":"13:1-13:25"},"PeriodicalIF":0.0000,"publicationDate":"2020-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"44","resultStr":"{\"title\":\"Asterisk: Generating Large Training Datasets with Automatic Active Supervision\",\"authors\":\"Mona Nashaat, Aindrila Ghosh, James Miller, Shaikh Quader\",\"doi\":\"10.1145/3385188\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"ing with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. © 2020 Association for Computing Machinery. 2577-3224/2020/05-ART13 $15.00 https://doi.org/10.1145/3385188 ACM/IMS Transactions on Data Science, Vol. 1, No. 2, Article 13. Publication date: May 2020. 13:2 M. Nashaat et al. large training datasets [1], the cost of labeling these datasets has become a significant expense for businesses and large organizations. In real-world settings, domain experience is usually required to accomplish or at least supervise such labeling processes; this makes the process of obtaining large-scale hand-labeled training data prohibitively expensive. For these reasons, several researchers [2–7] have proposed techniques to generate training data with minimal annotation effort. One approach that aims at generating labeled datasets at scale is weak supervision [2]. In weak supervision, practitioners turn to noisy labels [3], which are programmatically generated using cheaper annotation sources such as crowdsourcing [4], external knowledge bases [5], and user-defined heuristics [6]. Previous research [6–9] has shown that weak supervision can produce less-than-ideal training datasets at a large scale for a wide range of applications. 
These labels can then be used to train many complex machine learning models, such as deep learning. Alternatively, other well-studied techniques rely on semisupervised learning [10, 11]. Semisupervised techniques exploit a small labeled set to derive assumptions about the data structure and leverage a larger unlabeled dataset. For this purpose, some techniques [11] employ the concept of generative models to utilize the unlabeled data and learn the data representation. Generative models produce samples after learning the underlying data distribution; these samples can then be used as training labels for discriminative models. On the other hand, active learning (AL) [7] is a special kind of semisupervised learning that has been used for decades to achieve a high level of classification accuracy while optimizing the annotation cost. In AL settings, instead of manually labeling an entire dataset, an algorithm iteratively selects the most valuable points to classify and asks the user to only label these points. Although AL does not aim at producing labeled datasets, it helps in reducing the annotation cost while building machine learning models that generalize beyond the training data. A closer look at these labeling techniques, however, reveals several gaps and shortcomings [12–16]. On the one hand, since cheaper annotation methods are used in weak supervision, these sources are expected to overlap and conflict, which affects the quality of the resulting labels [12]. To estimate the level of noise in the generated labels, previous studies introduce the data programming (DP) paradigm [2, 12], which uses generative models to integrate the outcome of multiple weak supervision. Nevertheless, the uncertainty levels originating from these weak sources can complicate the process of learning the structure of these generative models [12]. Moreover, these approaches require users to design a set of user-defined heuristics [6] to encode their domain experience, which can be an expensive and time-consuming process [13]. On the other hand, active learning can be expensive when applied to high-dimensional datasets [14]. In pool-based settings [17], the active learner performs an iterative process to choose one or more points from an unlabeled pool to query the user in each iteration. This iterative process involves ranking all the points in the unlabeled pool, selecting the points for which true labels should be provided, training a model, and evaluating its performance using a held-out test set. Therefore, any imbalance between the sizes of the unlabeled pool and the labeled dataset can affect the time complexity of the process and increase the annotation cost [14]. Moreover, other studies [15, 16] show that in situations where the unlabeled data points cannot be entirely separated, active learning does not provide much superiority over passive learning. To overcome some of these challenges, we propose Asterisk, a framework to generate highquality training datasets at scale. An overview of the system is presented in Figure 1. As shown in the figure, instead of relying on the end-users to write user-defined heuristics, the proposed approach exploits a small set of labeled data and automatically produces a set of heuristics (weak supervision sources) to assign initial labels. In this phase, the system applies an iterative process of creating, testing, and ranking heuristics in each and every iteration to only accommodate high-quality heuristics. 
Then, Asterisk examines the disagreements between these heuristics to ACM/IMS Transactions on Data Science, Vol. 1, No. 2, Article 13. Publication date: May 2020. Asterisk: Generating Large Training Datasets with Automatic Active Supervision 13:3 Fig. 1. An overview of the proposed system. model their accuracies. To enhance the quality of the generated labels, the framework improves the accuracies of the heuristics by applying a novel data-driven AL process. During the process, the system examines the generated weak labels along with the modeled accuracies of the heuristics to help the learner decide on the points for which the user should provide true labels. The process aims at enhancing the accuracy and the coverage of the training data while engaging the user in the loop to execute the enhancement process. Therefore, by incorporating the underlying data representation, the user is only queried about the points that are expected to enhance the overall labeling quality. Then, the true labels provided by the users are used to refine the initial labels generated by the heuristics. As the figure shows, the refinement process can be repeated to further enhance the quality of the generated labels. Finally, the framework examines the refined labels and outputs a set of probabilistic labels that can be used to train any downstream classifier. A prototype implementation of the proposed framework is available at https://github.ibm.com/Mona-Nashaat-Ali-Elmowafy/Asterisk. To evaluate the proposed method, we compare its performance with the performances of four state-of-the-art techniques including data programming [2], automated weak supervision [13], and traditional active learning strategies [17]. During the experiments, we report the labeling accuracy, annotation cost, and performance of the end model trained with the generated labels. The primary contributions of this research can be summarized as follows: • An end-to-end labeling framework is proposed to create high-quality, large-scale training datasets. We describe the architecture of the proposed system, which includes a novel process of automatic generation of labeling heuristics instead of relying on the end-user to manually define the weak sources. • We propose a data-driven active learning process to enhance the accuracy of the generated weak labels. The process learns the selection policy while taking the distribution of the underlying data and the labeling confidence to optimize user engagement. • We applied a comprehensive set of experiments to evaluate the proposed method against state-of-the-art techniques. The experimental evaluation explores a wide range of domains with 10 datasets that vary in size and dimensionality with a maximum size of 11M records. We also use a real-world business dataset of 1.5M records provided by our industrial partner, IBM. The experiments also include a micro-benchmarking to evaluate the individual components of the proposed approach. The remainder of the article is structured as follows: Section 2 presents the background related to this research. Section 3 states, in detail, the design of the proposed solution. Section 4 presents ACM/IMS Transactions on Data Science, Vol. 1, No. 2, Article 13. Publication date: May 2020. 13:4 M. Nashaat et al. the performed experiments and reports the obtained results. 
Section 5 discusses related work, and Section 6 concludes the article.\",\"PeriodicalId\":93404,\"journal\":{\"name\":\"ACM/IMS transactions on data science\",\"volume\":\"3 1\",\"pages\":\"13:1-13:25\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-07-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"44\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ACM/IMS transactions on data science\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3385188\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM/IMS transactions on data science","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3385188","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Cited by: 44
Asterisk: Generating Large Training Datasets with Automatic Active Supervision
Because modern machine learning models require large training datasets [1], the cost of labeling these datasets has become a significant expense for businesses and large organizations. In real-world settings, domain expertise is usually required to carry out, or at least supervise, such labeling processes, which makes obtaining large-scale hand-labeled training data prohibitively expensive. For these reasons, several researchers [2–7] have proposed techniques to generate training data with minimal annotation effort.

One approach that aims at generating labeled datasets at scale is weak supervision [2]. In weak supervision, practitioners turn to noisy labels [3], which are generated programmatically from cheaper annotation sources such as crowdsourcing [4], external knowledge bases [5], and user-defined heuristics [6]. Previous research [6–9] has shown that weak supervision can produce less-than-ideal yet usable training datasets at a large scale for a wide range of applications; these labels can then be used to train complex machine learning models such as deep networks.

Alternatively, other well-studied techniques rely on semisupervised learning [10, 11]. Semisupervised techniques exploit a small labeled set to derive assumptions about the data structure and leverage a larger unlabeled dataset. For this purpose, some techniques [11] employ generative models to utilize the unlabeled data and learn the data representation: after learning the underlying data distribution, a generative model produces samples that can serve as training labels for discriminative models. Active learning (AL) [7], in turn, is a special kind of semisupervised learning that has been used for decades to achieve a high level of classification accuracy while optimizing the annotation cost. In AL settings, instead of manually labeling an entire dataset, an algorithm iteratively selects the most valuable points and asks the user to label only those points. Although AL does not aim at producing labeled datasets, it helps reduce the annotation cost while building machine learning models that generalize beyond the training data.

A closer look at these labeling techniques, however, reveals several gaps and shortcomings [12–16]. On the one hand, because weak supervision relies on cheaper annotation sources, those sources are expected to overlap and conflict, which affects the quality of the resulting labels [12]. To estimate the level of noise in the generated labels, previous studies introduced the data programming (DP) paradigm [2, 12], which uses generative models to integrate the outcomes of multiple weak supervision sources. Nevertheless, the uncertainty originating from these weak sources can complicate the process of learning the structure of these generative models [12]. Moreover, these approaches require users to design a set of user-defined heuristics [6] to encode their domain experience, which can be an expensive and time-consuming process [13].
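To make the weak supervision setting concrete, the following minimal sketch (illustrative only; the spam/ham task, the heuristics, and the majority-vote combiner are assumptions of this example, not the paper's implementation) shows how a handful of programmatic heuristics might assign noisy labels to an unlabeled pool and how their votes could be aggregated into one weak label per record:

```python
# A minimal weak-supervision sketch (illustrative only; not the Asterisk implementation).
# Each heuristic votes SPAM (1), HAM (0), or abstains (-1); votes are combined by majority.
from collections import Counter
from typing import Callable, List

ABSTAIN, HAM, SPAM = -1, 0, 1

def has_url(text: str) -> int:
    return SPAM if "http://" in text or "https://" in text else ABSTAIN

def mentions_prize(text: str) -> int:
    return SPAM if "prize" in text.lower() or "winner" in text.lower() else ABSTAIN

def short_reply(text: str) -> int:
    return HAM if len(text.split()) <= 5 else ABSTAIN

HEURISTICS: List[Callable[[str], int]] = [has_url, mentions_prize, short_reply]

def weak_label(text: str) -> int:
    """Aggregate the heuristics' votes by majority, ignoring abstentions."""
    votes = [h(text) for h in HEURISTICS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN                      # no heuristic fired: the point stays unlabeled
    return Counter(votes).most_common(1)[0][0]

if __name__ == "__main__":
    pool = [
        "You are the lucky winner of a prize http://spam.example",
        "ok, see you at noon",
        "quarterly report attached for review",
    ]
    for text in pool:
        print(weak_label(text), text)
```

Data programming replaces the naive majority vote above with a generative model that estimates each heuristic's accuracy from the observed agreements and disagreements among the sources; learning the structure of that model under high uncertainty is the difficulty referred to above.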
On the other hand, active learning can be expensive when applied to high-dimensional datasets [14]. In pool-based settings [17], the active learner performs an iterative process: in each iteration, it chooses one or more points from an unlabeled pool and queries the user for their labels. This iterative process involves ranking all the points in the unlabeled pool, selecting the points for which true labels should be provided, training a model, and evaluating its performance on a held-out test set. Therefore, any imbalance between the sizes of the unlabeled pool and the labeled dataset can affect the time complexity of the process and increase the annotation cost [14]. Moreover, other studies [15, 16] show that in situations where the unlabeled data points cannot be entirely separated, active learning does not provide much superiority over passive learning.
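For reference, a minimal pool-based loop with uncertainty sampling might look like the sketch below; this is an illustrative baseline built on scikit-learn with a synthetic dataset and a fixed query budget, not Asterisk's data-driven selection policy.

```python
# Minimal pool-based active learning with uncertainty sampling (illustrative sketch only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Seed the labeled set with one example per class so the first model can be trained.
labeled = [int(np.flatnonzero(y_pool == 0)[0]), int(np.flatnonzero(y_pool == 1)[0])]
unlabeled = [i for i in range(len(X_pool)) if i not in labeled]

model = LogisticRegression(max_iter=1000)
for round_ in range(20):                               # annotation budget: 20 queries
    model.fit(X_pool[labeled], y_pool[labeled])
    # Rank every point in the unlabeled pool by the model's prediction uncertainty.
    proba = model.predict_proba(X_pool[unlabeled])
    uncertainty = 1.0 - proba.max(axis=1)
    query = unlabeled[int(np.argmax(uncertainty))]
    labeled.append(query)                              # a real user would supply this label
    unlabeled.remove(query)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"round {round_:2d}  labeled={len(labeled):3d}  held-out accuracy={acc:.3f}")
```

Because every round rescans the entire unlabeled pool, the cost of the loop grows with the pool size, which is the imbalance issue noted above.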
To overcome some of these challenges, we propose Asterisk, a framework for generating high-quality training datasets at scale. An overview of the system is presented in Figure 1. As the figure shows, instead of relying on end-users to write user-defined heuristics, the proposed approach exploits a small set of labeled data and automatically produces a set of heuristics (weak supervision sources) to assign initial labels. In this phase, the system applies an iterative process of creating, testing, and ranking heuristics so that only high-quality heuristics are retained. Asterisk then examines the disagreements among these heuristics to model their accuracies. To enhance the quality of the generated labels, the framework improves the accuracies of the heuristics by applying a novel data-driven AL process. During this process, the system examines the generated weak labels together with the modeled accuracies of the heuristics to help the learner decide on the points for which the user should provide true labels. The process aims at enhancing the accuracy and the coverage of the training data while keeping the user in the loop. Because the underlying data representation is incorporated, the user is queried only about the points that are expected to enhance the overall labeling quality. The true labels provided by the user are then used to refine the initial labels generated by the heuristics, and this refinement can be repeated to further enhance label quality. Finally, the framework examines the refined labels and outputs a set of probabilistic labels that can be used to train any downstream classifier. A prototype implementation of the proposed framework is available at https://github.ibm.com/Mona-Nashaat-Ali-Elmowafy/Asterisk.

To evaluate the proposed method, we compare its performance with that of four state-of-the-art techniques, including data programming [2], automated weak supervision [13], and traditional active learning strategies [17]. During the experiments, we report the labeling accuracy, the annotation cost, and the performance of the end model trained with the generated labels.

The primary contributions of this research can be summarized as follows:
• An end-to-end labeling framework is proposed to create high-quality, large-scale training datasets. We describe the architecture of the proposed system, which includes a novel process for automatically generating labeling heuristics instead of relying on the end-user to manually define the weak sources.
• We propose a data-driven active learning process to enhance the accuracy of the generated weak labels. The process learns the selection policy while taking into account the distribution of the underlying data and the labeling confidence, in order to optimize user engagement.
• We conducted a comprehensive set of experiments to evaluate the proposed method against state-of-the-art techniques. The experimental evaluation covers a wide range of domains with 10 datasets that vary in size and dimensionality, the largest containing 11M records. We also use a real-world business dataset of 1.5M records provided by our industrial partner, IBM. The experiments also include a micro-benchmark to evaluate the individual components of the proposed approach.

The remainder of the article is structured as follows: Section 2 presents the background related to this research. Section 3 details the design of the proposed solution. Section 4 presents the performed experiments and reports the obtained results. Section 5 discusses related work, and Section 6 concludes the article.