{"title":"星号:生成具有自动主动监督的大型训练数据集","authors":"Mona Nashaat, Aindrila Ghosh, James Miller, Shaikh Quader","doi":"10.1145/3385188","DOIUrl":null,"url":null,"abstract":"ing with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. © 2020 Association for Computing Machinery. 2577-3224/2020/05-ART13 $15.00 https://doi.org/10.1145/3385188 ACM/IMS Transactions on Data Science, Vol. 1, No. 2, Article 13. Publication date: May 2020. 13:2 M. Nashaat et al. large training datasets [1], the cost of labeling these datasets has become a significant expense for businesses and large organizations. In real-world settings, domain experience is usually required to accomplish or at least supervise such labeling processes; this makes the process of obtaining large-scale hand-labeled training data prohibitively expensive. For these reasons, several researchers [2–7] have proposed techniques to generate training data with minimal annotation effort. One approach that aims at generating labeled datasets at scale is weak supervision [2]. In weak supervision, practitioners turn to noisy labels [3], which are programmatically generated using cheaper annotation sources such as crowdsourcing [4], external knowledge bases [5], and user-defined heuristics [6]. Previous research [6–9] has shown that weak supervision can produce less-than-ideal training datasets at a large scale for a wide range of applications. These labels can then be used to train many complex machine learning models, such as deep learning. Alternatively, other well-studied techniques rely on semisupervised learning [10, 11]. Semisupervised techniques exploit a small labeled set to derive assumptions about the data structure and leverage a larger unlabeled dataset. For this purpose, some techniques [11] employ the concept of generative models to utilize the unlabeled data and learn the data representation. Generative models produce samples after learning the underlying data distribution; these samples can then be used as training labels for discriminative models. On the other hand, active learning (AL) [7] is a special kind of semisupervised learning that has been used for decades to achieve a high level of classification accuracy while optimizing the annotation cost. In AL settings, instead of manually labeling an entire dataset, an algorithm iteratively selects the most valuable points to classify and asks the user to only label these points. Although AL does not aim at producing labeled datasets, it helps in reducing the annotation cost while building machine learning models that generalize beyond the training data. A closer look at these labeling techniques, however, reveals several gaps and shortcomings [12–16]. On the one hand, since cheaper annotation methods are used in weak supervision, these sources are expected to overlap and conflict, which affects the quality of the resulting labels [12]. To estimate the level of noise in the generated labels, previous studies introduce the data programming (DP) paradigm [2, 12], which uses generative models to integrate the outcome of multiple weak supervision. Nevertheless, the uncertainty levels originating from these weak sources can complicate the process of learning the structure of these generative models [12]. 
Moreover, these approaches require users to design a set of user-defined heuristics [6] to encode their domain experience, which can be an expensive and time-consuming process [13]. On the other hand, active learning can be expensive when applied to high-dimensional datasets [14]. In pool-based settings [17], the active learner performs an iterative process to choose one or more points from an unlabeled pool to query the user in each iteration. This iterative process involves ranking all the points in the unlabeled pool, selecting the points for which true labels should be provided, training a model, and evaluating its performance using a held-out test set. Therefore, any imbalance between the sizes of the unlabeled pool and the labeled dataset can affect the time complexity of the process and increase the annotation cost [14]. Moreover, other studies [15, 16] show that in situations where the unlabeled data points cannot be entirely separated, active learning does not provide much superiority over passive learning. To overcome some of these challenges, we propose Asterisk, a framework to generate highquality training datasets at scale. An overview of the system is presented in Figure 1. As shown in the figure, instead of relying on the end-users to write user-defined heuristics, the proposed approach exploits a small set of labeled data and automatically produces a set of heuristics (weak supervision sources) to assign initial labels. In this phase, the system applies an iterative process of creating, testing, and ranking heuristics in each and every iteration to only accommodate high-quality heuristics. Then, Asterisk examines the disagreements between these heuristics to ACM/IMS Transactions on Data Science, Vol. 1, No. 2, Article 13. Publication date: May 2020. Asterisk: Generating Large Training Datasets with Automatic Active Supervision 13:3 Fig. 1. An overview of the proposed system. model their accuracies. To enhance the quality of the generated labels, the framework improves the accuracies of the heuristics by applying a novel data-driven AL process. During the process, the system examines the generated weak labels along with the modeled accuracies of the heuristics to help the learner decide on the points for which the user should provide true labels. The process aims at enhancing the accuracy and the coverage of the training data while engaging the user in the loop to execute the enhancement process. Therefore, by incorporating the underlying data representation, the user is only queried about the points that are expected to enhance the overall labeling quality. Then, the true labels provided by the users are used to refine the initial labels generated by the heuristics. As the figure shows, the refinement process can be repeated to further enhance the quality of the generated labels. Finally, the framework examines the refined labels and outputs a set of probabilistic labels that can be used to train any downstream classifier. A prototype implementation of the proposed framework is available at https://github.ibm.com/Mona-Nashaat-Ali-Elmowafy/Asterisk. To evaluate the proposed method, we compare its performance with the performances of four state-of-the-art techniques including data programming [2], automated weak supervision [13], and traditional active learning strategies [17]. During the experiments, we report the labeling accuracy, annotation cost, and performance of the end model trained with the generated labels. 
The primary contributions of this research can be summarized as follows: • An end-to-end labeling framework is proposed to create high-quality, large-scale training datasets. We describe the architecture of the proposed system, which includes a novel process of automatic generation of labeling heuristics instead of relying on the end-user to manually define the weak sources. • We propose a data-driven active learning process to enhance the accuracy of the generated weak labels. The process learns the selection policy while taking the distribution of the underlying data and the labeling confidence to optimize user engagement. • We applied a comprehensive set of experiments to evaluate the proposed method against state-of-the-art techniques. The experimental evaluation explores a wide range of domains with 10 datasets that vary in size and dimensionality with a maximum size of 11M records. We also use a real-world business dataset of 1.5M records provided by our industrial partner, IBM. The experiments also include a micro-benchmarking to evaluate the individual components of the proposed approach. The remainder of the article is structured as follows: Section 2 presents the background related to this research. Section 3 states, in detail, the design of the proposed solution. Section 4 presents ACM/IMS Transactions on Data Science, Vol. 1, No. 2, Article 13. Publication date: May 2020. 13:4 M. Nashaat et al. the performed experiments and reports the obtained results. Section 5 discusses related work, and Section 6 concludes the article.","PeriodicalId":93404,"journal":{"name":"ACM/IMS transactions on data science","volume":"3 1","pages":"13:1-13:25"},"PeriodicalIF":0.0000,"publicationDate":"2020-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"44","resultStr":"{\"title\":\"Asterisk: Generating Large Training Datasets with Automatic Active Supervision\",\"authors\":\"Mona Nashaat, Aindrila Ghosh, James Miller, Shaikh Quader\",\"doi\":\"10.1145/3385188\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"ing with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. © 2020 Association for Computing Machinery. 2577-3224/2020/05-ART13 $15.00 https://doi.org/10.1145/3385188 ACM/IMS Transactions on Data Science, Vol. 1, No. 2, Article 13. Publication date: May 2020. 13:2 M. Nashaat et al. large training datasets [1], the cost of labeling these datasets has become a significant expense for businesses and large organizations. In real-world settings, domain experience is usually required to accomplish or at least supervise such labeling processes; this makes the process of obtaining large-scale hand-labeled training data prohibitively expensive. For these reasons, several researchers [2–7] have proposed techniques to generate training data with minimal annotation effort. One approach that aims at generating labeled datasets at scale is weak supervision [2]. In weak supervision, practitioners turn to noisy labels [3], which are programmatically generated using cheaper annotation sources such as crowdsourcing [4], external knowledge bases [5], and user-defined heuristics [6]. Previous research [6–9] has shown that weak supervision can produce less-than-ideal training datasets at a large scale for a wide range of applications. 
These labels can then be used to train many complex machine learning models, such as deep learning. Alternatively, other well-studied techniques rely on semisupervised learning [10, 11]. Semisupervised techniques exploit a small labeled set to derive assumptions about the data structure and leverage a larger unlabeled dataset. For this purpose, some techniques [11] employ the concept of generative models to utilize the unlabeled data and learn the data representation. Generative models produce samples after learning the underlying data distribution; these samples can then be used as training labels for discriminative models. On the other hand, active learning (AL) [7] is a special kind of semisupervised learning that has been used for decades to achieve a high level of classification accuracy while optimizing the annotation cost. In AL settings, instead of manually labeling an entire dataset, an algorithm iteratively selects the most valuable points to classify and asks the user to only label these points. Although AL does not aim at producing labeled datasets, it helps in reducing the annotation cost while building machine learning models that generalize beyond the training data. A closer look at these labeling techniques, however, reveals several gaps and shortcomings [12–16]. On the one hand, since cheaper annotation methods are used in weak supervision, these sources are expected to overlap and conflict, which affects the quality of the resulting labels [12]. To estimate the level of noise in the generated labels, previous studies introduce the data programming (DP) paradigm [2, 12], which uses generative models to integrate the outcome of multiple weak supervision. Nevertheless, the uncertainty levels originating from these weak sources can complicate the process of learning the structure of these generative models [12]. Moreover, these approaches require users to design a set of user-defined heuristics [6] to encode their domain experience, which can be an expensive and time-consuming process [13]. On the other hand, active learning can be expensive when applied to high-dimensional datasets [14]. In pool-based settings [17], the active learner performs an iterative process to choose one or more points from an unlabeled pool to query the user in each iteration. This iterative process involves ranking all the points in the unlabeled pool, selecting the points for which true labels should be provided, training a model, and evaluating its performance using a held-out test set. Therefore, any imbalance between the sizes of the unlabeled pool and the labeled dataset can affect the time complexity of the process and increase the annotation cost [14]. Moreover, other studies [15, 16] show that in situations where the unlabeled data points cannot be entirely separated, active learning does not provide much superiority over passive learning. To overcome some of these challenges, we propose Asterisk, a framework to generate highquality training datasets at scale. An overview of the system is presented in Figure 1. As shown in the figure, instead of relying on the end-users to write user-defined heuristics, the proposed approach exploits a small set of labeled data and automatically produces a set of heuristics (weak supervision sources) to assign initial labels. In this phase, the system applies an iterative process of creating, testing, and ranking heuristics in each and every iteration to only accommodate high-quality heuristics. 
Then, Asterisk examines the disagreements between these heuristics to ACM/IMS Transactions on Data Science, Vol. 1, No. 2, Article 13. Publication date: May 2020. Asterisk: Generating Large Training Datasets with Automatic Active Supervision 13:3 Fig. 1. An overview of the proposed system. model their accuracies. To enhance the quality of the generated labels, the framework improves the accuracies of the heuristics by applying a novel data-driven AL process. During the process, the system examines the generated weak labels along with the modeled accuracies of the heuristics to help the learner decide on the points for which the user should provide true labels. The process aims at enhancing the accuracy and the coverage of the training data while engaging the user in the loop to execute the enhancement process. Therefore, by incorporating the underlying data representation, the user is only queried about the points that are expected to enhance the overall labeling quality. Then, the true labels provided by the users are used to refine the initial labels generated by the heuristics. As the figure shows, the refinement process can be repeated to further enhance the quality of the generated labels. Finally, the framework examines the refined labels and outputs a set of probabilistic labels that can be used to train any downstream classifier. A prototype implementation of the proposed framework is available at https://github.ibm.com/Mona-Nashaat-Ali-Elmowafy/Asterisk. To evaluate the proposed method, we compare its performance with the performances of four state-of-the-art techniques including data programming [2], automated weak supervision [13], and traditional active learning strategies [17]. During the experiments, we report the labeling accuracy, annotation cost, and performance of the end model trained with the generated labels. The primary contributions of this research can be summarized as follows: • An end-to-end labeling framework is proposed to create high-quality, large-scale training datasets. We describe the architecture of the proposed system, which includes a novel process of automatic generation of labeling heuristics instead of relying on the end-user to manually define the weak sources. • We propose a data-driven active learning process to enhance the accuracy of the generated weak labels. The process learns the selection policy while taking the distribution of the underlying data and the labeling confidence to optimize user engagement. • We applied a comprehensive set of experiments to evaluate the proposed method against state-of-the-art techniques. The experimental evaluation explores a wide range of domains with 10 datasets that vary in size and dimensionality with a maximum size of 11M records. We also use a real-world business dataset of 1.5M records provided by our industrial partner, IBM. The experiments also include a micro-benchmarking to evaluate the individual components of the proposed approach. The remainder of the article is structured as follows: Section 2 presents the background related to this research. Section 3 states, in detail, the design of the proposed solution. Section 4 presents ACM/IMS Transactions on Data Science, Vol. 1, No. 2, Article 13. Publication date: May 2020. 13:4 M. Nashaat et al. the performed experiments and reports the obtained results. 
Section 5 discusses related work, and Section 6 concludes the article.\",\"PeriodicalId\":93404,\"journal\":{\"name\":\"ACM/IMS transactions on data science\",\"volume\":\"3 1\",\"pages\":\"13:1-13:25\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-07-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"44\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ACM/IMS transactions on data science\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3385188\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM/IMS transactions on data science","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3385188","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Cited by: 44
Asterisk: Generating Large Training Datasets with Automatic Active Supervision
Because modern machine learning models require large training datasets [1], the cost of labeling these datasets has become a significant expense for businesses and large organizations. In real-world settings, domain expertise is usually required to carry out, or at least supervise, such labeling processes, which makes obtaining large-scale hand-labeled training data prohibitively expensive. For these reasons, several researchers [2–7] have proposed techniques to generate training data with minimal annotation effort.

One approach that aims at generating labeled datasets at scale is weak supervision [2]. In weak supervision, practitioners turn to noisy labels [3], which are generated programmatically from cheaper annotation sources such as crowdsourcing [4], external knowledge bases [5], and user-defined heuristics [6]. Previous research [6–9] has shown that weak supervision can produce less-than-ideal yet usable training datasets at a large scale for a wide range of applications; these labels can then be used to train complex machine learning models such as deep networks.

Alternatively, other well-studied techniques rely on semisupervised learning [10, 11]. Semisupervised techniques exploit a small labeled set to derive assumptions about the data structure and leverage a larger unlabeled dataset. For this purpose, some techniques [11] employ generative models to utilize the unlabeled data and learn the data representation: after learning the underlying data distribution, a generative model produces samples that can serve as training labels for discriminative models. Active learning (AL) [7], in turn, is a special kind of semisupervised learning that has been used for decades to achieve a high level of classification accuracy while optimizing the annotation cost. In AL settings, instead of manually labeling an entire dataset, an algorithm iteratively selects the most valuable points and asks the user to label only those points. Although AL does not aim at producing labeled datasets, it helps reduce the annotation cost while building machine learning models that generalize beyond the training data.

A closer look at these labeling techniques, however, reveals several gaps and shortcomings [12–16]. On the one hand, because weak supervision relies on cheaper annotation sources, those sources are expected to overlap and conflict, which affects the quality of the resulting labels [12]. To estimate the level of noise in the generated labels, previous studies introduced the data programming (DP) paradigm [2, 12], which uses generative models to integrate the outcomes of multiple weak supervision sources. Nevertheless, the uncertainty originating from these weak sources can complicate the process of learning the structure of these generative models [12]. Moreover, these approaches require users to design a set of user-defined heuristics [6] to encode their domain experience, which can be an expensive and time-consuming process [13].
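To make the weak supervision setting concrete, the following minimal sketch (illustrative only; the spam/ham task, the heuristics, and the majority-vote combiner are assumptions of this example, not the paper's implementation) shows how a handful of programmatic heuristics might assign noisy labels to an unlabeled pool and how their votes could be aggregated into one weak label per record:

```python
# A minimal weak-supervision sketch (illustrative only; not the Asterisk implementation).
# Each heuristic votes SPAM (1), HAM (0), or abstains (-1); votes are combined by majority.
from collections import Counter
from typing import Callable, List

ABSTAIN, HAM, SPAM = -1, 0, 1

def has_url(text: str) -> int:
    return SPAM if "http://" in text or "https://" in text else ABSTAIN

def mentions_prize(text: str) -> int:
    return SPAM if "prize" in text.lower() or "winner" in text.lower() else ABSTAIN

def short_reply(text: str) -> int:
    return HAM if len(text.split()) <= 5 else ABSTAIN

HEURISTICS: List[Callable[[str], int]] = [has_url, mentions_prize, short_reply]

def weak_label(text: str) -> int:
    """Aggregate the heuristics' votes by majority, ignoring abstentions."""
    votes = [h(text) for h in HEURISTICS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN                      # no heuristic fired: the point stays unlabeled
    return Counter(votes).most_common(1)[0][0]

if __name__ == "__main__":
    pool = [
        "You are the lucky winner of a prize http://spam.example",
        "ok, see you at noon",
        "quarterly report attached for review",
    ]
    for text in pool:
        print(weak_label(text), text)
```

Data programming replaces the naive majority vote above with a generative model that estimates each heuristic's accuracy from the observed agreements and disagreements among the sources; learning the structure of that model under high uncertainty is the difficulty referred to above.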
On the other hand, active learning can be expensive when applied to high-dimensional datasets [14]. In pool-based settings [17], the active learner performs an iterative process: in each iteration, it chooses one or more points from an unlabeled pool and queries the user for their labels. This iterative process involves ranking all the points in the unlabeled pool, selecting the points for which true labels should be provided, training a model, and evaluating its performance on a held-out test set. Therefore, any imbalance between the sizes of the unlabeled pool and the labeled dataset can affect the time complexity of the process and increase the annotation cost [14]. Moreover, other studies [15, 16] show that in situations where the unlabeled data points cannot be entirely separated, active learning does not provide much superiority over passive learning.
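For reference, a minimal pool-based loop with uncertainty sampling might look like the sketch below; this is an illustrative baseline built on scikit-learn with a synthetic dataset and a fixed query budget, not Asterisk's data-driven selection policy.

```python
# Minimal pool-based active learning with uncertainty sampling (illustrative sketch only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Seed the labeled set with one example per class so the first model can be trained.
labeled = [int(np.flatnonzero(y_pool == 0)[0]), int(np.flatnonzero(y_pool == 1)[0])]
unlabeled = [i for i in range(len(X_pool)) if i not in labeled]

model = LogisticRegression(max_iter=1000)
for round_ in range(20):                               # annotation budget: 20 queries
    model.fit(X_pool[labeled], y_pool[labeled])
    # Rank every point in the unlabeled pool by the model's prediction uncertainty.
    proba = model.predict_proba(X_pool[unlabeled])
    uncertainty = 1.0 - proba.max(axis=1)
    query = unlabeled[int(np.argmax(uncertainty))]
    labeled.append(query)                              # a real user would supply this label
    unlabeled.remove(query)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"round {round_:2d}  labeled={len(labeled):3d}  held-out accuracy={acc:.3f}")
```

Because every round rescans the entire unlabeled pool, the cost of the loop grows with the pool size, which is the imbalance issue noted above.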
To overcome some of these challenges, we propose Asterisk, a framework for generating high-quality training datasets at scale. An overview of the system is presented in Figure 1. As the figure shows, instead of relying on end-users to write user-defined heuristics, the proposed approach exploits a small set of labeled data and automatically produces a set of heuristics (weak supervision sources) to assign initial labels. In this phase, the system applies an iterative process of creating, testing, and ranking heuristics so that only high-quality heuristics are retained. Asterisk then examines the disagreements among these heuristics to model their accuracies. To enhance the quality of the generated labels, the framework improves the accuracies of the heuristics by applying a novel data-driven AL process. During this process, the system examines the generated weak labels together with the modeled accuracies of the heuristics to help the learner decide on the points for which the user should provide true labels. The process aims at enhancing the accuracy and the coverage of the training data while keeping the user in the loop. Because the underlying data representation is incorporated, the user is queried only about the points that are expected to enhance the overall labeling quality. The true labels provided by the user are then used to refine the initial labels generated by the heuristics, and this refinement can be repeated to further enhance label quality. Finally, the framework examines the refined labels and outputs a set of probabilistic labels that can be used to train any downstream classifier. A prototype implementation of the proposed framework is available at https://github.ibm.com/Mona-Nashaat-Ali-Elmowafy/Asterisk.

To evaluate the proposed method, we compare its performance with that of four state-of-the-art techniques, including data programming [2], automated weak supervision [13], and traditional active learning strategies [17]. During the experiments, we report the labeling accuracy, the annotation cost, and the performance of the end model trained with the generated labels.

The primary contributions of this research can be summarized as follows:
• An end-to-end labeling framework is proposed to create high-quality, large-scale training datasets. We describe the architecture of the proposed system, which includes a novel process for automatically generating labeling heuristics instead of relying on the end-user to manually define the weak sources.
• We propose a data-driven active learning process to enhance the accuracy of the generated weak labels. The process learns the selection policy while taking into account the distribution of the underlying data and the labeling confidence, in order to optimize user engagement.
• We conducted a comprehensive set of experiments to evaluate the proposed method against state-of-the-art techniques. The experimental evaluation covers a wide range of domains with 10 datasets that vary in size and dimensionality, the largest containing 11M records. We also use a real-world business dataset of 1.5M records provided by our industrial partner, IBM. The experiments also include a micro-benchmark to evaluate the individual components of the proposed approach.

The remainder of the article is structured as follows: Section 2 presents the background related to this research. Section 3 details the design of the proposed solution. Section 4 presents the performed experiments and reports the obtained results. Section 5 discusses related work, and Section 6 concludes the article.