A Case Study of the Augmentation and Evaluation of Training Data for Deep Learning

Journal of Data and Information Quality (JDIQ) Pub Date : 2019-08-19 DOI:10.1145/3317573

Junhua Ding, Xinchuan Li, Xiaojun Kang, V. Gudivada

Journal of Data and Information Quality (JDIQ), volume 10 1, pages 1 - 22. Publication type: Journal Article.

Citations: 15

Abstract

Deep learning has been widely used for extracting value from big data. Like many other machine learning algorithms, deep learning requires a significant amount of training data. Experiments have shown that both the volume and the quality of training data can significantly impact the effectiveness of the value extraction. In some cases, the volume of training data is not large enough to train a deep learning model effectively. In other cases, the quality of training data is not high enough to achieve optimal performance. Many approaches have been proposed for augmenting training data to mitigate these deficiencies. However, whether the augmented data are “fit for purpose” for deep learning is still an open question, and a framework for comprehensively evaluating the effectiveness of augmented data for deep learning is not yet available. In this article, we first discuss a data augmentation approach for deep learning. The approach has two components: the first removes noisy data from a dataset using machine-learning-based classification to improve its quality, and the second increases the volume of the dataset so that a deep learning model can be trained effectively. To evaluate the quality of the augmented data in terms of fidelity, variety, and veracity, a data quality evaluation framework is proposed. We demonstrate the effectiveness of the data augmentation approach and the data quality evaluation framework by studying the automated classification of biological cell images using deep learning. The experimental results clearly demonstrate the impact of the volume and quality of training data on the performance of deep learning, and the importance of data quality evaluation. The data augmentation approach and the data quality evaluation framework can be straightforwardly adapted to deep learning studies in other domains.
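The two components described in the abstract (noise removal followed by volume increase) can be illustrated with a minimal sketch. The abstract does not say which classifier or which transformations the authors used, so everything below is an assumption made for illustration: a distance-to-class-mean filter stands in for the machine-learning-based noise classifier, rotations and a horizontal flip stand in for the volume-increasing step, images are small square grayscale grids stored as nested lists, and the names `filter_noisy`, `augment`, `rotate90`, and `hflip` are hypothetical.

```python
import math


def rotate90(img):
    """Rotate a 2-D image (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def hflip(img):
    """Mirror a 2-D image left to right."""
    return [list(reversed(row)) for row in img]


def filter_noisy(images, labels, keep_fraction=0.9):
    """Per class, keep the samples closest to the class mean image.

    Assumption: this simple distance filter is only a stand-in for the
    trained noise classifier mentioned in the abstract.
    """
    kept = []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        flat = {i: [p for row in images[i] for p in row] for i in idx}
        n_pix = len(flat[idx[0]])
        centroid = [sum(flat[i][j] for i in idx) / len(idx) for j in range(n_pix)]
        # Sort this class's samples from most to least typical.
        idx.sort(key=lambda i: math.dist(flat[i], centroid))
        n_keep = max(1, round(keep_fraction * len(idx)))
        kept.extend(idx[:n_keep])
    kept.sort()  # restore the original sample order
    return [images[i] for i in kept], [labels[i] for i in kept]


def augment(images, labels):
    """Return each image plus three rotations and one horizontal flip.

    Assumption: these transformations preserve the class label, which is
    plausible for cell images but must be checked for other domains.
    """
    out_x, out_y = [], []
    for img, y in zip(images, labels):
        variants = [img]
        for _ in range(3):
            variants.append(rotate90(variants[-1]))
        variants.append(hflip(img))
        out_x.extend(variants)
        out_y.extend([y] * len(variants))
    return out_x, out_y


# Toy data: each class has three similar images and one obvious outlier.
images = [
    [[0, 0], [0, 1]], [[0, 0], [0, 1]], [[0, 0], [0, 1]], [[9, 9], [9, 9]],
    [[5, 5], [5, 4]], [[5, 5], [5, 4]], [[5, 5], [5, 4]], [[0, 0], [0, 0]],
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

clean_x, clean_y = filter_noisy(images, labels, keep_fraction=0.75)
aug_x, aug_y = augment(clean_x, clean_y)
print(len(images), len(clean_x), len(aug_x))
```

In this sketch the filter runs before augmentation so that noisy samples are not multiplied by the transformations. Whether the article applies the two components in that order is not stated in the abstract.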
