Title: Learning before Learning: Reversing Validation and Training
Authors: S. Simske, M. Vans
DOI: 10.1145/3103010.3121044
Published in: Proceedings of the 2017 ACM Symposium on Document Engineering
Publication date: 2017-08-31
Citations: 0
Abstract
In the world of ground truthing -- that is, the collection of highly valuable labeled training and validation data -- there is a tendency to follow the path of first training on a set of data, then validating, and then testing. However, in many cases the labeled training data is of non-uniform quality, and thus of non-uniform value for assessing the accuracy and other performance indicators of analytics algorithms, systems, and processes. This means that one or more of the so-labeled classes is likely a mixture of two or more clusters or sub-classes. Such data may inhibit our ability to assess which classifier to use for deployment. We argue that one must learn about the labeled data before it can be used for downstream machine learning; that is, we reverse the validation and training steps in building the classifier. This "learning before learning" is assessed using a CNN corpus (cnn.com) that was hand-labeled as comprising 12 classes. We show how suspect classes are identified during the initial validation, and how training proceeds after validation. We then apply this process to the CNN corpus and show that it consists of nine high-quality classes and three mixed-quality classes. The effects of this validation-training approach are then shown and discussed.
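The core idea -- validating the labeled data before training on it, by checking whether each hand-labeled class is really one cluster or a mixture of sub-classes -- can be sketched as follows. This is a minimal illustration, not the authors' exact procedure: the use of k-means with a silhouette-score criterion, the `flag_mixed_classes` function name, and the 0.5 threshold are all assumptions introduced here for the example.

```python
# Hypothetical sketch of "learning before learning": before building a
# classifier, validate each labeled class by testing whether a forced
# two-way split of its samples separates cleanly. A clean split suggests
# the class is a mixture of two or more sub-classes (a "suspect" class).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def flag_mixed_classes(X, y, threshold=0.5, seed=0):
    """Return the set of labels whose samples split cleanly into two
    clusters, suggesting the hand-labeled class mixes sub-classes.
    (Illustrative helper; threshold and method are assumptions.)"""
    suspect = set()
    for label in np.unique(y):
        Xc = X[y == label]
        if len(Xc) < 4:  # too few samples to test for sub-structure
            continue
        parts = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(Xc)
        # A high silhouette for a forced 2-way split indicates two real
        # sub-populations hiding under a single label.
        if silhouette_score(Xc, parts) > threshold:
            suspect.add(label)
    return suspect

# Synthetic demo: one genuinely uniform class, one class that secretly
# mixes two well-separated sub-populations under a single label.
rng = np.random.default_rng(0)
pure = rng.normal(0, 1, size=(100, 2))               # one tight cluster
mixed = np.vstack([rng.normal(-5, 1, size=(50, 2)),  # two clusters,
                   rng.normal(+5, 1, size=(50, 2))])  # one label
X = np.vstack([pure, mixed])
y = np.array(["pure"] * 100 + ["mixed"] * 100)
```

Classes flagged this way would then be re-examined (relabeled or split) before the usual training step, which is the reversal of validation and training the abstract describes.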