对大型未标记数据的文本分类任务的准确度估计

Proceedings of the 19th ACM international conference on Information and knowledge management Pub Date : 2010-10-26 DOI:10.1145/1871437.1871551

Snigdha Chaturvedi, T. Faruquie, L. V. Subramaniam, M. Mohania

{"title":"对大型未标记数据的文本分类任务的准确度估计","authors":"Snigdha Chaturvedi, T. Faruquie, L. V. Subramaniam, M. Mohania","doi":"10.1145/1871437.1871551","DOIUrl":null,"url":null,"abstract":"Rule based systems for processing text data encode the knowledge of a human expert into a rule base to take decisions based on interactions of the input data and the rule base. Similarly, supervised learning based systems can learn patterns present in a given dataset to make decisions on similar and other related data. Performances of both these classes of models are largely dependent on the training examples seen by them, based on which the learning was performed. Even though trained models might fit well on training data, the accuracies they yield on a new test data may be considerably different. Computing the accuracy of the learnt models on new unlabeled datasets is a challenging problem requiring costly labeling, and which is still likely to only cover a subset of the new data because of the large sizes of datasets involved. In this paper, we present a method to estimate the accuracy of a given model on a new dataset without manually labeling the data. We verify our method on large datasets for two shallow text processing tasks: document classification and postal address segmentation, and using both supervised machine learning methods and human generated rule based models.","PeriodicalId":310611,"journal":{"name":"Proceedings of the 19th ACM international conference on Information and knowledge management","volume":"80 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2010-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Estimating accuracy for text classification tasks on large unlabeled data\",\"authors\":\"Snigdha Chaturvedi, T. Faruquie, L. V. Subramaniam, M. Mohania\",\"doi\":\"10.1145/1871437.1871551\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Rule based systems for processing text data encode the knowledge of a human expert into a rule base to take decisions based on interactions of the input data and the rule base. Similarly, supervised learning based systems can learn patterns present in a given dataset to make decisions on similar and other related data. Performances of both these classes of models are largely dependent on the training examples seen by them, based on which the learning was performed. Even though trained models might fit well on training data, the accuracies they yield on a new test data may be considerably different. Computing the accuracy of the learnt models on new unlabeled datasets is a challenging problem requiring costly labeling, and which is still likely to only cover a subset of the new data because of the large sizes of datasets involved. In this paper, we present a method to estimate the accuracy of a given model on a new dataset without manually labeling the data. We verify our method on large datasets for two shallow text processing tasks: document classification and postal address segmentation, and using both supervised machine learning methods and human generated rule based models.\",\"PeriodicalId\":310611,\"journal\":{\"name\":\"Proceedings of the 19th ACM international conference on Information and knowledge management\",\"volume\":\"80 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2010-10-26\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 19th ACM international conference on Information and knowledge management\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/1871437.1871551\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 19th ACM international conference on Information and knowledge management","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/1871437.1871551","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 1

摘要

用于处理文本数据的基于规则的系统将人类专家的知识编码到规则库中，以便根据输入数据和规则库的交互做出决策。类似地，基于监督学习的系统可以学习给定数据集中存在的模式，从而对类似和其他相关数据做出决策。这两类模型的性能在很大程度上取决于它们所看到的训练样例，学习是基于这些训练样例进行的。即使训练过的模型可以很好地适应训练数据，但它们在新的测试数据上产生的准确性可能会有很大的不同。在新的未标记数据集上计算学习模型的准确性是一个具有挑战性的问题，需要昂贵的标记，并且由于涉及的数据集规模很大，它仍然可能只覆盖新数据的一个子集。在本文中，我们提出了一种无需手动标记数据的方法来估计给定模型在新数据集上的准确性。我们在大型数据集上验证了我们的方法，用于两个浅层文本处理任务:文档分类和邮政地址分割，并使用监督机器学习方法和人类生成的基于规则的模型。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Estimating accuracy for text classification tasks on large unlabeled data

Rule based systems for processing text data encode the knowledge of a human expert into a rule base to take decisions based on interactions of the input data and the rule base. Similarly, supervised learning based systems can learn patterns present in a given dataset to make decisions on similar and other related data. Performances of both these classes of models are largely dependent on the training examples seen by them, based on which the learning was performed. Even though trained models might fit well on training data, the accuracies they yield on a new test data may be considerably different. Computing the accuracy of the learnt models on new unlabeled datasets is a challenging problem requiring costly labeling, and which is still likely to only cover a subset of the new data because of the large sizes of datasets involved. In this paper, we present a method to estimate the accuracy of a given model on a new dataset without manually labeling the data. We verify our method on large datasets for two shallow text processing tasks: document classification and postal address segmentation, and using both supervised machine learning methods and human generated rule based models.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 19th ACM international conference on Information and knowledge management

自引率

0.00%

发文量