对大型未标记数据的文本分类任务的准确度估计

Snigdha Chaturvedi, T. Faruquie, L. V. Subramaniam, M. Mohania
{"title":"对大型未标记数据的文本分类任务的准确度估计","authors":"Snigdha Chaturvedi, T. Faruquie, L. V. Subramaniam, M. Mohania","doi":"10.1145/1871437.1871551","DOIUrl":null,"url":null,"abstract":"Rule based systems for processing text data encode the knowledge of a human expert into a rule base to take decisions based on interactions of the input data and the rule base. Similarly, supervised learning based systems can learn patterns present in a given dataset to make decisions on similar and other related data. Performances of both these classes of models are largely dependent on the training examples seen by them, based on which the learning was performed. Even though trained models might fit well on training data, the accuracies they yield on a new test data may be considerably different. Computing the accuracy of the learnt models on new unlabeled datasets is a challenging problem requiring costly labeling, and which is still likely to only cover a subset of the new data because of the large sizes of datasets involved. In this paper, we present a method to estimate the accuracy of a given model on a new dataset without manually labeling the data. We verify our method on large datasets for two shallow text processing tasks: document classification and postal address segmentation, and using both supervised machine learning methods and human generated rule based models.","PeriodicalId":310611,"journal":{"name":"Proceedings of the 19th ACM international conference on Information and knowledge management","volume":"80 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2010-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Estimating accuracy for text classification tasks on large unlabeled data\",\"authors\":\"Snigdha Chaturvedi, T. Faruquie, L. V. Subramaniam, M. Mohania\",\"doi\":\"10.1145/1871437.1871551\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Rule based systems for processing text data encode the knowledge of a human expert into a rule base to take decisions based on interactions of the input data and the rule base. Similarly, supervised learning based systems can learn patterns present in a given dataset to make decisions on similar and other related data. Performances of both these classes of models are largely dependent on the training examples seen by them, based on which the learning was performed. Even though trained models might fit well on training data, the accuracies they yield on a new test data may be considerably different. Computing the accuracy of the learnt models on new unlabeled datasets is a challenging problem requiring costly labeling, and which is still likely to only cover a subset of the new data because of the large sizes of datasets involved. In this paper, we present a method to estimate the accuracy of a given model on a new dataset without manually labeling the data. We verify our method on large datasets for two shallow text processing tasks: document classification and postal address segmentation, and using both supervised machine learning methods and human generated rule based models.\",\"PeriodicalId\":310611,\"journal\":{\"name\":\"Proceedings of the 19th ACM international conference on Information and knowledge management\",\"volume\":\"80 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2010-10-26\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 19th ACM international conference on Information and knowledge management\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/1871437.1871551\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 19th ACM international conference on Information and knowledge management","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/1871437.1871551","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

摘要

用于处理文本数据的基于规则的系统将人类专家的知识编码到规则库中,以便根据输入数据和规则库的交互做出决策。类似地,基于监督学习的系统可以学习给定数据集中存在的模式,从而对类似和其他相关数据做出决策。这两类模型的性能在很大程度上取决于它们所看到的训练样例,学习是基于这些训练样例进行的。即使训练过的模型可以很好地适应训练数据,但它们在新的测试数据上产生的准确性可能会有很大的不同。在新的未标记数据集上计算学习模型的准确性是一个具有挑战性的问题,需要昂贵的标记,并且由于涉及的数据集规模很大,它仍然可能只覆盖新数据的一个子集。在本文中,我们提出了一种无需手动标记数据的方法来估计给定模型在新数据集上的准确性。我们在大型数据集上验证了我们的方法,用于两个浅层文本处理任务:文档分类和邮政地址分割,并使用监督机器学习方法和人类生成的基于规则的模型。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Estimating accuracy for text classification tasks on large unlabeled data
Rule based systems for processing text data encode the knowledge of a human expert into a rule base to take decisions based on interactions of the input data and the rule base. Similarly, supervised learning based systems can learn patterns present in a given dataset to make decisions on similar and other related data. Performances of both these classes of models are largely dependent on the training examples seen by them, based on which the learning was performed. Even though trained models might fit well on training data, the accuracies they yield on a new test data may be considerably different. Computing the accuracy of the learnt models on new unlabeled datasets is a challenging problem requiring costly labeling, and which is still likely to only cover a subset of the new data because of the large sizes of datasets involved. In this paper, we present a method to estimate the accuracy of a given model on a new dataset without manually labeling the data. We verify our method on large datasets for two shallow text processing tasks: document classification and postal address segmentation, and using both supervised machine learning methods and human generated rule based models.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信