Characteristics and Uses of Labeled Datasets - ODP Case Study

2010 Sixth International Conference on Semantics, Knowledge and Grids. Pub Date: 2010-11-01. DOI: 10.1109/SKG.2010.84

Dengya Zhu, H. Dreher


Citations: 4

Abstract

Labeled datasets are essential for text categorization. They are used to train a classifier, or as a benchmark collection to evaluate categorization algorithms. However, labeling a large-scale document set is extremely expensive because it involves much human labour, and the labeling process itself is subjective rather than objective. Therefore, in some existing labeled document sets, labels assigned to documents by only one human editor may be of limited use and may prove problematic for training a classifier or evaluating categorization algorithms. This research explores a socially constructed Web directory, the Open Directory Project (ODP), to generate a series of labeled document sets by extracting semantic characteristics from the ODP categories, each of which is annotated by a list of indexed Websites. The generated document sets are used to classify Web search results, and the results are encouraging.
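
To make the idea in the abstract concrete, the following is a minimal sketch of how category-level labeled documents might be assembled from the descriptions of Websites indexed under each directory category, and then used to label a search-result snippet. It is not the authors' method: the toy categories and site descriptions, the tokenizer, the term-frequency profiles, and the cosine-similarity rule (names such as CATEGORY_SITES, build_profiles, and classify) are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical toy data (not taken from ODP): category name -> descriptions
# of the Websites indexed under that category.
CATEGORY_SITES = {
    "Computers": [
        "Open source software downloads and programming tutorials",
        "News about computer hardware, software and programming languages",
    ],
    "Sports": [
        "Football scores, league tables and match reports",
        "Tennis tournament results and player rankings",
    ],
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_profiles(category_sites):
    """Merge each category's site descriptions into one term-frequency vector."""
    return {
        category: Counter(tok for desc in descriptions for tok in tokenize(desc))
        for category, descriptions in category_sites.items()
    }


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(snippet, profiles):
    """Return the category whose profile is most similar to the snippet."""
    query = Counter(tokenize(snippet))
    return max(profiles, key=lambda category: cosine(query, profiles[category]))


profiles = build_profiles(CATEGORY_SITES)
print(classify("Python programming tutorials for beginners", profiles))  # prints "Computers"
```

In this sketch each category acts as one labeled document built from many editors' site descriptions, which mirrors the motivation stated in the abstract; a real system would need far larger profiles and a more careful weighting scheme than raw term counts.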
