An experiment in automatic indexing using the HASSET thesaurus

2013 5th Computer Science and Electronic Engineering Conference (CEEC) Pub Date : 2013-11-11 DOI:10.1109/CEEC.2013.6659437

Mahmoud El-Haj, Lorna Balkan, Suzanne Barbalet, L. Bell, J. Shepherdson


Citations: 11

Abstract


An experiment in automatic indexing using the HASSET thesaurus

In this paper we present the tools, techniques and evaluation results of an automatic indexing experiment we conducted on the UK Data Archive/UK Data Service data-related document collection, as part of the Jisc-funded SKOS-HASSET project. We examined the quality of an automatic indexer based on a controlled vocabulary called the Humanities and Social Science Electronic Thesaurus (HASSET). We used the Keyphrase Extraction Algorithm (KEA), a text mining and machine learning tool. KEA builds a classifier model from training documents with known keywords, and the model is then applied to help assign keywords to new documents. We performed extensive manual and automatic evaluation of the results using recall, precision and F1 scores. The quality of the KEA indexing was measured a) automatically, by the degree of overlap between the automated indexing decisions and those originally made by the human indexer, and b) manually, by comparing KEA's output with the source text. This paper explains how and why we applied the chosen technical solutions, and how we intend to take forward the lessons learned from this work.
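The automatic evaluation described in the abstract, overlap between the indexer's keywords and the human indexer's keywords, can be illustrated with a minimal Python sketch. This is not the authors' evaluation code: the function name `overlap_scores`, the case-insensitive exact matching, and the example terms are all assumptions made here for illustration, and the paper may have matched terms differently.

```python
def overlap_scores(predicted, gold):
    """Return (precision, recall, F1) for one document's keyword sets.

    Assumption: two keywords match only if they are identical after
    lowercasing and trimming whitespace.
    """
    pred = {k.strip().lower() for k in predicted}
    ref = {k.strip().lower() for k in gold}
    hits = len(pred & ref)  # keywords chosen by both indexers
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(ref) if ref else 0.0
    # F1 is the harmonic mean of precision and recall; zero if no overlap.
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical keyword lists, not taken from the paper's data.
auto_terms = ["employment", "housing", "income", "health"]
human_terms = ["Employment", "housing", "education"]

p, r, f = overlap_scores(auto_terms, human_terms)
print(f"precision={p:.2f} recall={r:.2f} F1={f:.2f}")
# prints: precision=0.50 recall=0.67 F1=0.57
```

Two of the four automatic terms appear in the human list, so precision is 2/4, recall is 2/3, and F1 is 4/7. Averaging these per-document scores over a collection would give corpus-level figures, though whether the paper averaged per document or pooled all decisions is not stated in the abstract.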
