Named Entity Recognition for Partially Annotated Datasets

International Conference on Applications of Natural Language to Data Bases. Pub Date: 2022-04-19. DOI: 10.48550/arXiv.2204.09081

Michael Strobl, Amine Trabelsi, Osmar R Zaiane


Citations: 0

Abstract

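The most common named entity recognizers are sequence taggers trained on fully annotated corpora, i.e. corpora in which the class of every word of every entity is known. Partially annotated corpora, in which some but not all entities of some types are annotated, are too noisy for training sequence taggers: the same entity may be annotated with its true type in one place and left unannotated in another, which misleads the tagger. We therefore compare three training strategies for partially annotated datasets, together with an approach for deriving datasets for new entity classes from Wikipedia without time-consuming manual annotation. To verify that our data acquisition and training approaches are plausible, we manually annotated test datasets for two new classes, namely food and drugs.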
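The noise problem described in the abstract can be illustrated with a small sketch. The abstract does not name the three training strategies it compares, so the code below is not the authors' method. It is an assumed, generic mitigation (skipping the loss on unannotated tokens), and the sentence, the `B-DRUG` tag, and the probabilities are all hypothetical.

```python
import math

# Hypothetical toy data: "aspirin" occurs twice, but only the first mention
# was annotated. None marks "annotation unknown", not "outside any entity".
tokens = ["Aspirin", "helps", ".", "Take", "aspirin", "daily", "."]
partial = ["B-DRUG", None, None, None, None, None, None]


def as_fully_annotated(labels):
    """Naive reading: every unannotated token is treated as outside ("O")."""
    return [lab if lab is not None else "O" for lab in labels]


def conflicting_words(tokens, labels):
    """Return lower-cased words that receive more than one distinct label."""
    seen = {}
    for tok, lab in zip(tokens, labels):
        seen.setdefault(tok.lower(), set()).add(lab)
    return {word: labs for word, labs in seen.items() if len(labs) > 1}


def mean_nll(probs, labels):
    """Mean negative log-likelihood over tokens whose label is not None.

    probs holds one dict per token mapping label -> predicted probability.
    Tokens labelled None are skipped, so they contribute nothing to the loss.
    """
    terms = [-math.log(p[lab]) for p, lab in zip(probs, labels) if lab is not None]
    return sum(terms) / len(terms) if terms else 0.0


# Assumed model output: both "aspirin" mentions are predicted as drugs.
drug = {"B-DRUG": 0.9, "O": 0.1}
other = {"B-DRUG": 0.1, "O": 0.9}
probs = [drug if tok.lower() == "aspirin" else other for tok in tokens]

naive = as_fully_annotated(partial)
conflicts = conflicting_words(tokens, naive)  # "aspirin" gets two labels
naive_loss = mean_nll(probs, naive)           # punishes the correct second mention
masked_loss = mean_nll(probs, partial)        # scores only the annotated token
```

Under the naive reading, the second "aspirin" is labelled `O`, so a model that correctly tags it as a drug is penalized. Masking avoids that penalty, at the cost of giving no training signal for tokens that really are outside any entity.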




Source journal

International Conference on Applications of Natural Language to Data Bases

Self-citation rate

0.00%

Publication count