A deeply annotated testbed for geographical text analysis: The Corpus of Lake District Writing

Proceedings of the 1st ACM SIGSPATIAL Workshop on Geospatial Humanities (Redondo Beach, Calif., 2017). Pub Date: 2017-11-07. DOI: 10.1145/3149858.3149865

Paul Rayson, Alexander Reinhold, J. Butler, Christopher Donaldson, I. Gregory, Joanna E. Taylor


Citations: 18

Abstract



This paper describes the development of an annotated corpus which forms a challenging testbed for geographical text analysis methods. This dataset, the Corpus of Lake District Writing (CLDW), consists of 80 manually digitised and annotated texts (comprising over 1.5 million word tokens). These texts were originally composed between 1622 and 1900, and they represent a range of different genres and authors. Collectively, the texts in the CLDW constitute an indicative sample of writing about the English Lake District between the early seventeenth and the early twentieth centuries. The corpus is annotated more deeply than is currently possible with vanilla Named Entity Recognition, Disambiguation and geoparsing. This is especially true of the geographical information the corpus contains, since we have undertaken not only to link different historical and spelling variants of place-names, but also to identify and to differentiate geographical features such as waterfalls, woodlands, farms or inns. In addition, we illustrate the potential of the corpus as a gold standard by evaluating the results of three different NLP libraries and geoparsers on its contents. In the evaluation, the standard NER processing of the text by the different NLP libraries produces many false positive and false negative results, which demonstrates the value of the gold standard.
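The comparison against a gold standard described above can be illustrated with a small sketch. The abstract does not say which metrics or matching criteria the authors used, so exact-match scoring is an assumption here, and the `Span` type, the `score` function and the sample offsets are all hypothetical, invented for illustration only.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """A place-name mention: character offsets plus a label (hypothetical schema)."""
    start: int
    end: int
    label: str


def score(gold: set, predicted: set) -> dict:
    """Exact-match precision, recall and F1 of predicted spans against gold spans."""
    true_pos = len(gold & predicted)
    false_pos = len(predicted - gold)  # tagged by the tool, absent from the gold standard
    false_neg = len(gold - predicted)  # present in the gold standard, missed by the tool
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {
        "tp": true_pos,
        "fp": false_pos,
        "fn": false_neg,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Invented sample data: three gold mentions, three tool predictions, two in common.
gold = {Span(0, 9, "PLACE"), Span(20, 31, "PLACE"), Span(40, 48, "PLACE")}
predicted = {Span(0, 9, "PLACE"), Span(20, 31, "PLACE"), Span(60, 66, "PLACE")}

result = score(gold, predicted)
print(result)
```

Exact matching is the strictest choice: a prediction that overlaps a gold mention but differs in its boundaries or label counts as both a false positive and a false negative. A more lenient evaluation could credit partial overlaps instead.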
