Paul Rayson, Alexander Reinhold, J. Butler, Christopher Donaldson, I. Gregory, Joanna E. Taylor
Proceedings of the 1st ACM SIGSPATIAL Workshop on Geospatial Humanities (2017, Redondo Beach, Calif.)
Published 2017-11-07. DOI: 10.1145/3149858.3149865 (https://doi.org/10.1145/3149858.3149865)
A deeply annotated testbed for geographical text analysis: The Corpus of Lake District Writing
This paper describes the development of an annotated corpus which forms a challenging testbed for geographical text analysis methods. This dataset, the Corpus of Lake District Writing (CLDW), consists of 80 manually digitised and annotated texts (comprising over 1.5 million word tokens). These texts were originally composed between 1622 and 1900, and they represent a range of different genres and authors. Collectively, the texts in the CLDW constitute an indicative sample of writing about the English Lake District between the early seventeenth century and the early twentieth century. The corpus is annotated more deeply than is currently possible with vanilla Named Entity Recognition, Disambiguation and geoparsing. This is especially true of the geographical information the corpus contains, since we have undertaken not only to link different historical and spelling variants of place-names, but also to identify and to differentiate geographical features such as waterfalls, woodlands, farms or inns. In addition, we illustrate the potential of the corpus as a gold standard by evaluating the results of three different NLP libraries and geoparsers on its contents. In the evaluation, the standard NER processing of the text by the different NLP libraries produces many false positive and false negative results, showing the strength of the gold standard.
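To make the false positive / false negative comparison concrete, the following is a minimal sketch of how an NER system's place-name output might be scored against gold-standard annotations. It is not the paper's evaluation procedure, which the abstract does not specify. It assumes exact-match scoring on character spans, and the names `Span`, `score`, and the `PLACE` label are hypothetical, introduced only for illustration.

```python
from typing import Dict, Iterable, NamedTuple


class Span(NamedTuple):
    """A hypothetical annotation: character offsets plus an entity label."""
    start: int
    end: int
    label: str


def score(gold: Iterable[Span], predicted: Iterable[Span]) -> Dict[str, float]:
    """Compare predicted spans with gold spans using exact matching.

    Assumption: a prediction counts as correct only if its start, end and
    label all equal those of a gold annotation.
    """
    gold_set, pred_set = set(gold), set(predicted)
    tp = len(gold_set & pred_set)   # found by the system and in the gold standard
    fp = len(pred_set - gold_set)   # found by the system but not in the gold standard
    fn = len(gold_set - pred_set)   # in the gold standard but missed by the system
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


# Invented toy data, not taken from the CLDW.
gold = [Span(0, 10, "PLACE"), Span(20, 28, "PLACE"), Span(40, 52, "PLACE")]
predicted = [Span(0, 10, "PLACE"), Span(30, 35, "PLACE")]
result = score(gold, predicted)
print(result)
```

Exact matching is the strictest choice. A more lenient scheme could give credit for overlapping spans, which may matter for historical spelling variants whose boundaries a tagger only partly recovers.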