Comparing Topic Modeling and Named Entity Recognition Techniques for the Semantic Indexing of a Landscape Architecture Textbook

2019 Systems and Information Engineering Design Symposium (SIEDS) Pub Date : 2019-04-01 DOI:10.1109/SIEDS.2019.8735642

K. Dawar, Ashwanth J. Samuel, Raf Alvarado


Citations: 3

Abstract

The task of manually annotating text is often tedious and error-prone. Landscape history needs to be digitized because no scalable relational database of refined texts exists, which limits the pedagogical reach of this rich field. The data for this study is a comprehensive 544-page textbook, “Landscape Design: A History of Landscape Architecture,” by Elizabeth Rogers. The Landscape Studies Initiative and the Data Science Institute at the University of Virginia have partnered to build a SQL-backed Flask application that will assist in the deep annotation of scholarly texts. Our goal was to use machine learning techniques, specifically named entity recognition (NER) models and topic models (TM), not only to streamline the annotation process but also to provide a fresh perspective on the text through a new index. In this paper, we examine the training system, design, and architecture of several NER models, including Python's spaCy, Stanford's Named Entity Recognizer, and IBM Bluemix's Natural Language Understanding tool, and compare their accuracies. We also explore topic modeling with different tools and techniques, such as the Python library Gensim and the Mallet toolkit, and compare the relevance of the resulting models to our dataset. These techniques can be highly influential in the humanities, but their impact is severely limited by the availability, size, and domain of the training dataset. As a result, entity recognition and topic modeling are far from solved tasks, and we address some of the fundamental challenges that can prevent these systems from being robust and accurate.
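The abstract does not say how the accuracies of the NER tools were measured. As a minimal sketch of one common approach, the snippet below scores each tool's predicted entities against a hand-annotated gold set using exact-match precision, recall, and F1. The `Entity` type, the `score` function, the tool names, and the sample entities are all hypothetical and are not taken from the paper.

```python
from __future__ import annotations

from typing import NamedTuple


class Entity(NamedTuple):
    """A recognized entity: surface text plus its predicted label."""
    text: str
    label: str


def score(predicted: set[Entity], gold: set[Entity]) -> dict[str, float]:
    """Exact-match precision, recall, and F1 of predicted entities vs. gold."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical gold annotations for a short passage (illustrative only).
gold = {
    Entity("Versailles", "LOC"),
    Entity("Lancelot Brown", "PERSON"),
    Entity("Central Park", "LOC"),
    Entity("Frederick Law Olmsted", "PERSON"),
}

# Hypothetical outputs from two NER tools on the same passage.
tool_outputs = {
    "tool_a": {
        Entity("Versailles", "LOC"),
        Entity("Lancelot Brown", "PERSON"),
        Entity("Central Park", "ORG"),  # right span, wrong label
        Entity("Frederick Law Olmsted", "PERSON"),
    },
    "tool_b": {
        Entity("Versailles", "LOC"),
        Entity("Central Park", "LOC"),
    },
}

results = {name: score(pred, gold) for name, pred in tool_outputs.items()}

for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

Exact matching is strict: an entity with the right span but the wrong label counts as both a false positive and a false negative. A domain-specific text such as a landscape history textbook may call for a more lenient scheme, for example partial span overlap.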

