Extracting Geoscientific Dataset Names from the Literature Based on the Hierarchical Temporal Memory Model

IF 2.8 · Zone 3, Earth Sciences · JCR Q2, COMPUTER SCIENCE, INFORMATION SYSTEMS
Kai Wu, Zugang Chen, Xinqian Wu, Guoqing Li, Jing Li, Shaohua Wang, Haodong Wang, Hang Feng
DOI: 10.3390/ijgi13070260 (https://doi.org/10.3390/ijgi13070260)
Journal: ISPRS International Journal of Geo-Information
Published: 2024-07-21
Citations: 0

Abstract

Extracting geoscientific dataset names from the literature is crucial for building a literature–data association network, which can help readers access the data quickly through the Internet. However, existing named-entity extraction methods have low accuracy when extracting geoscientific dataset names from unstructured text because these names are complex combinations of multiple elements, such as geospatial coverage, temporal coverage, scale or resolution, theme content, and version. This paper proposes a new method based on the hierarchical temporal memory (HTM) model, a brain-inspired neural network with superior performance in high-level cognitive tasks, to accurately extract geoscientific dataset names from unstructured text. First, a word-encoding method based on the Unicode values of characters is proposed for the HTM model. Then, over 12,000 dataset names were collected from geoscience data-sharing websites and encoded into binary vectors to train the HTM model. We conceived a new classifier scheme for the HTM model that decodes the predictive vector into the encoding of the predicted next word, so that the similarity between the encodings of the predicted next word and the actual next word can be computed. If the similarity is greater than a specified threshold, the actual next word is regarded as part of the name, and the resulting run of consecutive words forms the full geoscientific dataset name. We used the trained HTM model to extract geoscientific dataset names from 100 papers. Our method achieved an F1-score of 0.727, outperforming the GPT-4- and Claude-3-based few-shot learning (FSL) methods, which achieved F1-scores of 0.698 and 0.72, respectively.
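The threshold rule described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the encoding, the similarity measure, or the threshold value, so `encode_word`, `overlap_similarity`, `VECTOR_SIZE`, and the default `threshold` of 0.5 are all hypothetical, and a simple bigram lookup (`make_bigram_predictor`) stands in for the trained HTM model. Only the control flow follows the abstract: predict the encoding of the next word, compare it with the encoding of the actual next word, and keep extending the name while the similarity exceeds the threshold.

```python
from typing import Callable, Dict, List, Set

VECTOR_SIZE = 2048  # hypothetical width of the binary vector


def encode_word(word: str, size: int = VECTOR_SIZE) -> Set[int]:
    """Toy encoder: return the indices of the active bits for a word.

    Each character activates one bit derived from its Unicode code point
    and its position. This is an illustrative assumption, not the encoding
    defined in the paper.
    """
    return {(ord(ch) * 31 + pos * 7) % size for pos, ch in enumerate(word.lower())}


def overlap_similarity(a: Set[int], b: Set[int]) -> float:
    """Shared active bits divided by the size of the smaller vector."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def make_bigram_predictor(names: List[str]) -> Callable[[List[str]], Set[int]]:
    """Stand-in for the trained HTM model: remembers which words followed
    each word in the training names and predicts the union of their bits."""
    followers: Dict[str, Set[int]] = {}
    for name in names:
        tokens = name.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            followers.setdefault(prev, set()).update(encode_word(nxt))

    def predict(prefix: List[str]) -> Set[int]:
        return followers.get(prefix[-1].lower(), set())

    return predict


def extend_name(
    words: List[str],
    start: int,
    predict_next: Callable[[List[str]], Set[int]],
    threshold: float = 0.5,  # hypothetical value
) -> List[str]:
    """Grow a candidate name from words[start] while the predicted and
    actual next-word encodings are more similar than the threshold."""
    name = [words[start]]
    for word in words[start + 1:]:
        predicted = predict_next(name)
        if overlap_similarity(predicted, encode_word(word)) <= threshold:
            break
        name.append(word)
    return name


# Usage with two made-up training names and one made-up sentence.
training_names = ["Global Land Cover Dataset", "China Soil Moisture Dataset"]
predict = make_bigram_predictor(training_names)
sentence = "We used the Global Land Cover Dataset for validation".split()
result = extend_name(sentence, 3, predict)
print(" ".join(result))
```

In this sketch the extension stops at "for" because the stand-in predictor has never seen a word follow "Dataset", so the predicted vector is empty and the similarity falls to zero. How the real method chooses the starting word of a candidate name is not described in the abstract and is left out here.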
Source journal
ISPRS International Journal of Geo-Information
Categories: GEOGRAPHY, PHYSICAL; REMOTE SENSING
CiteScore: 6.90
Self-citation rate: 11.80%
Articles published: 520
Review time: 19.87 days
Journal description: ISPRS International Journal of Geo-Information (ISSN 2220-9964) provides an advanced forum for the science and technology of geographic information. ISPRS International Journal of Geo-Information publishes regular research papers, reviews and communications. Our aim is to encourage scientists to publish their experimental and theoretical results in as much detail as possible. There is no restriction on the length of the papers. The full experimental details must be provided so that the results can be reproduced. The 2018 IJGI Outstanding Reviewer Award has been launched! This award acknowledges those who have generously dedicated their time to review manuscripts submitted to IJGI. See full details at http://www.mdpi.com/journal/ijgi/awards.