A Pattern and POS Auto-Learning Method for Terminology Extraction from Scientific Text

Wei Shao, Bolin Hua, Linqi Song
Data and Information Management, Volume 5, Issue 3, Pages 329-335. Published 2021-07-01. DOI: 10.2478/dim-2021-0005
Full text: https://www.sciencedirect.com/science/article/pii/S2543925122000031
Citations: 6

Abstract

Large numbers of new scientific documents are published on various platforms every day, making it increasingly important to discover new words and meanings in these documents quickly and efficiently. However, most related work relies on labeled data and therefore struggles to handle new, unlabeled documents efficiently. To address this, we introduce an unsupervised method based on sentence patterns and part-of-speech (POS) sequences. The method needs only a few initial learnable patterns to obtain the initial terminology tokens and their POS sequences. During this process, new patterns are constructed that match more sentences, which in turn yield more POS sequences of terminology. Finally, we use the obtained POS sequences and sentence patterns to extract terms from new scientific text. Experiments on paper abstracts from Web of Knowledge show that the method is practical and performs well on our test data.
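
The loop described in the abstract (seed patterns yield terms, terms yield POS sequences, POS sequences yield new patterns) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the toy sentences, the hand-assigned Penn Treebank style tags, the seed triggers, the `TERM_TAGS` set, and the rule that only a verb to the left of a match becomes a new pattern are all assumptions made for illustration.

```python
# Toy corpus (hypothetical): each sentence is a list of (token, POS) pairs, tagged by hand.
SENTENCES = [
    [("We", "PRP"), ("propose", "VBP"), ("a", "DT"), ("method", "NN"), ("called", "VBN"),
     ("conditional", "JJ"), ("random", "JJ"), ("field", "NN"), (".", ".")],
    [("This", "DT"), ("model", "NN"), ("is", "VBZ"), ("known", "VBN"), ("as", "IN"),
     ("support", "NN"), ("vector", "NN"), ("machine", "NN"), (".", ".")],
    [("We", "PRP"), ("evaluate", "VBP"), ("latent", "JJ"), ("semantic", "JJ"),
     ("analysis", "NN"), ("on", "IN"), ("abstracts", "NNS"), (".", ".")],
]

# Assumption: a seed pattern is a trigger phrase that is directly followed by a term.
SEED_PATTERNS = {("called",), ("known", "as")}
# Assumption: a term is a run of adjectives and nouns.
TERM_TAGS = {"JJ", "NN", "NNS"}


def apply_patterns(sentence, patterns):
    """Return (start, end) spans of the term-like run that follows each trigger."""
    tokens = [tok.lower() for tok, _ in sentence]
    spans = []
    for trigger in patterns:
        n = len(trigger)
        for i in range(len(tokens) - n + 1):
            if tuple(tokens[i:i + n]) != trigger:
                continue
            start = end = i + n
            while end < len(sentence) and sentence[end][1] in TERM_TAGS:
                end += 1
            if end > start:
                spans.append((start, end))
    return spans


def find_pos_matches(sentence, pos_seqs):
    """Return (start, end) spans whose tag sequence equals a learned POS sequence."""
    tags = [tag for _, tag in sentence]
    spans = []
    for seq in pos_seqs:
        n = len(seq)
        for i in range(len(tags) - n + 1):
            if tuple(tags[i:i + n]) == seq:
                spans.append((i, i + n))
    return spans


def bootstrap(sentences, seed_patterns, rounds=2):
    """Alternate between learning POS sequences and learning new trigger patterns."""
    patterns, pos_seqs = set(seed_patterns), set()
    for _ in range(rounds):
        # Step 1: patterns -> terms -> POS sequences of those terms.
        for sent in sentences:
            for s, e in apply_patterns(sent, patterns):
                pos_seqs.add(tuple(tag for _, tag in sent[s:e]))
        # Step 2: POS sequences -> new patterns (assumed rule: verb left of the match).
        for sent in sentences:
            for s, _ in find_pos_matches(sent, pos_seqs):
                if s > 0 and sent[s - 1][1].startswith("VB"):
                    patterns.add((sent[s - 1][0].lower(),))
    return patterns, pos_seqs


def extract_terms(sentences, patterns, pos_seqs):
    """Collect every span found by either the patterns or the POS sequences."""
    terms = set()
    for sent in sentences:
        for s, e in apply_patterns(sent, patterns) + find_pos_matches(sent, pos_seqs):
            terms.add(" ".join(tok for tok, _ in sent[s:e]))
    return terms


patterns, pos_seqs = bootstrap(SENTENCES, SEED_PATTERNS)
terms = extract_terms(SENTENCES, patterns, pos_seqs)
print(sorted(terms))
```

In this toy run, the third sentence contains no seed trigger, yet its term is found because its tag sequence was learned from the first sentence, and its left-hand verb then becomes a new pattern. A real pipeline would obtain the tags from a POS tagger and would need some filtering of noisy patterns, neither of which is shown here.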
