A Pattern and POS Auto-Learning Method for Terminology Extraction from Scientific Text
Authors: Wei Shao, Bolin Hua, Linqi Song
Journal: Data and Information Management, 5(3), pages 329-335
Published: 2021-07-01
DOI: 10.2478/dim-2021-0005
URL: https://www.sciencedirect.com/science/article/pii/S2543925122000031
Citations: 6
Abstract
New scientific documents are published on many platforms every day, and it is increasingly important to discover new words and meanings in them quickly and efficiently. However, most related work relies on labeled data, which makes it difficult to handle new, unlabeled documents efficiently. To address this, we introduce an unsupervised method based on sentence patterns and part-of-speech (POS) sequences. The method needs only a few initial learnable patterns to obtain an initial set of terminology tokens and their POS sequences. During this process, new patterns are constructed that match more sentences and so reveal more POS sequences of terminology. Finally, we use the learned POS sequences and sentence patterns to extract terms from new scientific text. Experiments on paper abstracts from Web of Knowledge show that the method is practical and performs well on our test data.
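The bootstrapping loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how patterns are represented, so the sketch assumes a pattern is a short cue phrase that immediately precedes a term, that a term is a run of adjective and noun tags, and that sentences arrive already POS-tagged with Penn Treebank style tags. The names `terms_after_cue`, `learn`, `extract`, and `TERM_TAGS`, and the toy corpus, are hypothetical.

```python
from typing import List, Set, Tuple

Tagged = List[Tuple[str, str]]          # (token, POS tag) pairs, tagged beforehand
TERM_TAGS = {"JJ", "NN", "NNS", "NNP"}  # assumption: tags allowed inside a term


def terms_after_cue(sent: Tagged, cue: Tuple[str, ...]):
    """Return (tokens, pos_sequence) spans that directly follow the cue phrase."""
    words = [w.lower() for w, _ in sent]
    n, found = len(cue), []
    for i in range(len(words) - n + 1):
        if tuple(words[i:i + n]) == cue:
            start = end = i + n
            while end < len(sent) and sent[end][1] in TERM_TAGS:
                end += 1
            if end > start:
                found.append((tuple(words[start:end]),
                              tuple(tag for _, tag in sent[start:end])))
    return found


def learn(corpus: List[Tagged], seed_cues: Set[Tuple[str, ...]],
          rounds: int = 2, cue_len: int = 2):
    """Alternate between harvesting terms from cues and cues from known terms."""
    cues, terms, pos_seqs = set(seed_cues), set(), set()
    for _ in range(rounds):
        # Patterns -> terminology tokens and their POS sequences.
        for sent in corpus:
            for cue in cues:
                for toks, tags in terms_after_cue(sent, cue):
                    terms.add(toks)
                    pos_seqs.add(tags)
        # Known terms -> new patterns taken from the left context.
        for sent in corpus:
            words = [w.lower() for w, _ in sent]
            for term in terms:
                m = len(term)
                for i in range(cue_len, len(words) - m + 1):
                    if tuple(words[i:i + m]) == term:
                        cues.add(tuple(words[i - cue_len:i]))
    return cues, pos_seqs


def extract(sent: Tagged, cues, pos_seqs) -> List[str]:
    """Keep spans that follow a learned cue and have a learned POS sequence."""
    return [" ".join(toks)
            for cue in sorted(cues)
            for toks, tags in terms_after_cue(sent, cue)
            if tags in pos_seqs]


# Toy, hand-tagged corpus (hypothetical data, for illustration only).
corpus = [
    [("This", "DT"), ("model", "NN"), ("is", "VBZ"), ("called", "VBN"),
     ("neural", "JJ"), ("network", "NN"), (".", ".")],
    [("We", "PRP"), ("propose", "VBP"), ("neural", "JJ"), ("network", "NN"),
     ("for", "IN"), ("parsing", "NN"), (".", ".")],
    [("We", "PRP"), ("propose", "VBP"), ("topic", "NN"), ("model", "NN"),
     ("for", "IN"), ("text", "NN"), (".", ".")],
]
cues, pos_seqs = learn(corpus, {("is", "called")})

new_sentence = [("We", "PRP"), ("propose", "VBP"), ("graph", "NN"),
                ("kernel", "NN"), ("for", "IN"), ("classification", "NN"),
                (".", ".")]
print(extract(new_sentence, cues, pos_seqs))
```

In this toy run the single seed cue "is called" yields one term, that term's other occurrence contributes the cue "we propose", and the second round uses the new cue to pick up a further POS sequence. Filtering candidate spans by learned POS sequences is what keeps a cue from accepting arbitrary text that happens to follow it.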