Tagging terms in text


Terminology Pub Date : 2022-01-10 DOI:10.1075/term.21010.rig

Ayla Rigouts Terryn, Veronique Hoste, Els Lefever



Abstract

As with many tasks in natural language processing, automatic term extraction (ATE) is increasingly approached as a machine learning problem. So far, most machine learning approaches to ATE broadly follow the traditional hybrid methodology, first extracting a list of unique candidate terms and then classifying these candidates based on the predicted probability that they are valid terms. However, with the rise of neural networks and word embeddings, the next development in ATE might be towards sequential approaches, i.e., classifying each occurrence of each token within its original context. To test the validity of such approaches for ATE, two sequential methodologies were developed, evaluated, and compared: one feature-based conditional random fields classifier and one embedding-based recurrent neural network. An additional comparison was added with a machine learning interpretation of the traditional approach. All systems were trained and evaluated on identical data in multiple languages and domains to identify their respective strengths and weaknesses. The sequential methodologies proved to be valid approaches to ATE, and the neural network even outperformed the more traditional approach. Interestingly, a combination of multiple approaches can outperform all of them separately, showing new ways to push the state of the art in ATE.
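The difference between the two output formats described in the abstract can be illustrated with a small sketch. It is not the authors' CRF or neural system: a plain dictionary lookup stands in for a trained classifier, and the B/I/O label scheme, the function names, and the example sentence are all assumptions made for illustration. The sketch labels every token occurrence in context (the sequential view) and then collapses those labels into a list of unique terms (the candidate-list view).

```python
from typing import List, Set, Tuple


def tag_tokens(tokens: List[str], known_terms: Set[Tuple[str, ...]]) -> List[str]:
    """Label each token occurrence as B (term start), I (inside term) or O (outside).

    A dictionary lookup is used here as a hypothetical stand-in for a trained
    sequence classifier; the longest matching term at each position wins.
    """
    labels = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    max_len = max((len(term) for term in known_terms), default=0)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest possible span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(lowered[i:i + n]) in known_terms:
                matched = n
                break
        if matched:
            labels[i] = "B"
            for j in range(i + 1, i + matched):
                labels[j] = "I"
            i += matched
        else:
            i += 1
    return labels


def labels_to_terms(tokens: List[str], labels: List[str]) -> Set[str]:
    """Collapse per-token labels into a set of unique (lowercased) terms."""
    terms: Set[str] = set()
    current: List[str] = []
    for token, label in zip(tokens, labels):
        if label == "B":
            if current:
                terms.add(" ".join(current))
            current = [token.lower()]
        elif label == "I" and current:
            current.append(token.lower())
        else:
            if current:
                terms.add(" ".join(current))
            current = []
    if current:
        terms.add(" ".join(current))
    return terms


# Illustrative data only (not taken from the paper's corpora).
tokens = "Automatic term extraction is a task . Term extraction is hard .".split()
known_terms = {("automatic", "term", "extraction"), ("term", "extraction")}

labels = tag_tokens(tokens, known_terms)
for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
print(sorted(labels_to_terms(tokens, labels)))
```

In the sequential view every occurrence receives its own label, so the same string can be tagged differently in different contexts; the candidate-list view keeps only one entry per unique term.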



About the journal: Terminology is an independent journal with a cross-cultural and cross-disciplinary scope. It focusses on the discussion of (systematic) solutions not only of language problems encountered in translation, but also, for example, of (monolingual) problems of ambiguity, reference and developments in multidisciplinary communication. Particular attention will be given to new and developing subject areas such as knowledge representation and transfer, information technology tools, expert systems and terminological databases. Terminology encompasses terminology both in general (theory and practice) and in specialized fields (LSP), such as physics.