Recognition of irrelevant phrases in automatically extracted lists of domain terms

A. Mykowiecka, M. Marciniak, P. Rychlik
Journal: Computational terminology and filtering of terminological information
Published: 2018-05-31
DOI: 10.1075/TERM.00014.MYK (https://doi.org/10.1075/TERM.00014.MYK)
Citations: 6

Abstract

In this paper, we address the problem of recognizing irrelevant phrases in terminology lists produced by an automatic term extraction tool. We focus on identifying multi-word phrases that are general terms or discourse expressions. We define several methods based on comparing domain corpora and one method based on the contexts in which phrases occur in a large general-language corpus. The methods were tested on Polish data, using six domain corpora and one general corpus. Two test sets were prepared for evaluation. The first contained many presumably irrelevant phrases, since we selected phrases that occurred in at least three domain corpora. The second consisted mainly of domain terms, since it was composed of the top-ranked phrases automatically extracted from the analyzed domain corpora. The results show that the task is quite hard, as inter-annotator agreement is low. Several of the tested methods achieved similar overall results, although the phrase ordering varied between methods. The most successful method, with a precision of about 0.75 on half of the tested list, was the context-based method using a modified contextual diversity coefficient. Although the methods were tested on Polish, they seem to be language independent.
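
The selection rule for the first test set and the reported evaluation figure can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the toy English phrases, and the corpus labels are all hypothetical, and reading "precision on half of the tested list" as precision within the top half of a ranked list is an assumption. The modified contextual diversity coefficient is not defined in the abstract, so it is not reproduced here.

```python
from collections import Counter


def phrases_in_min_corpora(corpora, min_corpora=3):
    """Return phrases that occur in at least `min_corpora` domain corpora.

    `corpora` maps a corpus name to an iterable of extracted phrases.
    A phrase is counted once per corpus, however often it occurs there.
    """
    counts = Counter()
    for phrases in corpora.values():
        counts.update(set(phrases))
    return {phrase for phrase, n in counts.items() if n >= min_corpora}


def precision_at_fraction(ranked, irrelevant, fraction=0.5):
    """Precision of irrelevant-phrase detection on the top part of a ranking.

    `ranked` lists phrases from most to least likely irrelevant;
    `irrelevant` is the set annotated as irrelevant (the gold standard).
    """
    cutoff = int(len(ranked) * fraction)
    if cutoff == 0:
        raise ValueError("ranked list is too short for this fraction")
    hits = sum(1 for phrase in ranked[:cutoff] if phrase in irrelevant)
    return hits / cutoff


# Hypothetical toy data; the paper itself used six Polish domain corpora.
corpora = {
    "medicine": ["in this case", "blood pressure", "as a result"],
    "law": ["in this case", "civil code", "as a result"],
    "economics": ["in this case", "interest rate", "as a result"],
    "music": ["in this case", "sonata form"],
}
candidates = phrases_in_min_corpora(corpora, min_corpora=3)

ranked = ["in this case", "blood pressure", "as a result", "civil code"]
gold_irrelevant = {"in this case", "as a result"}
score = precision_at_fraction(ranked, gold_irrelevant, fraction=0.5)
```

Counting each phrase once per corpus keeps the threshold about how many domains a phrase spreads across, not how frequent it is in any single domain.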