Supervised learning for robust term extraction

2017 International Conference on Asian Language Processing (IALP). Pub Date: 2017-12-01. DOI: 10.1109/IALP.2017.8300603

Yu Yuan, Jie Gao, Yue Zhang

Citations: 11

Abstract

We propose a machine learning method that automatically classifies n-grams extracted from a corpus as terms or non-terms. We use 10 statistics common in the term extraction literature as training features. The proposed method, applicable to term recognition across multiple domains and languages, can 1) avoid laborious post-processing (e.g. subjective threshold setting) and 2) handle skewness in the training data and show noticeable resilience to domain shift. Experiments are carried out on 6 corpora covering multiple domains and languages: GENIA and the ACLRD-TEC(1.0) corpus serve as the training set, and four TTC subcorpora on wind energy and mobile technology, in both Chinese and English, serve as the test set. The results are promising and indicate that the approach can identify both single-word and multiword terms with reasonably good precision and recall.
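The classification step described in the abstract can be illustrated with a minimal sketch. It assumes pre-tokenized documents and uses four illustrative statistics (term frequency, document frequency, a smoothed TF-IDF-style score, and n-gram length) rather than the paper's 10 features, which the abstract does not list. It also swaps in a plain logistic regression, since the abstract does not name the classifier. All function names and the toy labels are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, max_n=3):
    """Yield every contiguous n-gram (as a tuple) for n = 1..max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def candidate_features(docs, max_n=3):
    """Map each candidate n-gram to [tf, df, tf * idf, length].

    These four statistics are illustrative stand-ins, not the paper's features.
    """
    tf, df = Counter(), Counter()
    for tokens in docs:
        grams = list(ngrams(tokens, max_n))
        tf.update(grams)       # corpus-wide frequency
        df.update(set(grams))  # number of documents containing the n-gram
    n_docs = len(docs)
    feats = {}
    for gram, freq in tf.items():
        idf = math.log((1 + n_docs) / (1 + df[gram])) + 1.0  # smoothed IDF
        feats[gram] = [float(freq), float(df[gram]), freq * idf, float(len(gram))]
    return feats


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logreg(X, y, lr=0.05, epochs=300):
    """Fit logistic regression with plain stochastic gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b


def term_probability(model, x):
    """Probability that feature vector x belongs to a term."""
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


if __name__ == "__main__":
    # Toy corpus and hypothetical gold labels, for illustration only.
    docs = [
        ["wind", "turbine", "blade"],
        ["the", "wind", "turbine"],
        ["the", "blade", "of", "the", "turbine"],
    ]
    gold_terms = {("wind", "turbine"), ("turbine",), ("blade",)}

    feats = candidate_features(docs, max_n=2)
    grams = sorted(feats)
    X = [feats[g] for g in grams]
    y = [1 if g in gold_terms else 0 for g in grams]

    model = train_logreg(X, y)
    for g in grams:
        print(" ".join(g), round(term_probability(model, feats[g]), 2))
```

Because the classifier outputs a probability for every candidate, the decision boundary is learned from labeled data instead of being set by hand for each statistic, which is the post-processing burden the abstract says the method avoids. A real setup would train on one corpus and score candidates from another, as the paper does with its separate training and test corpora.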


