Word Embeddings vs Word Types for Sequence Labeling: the Curious Case of CV Parsing

VS@HLT-NAACL · Pub Date: 2015-06-01 · DOI: 10.3115/v1/W15-1517

Melanie Tosik, C. Hansen, Gerard Goossen, M. Rotaru

Citations: 13

Abstract

We explore new methods for improving Curriculum Vitae (CV) parsing of German documents, building on recent research on word embeddings in Natural Language Processing (NLP). Our approach integrates word embeddings as input features for a probabilistic sequence labeling model based on the Conditional Random Field (CRF) framework. The best-performing word embeddings are generated from a large sample of German CVs. The best results on the extraction task are obtained by the model that combines the word embeddings with a number of hand-crafted features. The improvements are consistent across different sections of the target documents. The effect of the word embeddings is strongest on semi-structured, out-of-sample data.
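
As a rough illustration of what "word embeddings as input features" for a CRF can look like, the sketch below builds one feature dictionary per token, mixing a few hand-crafted features with the dimensions of an embedding vector as real-valued features. This is not the authors' implementation: the names `TOY_EMBEDDINGS`, `token_features`, and `sentence_features`, the three-dimensional vectors, and the particular hand-crafted features are all assumptions made for the example.

```python
from typing import Dict, List, Sequence

# Hypothetical 3-dimensional embeddings, invented for this example.
# In practice the vectors would be trained on a large text sample.
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "berufserfahrung": [0.8, -0.1, 0.3],
    "softwareentwickler": [0.7, 0.2, -0.4],
}


def token_features(
    tokens: Sequence[str],
    i: int,
    embeddings: Dict[str, List[float]],
    dim: int,
) -> Dict[str, float]:
    """Build a feature dict for tokens[i]: hand-crafted plus embedding features."""
    token = tokens[i]
    # A few illustrative hand-crafted features (assumed, not from the paper).
    feats: Dict[str, float] = {
        "bias": 1.0,
        "is_title": float(token.istitle()),
        "is_digit": float(token.isdigit()),
        "is_first": float(i == 0),
    }
    # Look up the lowercased token; unknown words fall back to a zero vector.
    vector = embeddings.get(token.lower(), [0.0] * dim)
    # Each embedding dimension becomes one real-valued feature.
    for k, value in enumerate(vector):
        feats[f"emb_{k}"] = value
    return feats


def sentence_features(
    tokens: Sequence[str],
    embeddings: Dict[str, List[float]],
    dim: int,
) -> List[Dict[str, float]]:
    """Feature dicts for a whole token sequence, one per token."""
    return [token_features(tokens, i, embeddings, dim) for i in range(len(tokens))]


if __name__ == "__main__":
    line = ["Berufserfahrung", "2010", "Softwareentwickler"]
    for token, feats in zip(line, sentence_features(line, TOY_EMBEDDINGS, 3)):
        print(token, feats)
```

A sequence of such dictionaries, paired with one label per token, is the kind of input a CRF toolkit that supports real-valued features could be trained on. Dropping the `emb_*` entries gives a hand-crafted-only baseline for comparison.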


