Domain Adaptation for Text Classification with Weird Embeddings

Proceedings of the Seventh Italian Conference on Computational Linguistics CLiC-it 2020. DOI: 10.4000/books.aaccademia.8250

Valerio Basile


Abstract

Pre-trained word embeddings are often used to initialize deep learning models for text classification, as a way to inject precomputed lexical knowledge and speed up learning. However, such embeddings are usually trained on generic corpora, while text classification tasks are often domain-specific. We propose a fully automated method to adapt pre-trained word embeddings to any given classification task, which requires no resources beyond the original training set. The method is based on the concept of word weirdness, extended to score the words in the training set according to how characteristic they are of each label in a text classification dataset. These polarized weirdness scores are then used to update the word embeddings so that they reflect task-specific semantic shifts. Our experiments show that this method improves performance on several text classification tasks in different languages.
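The scoring step described in the abstract can be illustrated with a minimal sketch. It assumes that polarized weirdness follows the classic weirdness index, that is, the ratio between a word's relative frequency in the texts carrying a given label and its relative frequency in the remaining texts. The abstract does not give the exact formula, so this definition, the function name polarized_weirdness, the whitespace tokenization, and the additive smoothing are all assumptions made for illustration. The embedding update step is not covered, since the abstract does not specify it.

```python
from collections import Counter


def polarized_weirdness(texts, labels, target_label, smoothing=1.0):
    """Score words by how characteristic they are of target_label.

    Assumed definition: relative frequency of a word in texts labeled
    target_label divided by its relative frequency in all other texts.
    """
    in_counts, out_counts = Counter(), Counter()
    for text, label in zip(texts, labels):
        # Naive lowercase whitespace tokenization (an assumption).
        tokens = text.lower().split()
        if label == target_label:
            in_counts.update(tokens)
        else:
            out_counts.update(tokens)

    vocab = set(in_counts) | set(out_counts)
    # Additive smoothing (an assumption) avoids division by zero for
    # words that occur on only one side of the split.
    in_total = sum(in_counts.values()) + smoothing * len(vocab)
    out_total = sum(out_counts.values()) + smoothing * len(vocab)

    return {
        word: ((in_counts[word] + smoothing) / in_total)
        / ((out_counts[word] + smoothing) / out_total)
        for word in vocab
    }


# Toy data, invented purely for illustration.
texts = [
    "great movie great cast",
    "terrible plot",
    "great fun",
    "terrible terrible acting",
]
labels = ["pos", "neg", "pos", "neg"]
scores = polarized_weirdness(texts, labels, "pos")
```

Under this definition, a score above 1 marks a word as over-represented in the target label and a score below 1 as under-represented. On the toy data, "great" scores above 1 for the "pos" label and "terrible" scores below 1.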

