Semi-supervised Learning of Domain-Specific Language Models from General Domain Data

2009 International Conference on Asian Language Processing. Publication date: 2009-12-01. DOI: 10.1109/IALP.2009.65

Shuanhu Bai, Min Zhang, Haizhou Li

Citations: 0

Abstract

We present a semi-supervised learning method for building domain-specific language models (LMs) from general-domain data. The method uses a small amount of domain-specific data as seeds to tap domain-specific resources residing in a larger amount of general-domain data, with the help of topic modeling techniques. The proposed algorithm first performs topic decomposition (TD) on the combined set of domain-specific and general-domain data using probabilistic latent semantic analysis (PLSA). It then derives domain-specific word n-gram counts using the mixture modeling scheme of PLSA. Finally, it applies a traditional n-gram modeling approach to construct domain-specific LMs from these counts. Experimental results show that, on our data sets, this approach can outperform both state-of-the-art methods and the simulated supervised learning method. In particular, the semi-supervised learning method can achieve better performance even with a very small amount of domain-specific data.
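The three steps named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything below is an assumption made for illustration: the function names `plsa` and `domain_bigram_counts` are hypothetical, the domain profile is taken as the mean topic mixture of the seed documents, and each general-domain document's bigram counts are scaled by the overlap between its topic mixture and that profile. The paper's actual mixture modeling scheme may differ.

```python
import numpy as np
from collections import Counter


def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit PLSA with plain EM on a (documents x vocabulary) count matrix."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_z_d = rng.random((n_docs, n_topics))          # P(z | d)
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words))         # P(w | z)
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step: P(z | d, w) is proportional to P(z | d) * P(w | z).
        post = p_z_d[:, :, None] * p_w_z[None, :, :]        # shape (D, K, V)
        post /= post.sum(axis=1, keepdims=True) + 1e-12
        # M-step: re-estimate both factors from expected counts.
        expected = counts[:, None, :] * post                # n(d, w) * P(z | d, w)
        p_w_z = expected.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = expected.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_z_d, p_w_z


def domain_bigram_counts(docs, seed_ids, n_topics=2, n_iter=50):
    """Return fractional bigram counts weighted by an assumed domain relevance."""
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(docs), len(vocab)))
    for d, doc in enumerate(docs):
        for w in doc:
            counts[d, index[w]] += 1
    p_z_d, _ = plsa(counts, n_topics, n_iter)
    # Assumption: the domain profile is the mean topic mixture of the seeds.
    p_z_domain = p_z_d[seed_ids].mean(axis=0)
    # Assumption: document relevance is the overlap of the two topic mixtures.
    weights = p_z_d @ p_z_domain
    weights[seed_ids] = 1.0                         # seed documents count fully
    bigrams = Counter()
    for doc, weight in zip(docs, weights):
        for pair in zip(doc, doc[1:]):
            bigrams[pair] += weight
    return bigrams, weights


if __name__ == "__main__":
    docs = [
        ["stock", "market", "rises"],       # seed (domain-specific) document
        ["market", "rises", "again"],
        ["cat", "sat", "on", "mat"],
    ]
    bigrams, weights = domain_bigram_counts(docs, seed_ids=[0])
    print(weights)
    print(bigrams.most_common(3))
```

The fractional counts returned here would then be passed to a conventional n-gram estimator; smoothing and the final LM construction are left out of this sketch.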

