Semi-supervised Learning of Domain-Specific Language Models from General Domain Data

Shuanhu Bai, Min Zhang, Haizhou Li

2009 International Conference on Asian Language Processing, published 2009-12-01. DOI: 10.1109/IALP.2009.65 (https://doi.org/10.1109/IALP.2009.65)

Abstract

We present a semi-supervised learning method for building domain-specific language models (LMs) from general-domain data. The method uses a small amount of domain-specific data as seeds to tap the domain-specific resources residing in a larger amount of general-domain data, with the help of topic modeling. The proposed algorithm first performs topic decomposition (TD) on the combined set of domain-specific and general-domain data using probabilistic latent semantic analysis (PLSA). It then derives domain-specific word n-gram counts using the mixture modeling scheme of PLSA. Finally, it constructs domain-specific LMs from these counts with a traditional n-gram modeling approach. Experimental results on our data sets show that this approach can outperform both state-of-the-art methods and a simulated supervised learning method. In particular, the semi-supervised learning method can achieve better performance even with a very small amount of domain-specific data.
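
The three steps named in the abstract can be illustrated with a toy sketch. The abstract does not give the paper's actual formulation, so everything below is an assumption made for illustration: a two-topic PLSA (one "domain" topic, one "background" topic), the domain topic initialised from seed word frequencies, and each document's bigram counts weighted by its posterior P(domain | d). The function names `fit_two_topic_plsa` and `domain_bigram_counts` and the example sentences are hypothetical.

```python
from collections import Counter


def _normalise(weights):
    """Scale a dict of non-negative weights so its values sum to 1."""
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}


def fit_two_topic_plsa(seed_docs, general_docs, n_iter=30, eps=0.01):
    """Return P(domain | d) for every document, seed documents first.

    Assumption (not from the paper): topic 0 ("domain") starts from seed
    word frequencies and topic 1 ("background") from whole-corpus frequencies.
    """
    docs = seed_docs + general_docs
    counts = [Counter(d) for d in docs]
    vocab = {w for d in docs for w in d}
    seed_freq = Counter(w for d in seed_docs for w in d)
    all_freq = Counter(w for d in docs for w in d)
    p_w_z = [
        _normalise({w: seed_freq[w] + eps for w in vocab}),
        _normalise({w: all_freq[w] + eps for w in vocab}),
    ]
    p_z_d = [[0.5, 0.5] for _ in docs]
    floor = 1e-9  # keeps every probability strictly positive
    for _ in range(n_iter):
        exp_w_z = [dict.fromkeys(vocab, floor) for _ in range(2)]
        exp_z_d = [[floor, floor] for _ in docs]
        for d, cnt in enumerate(counts):
            for w, n in cnt.items():
                # E-step: P(z | d, w) is proportional to P(z | d) * P(w | z).
                joint = [p_z_d[d][z] * p_w_z[z][w] for z in range(2)]
                total = joint[0] + joint[1]
                for z in range(2):
                    resp = n * joint[z] / total
                    exp_w_z[z][w] += resp
                    exp_z_d[d][z] += resp
        # M-step: renormalise the expected counts.
        p_w_z = [_normalise(t) for t in exp_w_z]
        p_z_d = [[a / (a + b), b / (a + b)] for a, b in exp_z_d]
    return [p[0] for p in p_z_d]


def domain_bigram_counts(seed_docs, general_docs, n_iter=30):
    """Fractional bigram counts: each document contributes P(domain | d)."""
    docs = seed_docs + general_docs
    p_domain = fit_two_topic_plsa(seed_docs, general_docs, n_iter)
    counts = Counter()
    for doc, weight in zip(docs, p_domain):
        for bigram in zip(doc, doc[1:]):
            counts[bigram] += weight
    return counts


# Hypothetical toy data: one travel seed, three general-domain documents.
seed_docs = [["book", "flight", "to", "tokyo"]]
general_docs = [
    ["book", "flight", "to", "paris"],
    ["chop", "onion", "and", "garlic"],
    ["chop", "garlic", "and", "onion"],
]
counts = domain_bigram_counts(seed_docs, general_docs)
```

In this sketch, bigrams from travel-like documents keep most of their count while bigrams from the cooking documents are weighted down. The resulting fractional counts could then be passed to any conventional n-gram estimator that accepts non-integer counts; that last step is not shown here.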