An EM based training algorithm for cross-language text categorization

Leonardo Rigutini, Marco Maggini, B. Liu
Published in: The 2005 IEEE/WIC/ACM International Conference on Web Intelligence (WI'05), 2005-09-19. DOI: 10.1109/WI.2005.29 (https://doi.org/10.1109/WI.2005.29)
Citation count: 101

Abstract

As the Web globalizes, many companies and institutions need to efficiently organize and search repositories containing multilingual documents. Managing these heterogeneous text collections increases costs significantly, because experts in different languages are required to organize them. Cross-language text categorization can provide techniques to extend existing automatic classification systems in one language to new languages without requiring additional intervention by human experts. In this paper, we propose a learning algorithm based on the EM scheme that can be used to train text classifiers in a multilingual environment. In particular, we assume that a predefined category set and a collection of labeled training data are available for a given language L1. A classifier for a different language L2 is trained by translating the available labeled training set from L1 to L2 and by using an additional set of unlabeled documents from L2. This technique allows us to extract the correct statistical properties of the language L2, which are not completely available in automatically translated examples because of the different characteristics of language L1 and the approximations of the translation process. Our experimental results show that the performance of the proposed method is very promising when applied to a test document set extracted from newsgroups in English and Italian.
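
The training scheme described in the abstract can be illustrated with a minimal sketch. The abstract does not name the base classifier or the exact update rule, so everything specific here is an assumption: a multinomial naive Bayes model with Laplace smoothing, soft class posteriors in the E-step, and the translated documents kept with fixed labels in every M-step. The function names (`train_nb`, `posterior`, `em_cross_language`, `predict`) and the toy Italian tokens are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

def train_nb(docs, soft_labels, classes, vocab, alpha=1.0):
    """M-step: fit multinomial naive Bayes from (possibly fractional) labels."""
    prior = {c: alpha for c in classes}
    counts = {c: Counter() for c in classes}
    for doc, probs in zip(docs, soft_labels):
        for c in classes:
            weight = probs[c]
            prior[c] += weight
            for tok in doc:
                counts[c][tok] += weight
    total = sum(prior.values())
    log_prior = {c: math.log(prior[c] / total) for c in classes}
    log_like = {}
    for c in classes:
        denom = sum(counts[c].values()) + alpha * len(vocab)
        log_like[c] = {t: math.log((counts[c][t] + alpha) / denom) for t in vocab}
    return log_prior, log_like

def posterior(doc, model, classes):
    """E-step for one document: P(class | doc) under the current model."""
    log_prior, log_like = model
    scores = {c: log_prior[c] + sum(log_like[c][t] for t in doc if t in log_like[c])
              for c in classes}
    top = max(scores.values())
    exp = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exp.values())
    return {c: v / z for c, v in exp.items()}

def em_cross_language(translated_docs, labels, unlabeled_docs, classes, iterations=10):
    """Initialize on translated L1 data, then refine with unlabeled L2 documents."""
    vocab = {t for d in translated_docs + unlabeled_docs for t in d}
    hard = [{c: 1.0 if c == y else 0.0 for c in classes} for y in labels]
    model = train_nb(translated_docs, hard, classes, vocab)
    for _ in range(iterations):
        # E-step: soft-label the native L2 documents with the current model.
        soft = [posterior(d, model, classes) for d in unlabeled_docs]
        # M-step (assumption): re-estimate from translated plus soft-labeled data.
        model = train_nb(translated_docs + unlabeled_docs, hard + soft, classes, vocab)
    return model

def predict(doc, model, classes):
    """Return the most probable class for a tokenized document."""
    probs = posterior(doc, model, classes)
    return max(classes, key=lambda c: probs[c])

# Toy data: "translated" labeled documents and native unlabeled documents in L2.
CLASSES = ["sport", "computer"]
translated = [["partita", "calcio", "gol"], ["processore", "memoria", "software"]]
translated_labels = ["sport", "computer"]
unlabeled = [["calcio", "squadra", "gol"], ["squadra", "campionato"],
             ["software", "server", "memoria"], ["server", "tastiera"]]

model = em_cross_language(translated, translated_labels, unlabeled, CLASSES)
print(predict(["squadra", "campionato"], model, CLASSES))
print(predict(["server", "tastiera"], model, CLASSES))
```

In this toy setup the words "squadra", "campionato", "server" and "tastiera" never occur in the translated training set, so the initial classifier has no evidence about them. The EM iterations pick up their class associations from co-occurrence in the unlabeled L2 documents, which is the kind of language-specific statistic the abstract says is missing from automatically translated examples.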