Cross-language web page classification via dual knowledge transfer using nonnegative matrix tri-factorization

Hua Wang, Heng Huang, F. Nie, C. Ding
{"title":"Cross-language web page classification via dual knowledge transfer using nonnegative matrix tri-factorization","authors":"Hua Wang, Heng Huang, F. Nie, C. Ding","doi":"10.1145/2009916.2010041","DOIUrl":null,"url":null,"abstract":"The lack of sufficient labeled Web pages in many languages, especially for those uncommonly used ones, presents a great challenge to traditional supervised classification methods to achieve satisfactory Web page classification performance. To address this, we propose a novel Nonnegative Matrix Tri-factorization (NMTF) based Dual Knowledge Transfer (DKT) approach for cross-language Web page classification, which is based on the following two important observations. First, we observe that Web pages for a same topic from different languages usually share some common semantic patterns, though in different representation forms. Second, we also observe that the associations between word clusters and Web page classes are a more reliable carrier than raw words to transfer knowledge across languages. With these recognitions, we attempt to transfer knowledge from the auxiliary language, in which abundant labeled Web pages are available, to target languages, in which we want classify Web pages, through two different paths: word cluster approximations and the associations between word clusters and Web page classes. Due to the reinforcement between these two different knowledge transfer paths, our approach can achieve better classification accuracy. We evaluate the proposed approach in extensive experiments using a real world cross-language Web page data set. Promising results demonstrate the effectiveness of our approach that is consistent with our theoretical analyses.","PeriodicalId":356580,"journal":{"name":"Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval","volume":"7 4 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2011-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"82","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2009916.2010041","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 82

Abstract

The lack of sufficient labeled Web pages in many languages, especially for those uncommonly used ones, presents a great challenge to traditional supervised classification methods to achieve satisfactory Web page classification performance. To address this, we propose a novel Nonnegative Matrix Tri-factorization (NMTF) based Dual Knowledge Transfer (DKT) approach for cross-language Web page classification, which is based on the following two important observations. First, we observe that Web pages for a same topic from different languages usually share some common semantic patterns, though in different representation forms. Second, we also observe that the associations between word clusters and Web page classes are a more reliable carrier than raw words to transfer knowledge across languages. With these recognitions, we attempt to transfer knowledge from the auxiliary language, in which abundant labeled Web pages are available, to target languages, in which we want classify Web pages, through two different paths: word cluster approximations and the associations between word clusters and Web page classes. Due to the reinforcement between these two different knowledge transfer paths, our approach can achieve better classification accuracy. We evaluate the proposed approach in extensive experiments using a real world cross-language Web page data set. Promising results demonstrate the effectiveness of our approach that is consistent with our theoretical analyses.
基于非负矩阵三因子分解的双知识迁移跨语言网页分类
在许多语言中,特别是在一些不常用的语言中,缺乏足够的标记Web页面,这对传统的监督分类方法实现令人满意的Web页面分类性能提出了很大的挑战。为了解决这个问题,我们提出了一种新的基于非负矩阵三因子分解(NMTF)的双知识转移(DKT)跨语言网页分类方法,该方法基于以下两个重要的观察结果。首先,我们观察到来自不同语言的同一主题的Web页面通常共享一些共同的语义模式,尽管以不同的表示形式。其次,我们还观察到,与原始单词相比,聚类和网页类之间的关联是跨语言传递知识的更可靠的载体。通过这些识别,我们尝试将知识从辅助语言(其中有大量标记的Web页面可用)转移到目标语言(我们想要对Web页面进行分类),通过两种不同的途径:词簇近似和词簇与Web页面类之间的关联。由于这两种不同的知识转移路径之间的强化,我们的方法可以获得更好的分类精度。我们在使用真实世界跨语言Web页面数据集的大量实验中评估了所提出的方法。有希望的结果证明了我们方法的有效性,这与我们的理论分析是一致的。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信