Leveraging Wikipedia knowledge to cross-language classify textual news

M. Mouriño-García, Roberto Pérez-Rodríguez, L. Anido-Rifón
Published in: 2017 IEEE 4th International Conference on Soft Computing & Machine Intelligence (ISCMI)
Publication date: 2017-11-01
DOI: 10.1109/ISCMI.2017.8279619
URL: https://doi.org/10.1109/ISCMI.2017.8279619
Citations: 0

Abstract

This paper presents a first attempt at leveraging Wikipedia knowledge to represent textual news stories as vectors of Wikipedia concepts, and analyzes the suitability of this representation for building a cross-language classifier of news stories written in Spanish that is trained only on English ones. We describe two approaches. The first (WikiBoC-CLCM) represents the news stories using Wikipedia concepts only. The second (Hybrid-WikiBoC) combines the WikiBoC-CLCM classifier with the state-of-the-art approach, which is based on the bag-of-words model together with machine translation techniques (BoW-MT). To evaluate the proposed approaches, we present a dataset of news stories written in English and Spanish, extracted from several online newspapers and news agencies such as Reuters and Europa Press. The results show that the purely concept-based WikiBoC-CLCM approach offers the highest classification performance, achieving increases of up to 55.07% over the state-of-the-art BoW-MT approach. The Hybrid-WikiBoC approach also outperforms the BoW-MT model, achieving performance increases of up to 2.34%. We conclude that leveraging Wikipedia knowledge is of great advantage in cross-language classification of textual news stories.
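
The idea of training on English and classifying Spanish through a shared concept space can be illustrated with a minimal sketch. The abstract does not say how concepts are extracted or which classifier is used, so everything here is an assumption made for illustration: the toy `CONCEPTS` lexicon stands in for a real Wikipedia-based annotator, the concept IDs are invented, and the nearest-centroid classifier with cosine similarity is a simple stand-in, not the authors' WikiBoC-CLCM implementation.

```python
from collections import Counter
import math

# Hypothetical toy lexicon: terms in either language map to the same
# language-independent concept ID (standing in for a Wikipedia article).
CONCEPTS = {
    "football": "C_FOOTBALL", "fútbol": "C_FOOTBALL",
    "goal": "C_GOAL", "gol": "C_GOAL",
    "election": "C_ELECTION", "elecciones": "C_ELECTION",
    "parliament": "C_PARLIAMENT", "parlamento": "C_PARLIAMENT",
}


def to_concepts(text):
    """Bag of concepts: count the concept IDs of the known terms in text."""
    tokens = text.lower().split()
    return Counter(CONCEPTS[t] for t in tokens if t in CONCEPTS)


def cosine(a, b):
    """Cosine similarity between two sparse concept-count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(labeled_docs):
    """Sum the concept vectors of each class into one centroid per label.

    Cosine similarity ignores vector length, so a sum works like a mean.
    """
    centroids = {}
    for text, label in labeled_docs:
        centroids.setdefault(label, Counter()).update(to_concepts(text))
    return centroids


def classify(centroids, text):
    """Return the label whose centroid is most similar to the text."""
    vec = to_concepts(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Train on English only, then classify Spanish text with no translation step.
english_train = [
    ("the football match ended with a late goal", "sports"),
    ("parliament called an early election", "politics"),
]
model = train(english_train)
prediction = classify(model, "el parlamento convoca elecciones")
```

Because both languages are projected onto the same concept IDs, the Spanish sentence shares features with the English training data even though the two share no surface words, which is the property a bag-of-words model can only obtain through machine translation.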