Leveraging Wikipedia knowledge to cross-language classify textual news

2017 IEEE 4th International Conference on Soft Computing & Machine Intelligence (ISCMI). Pub Date: 2017-11-01. DOI: 10.1109/ISCMI.2017.8279619

M. Mouriño-García, Roberto Pérez-Rodríguez, L. Anido-Rifón


Citation count: 0

Abstract



Leveraging Wikipedia knowledge to cross-language classify textual news

This paper presents a first attempt at leveraging Wikipedia knowledge to represent textual news stories as vectors of Wikipedia concepts, and analyzes the suitability of this representation for building a cross-language classifier that labels news stories written in Spanish after being trained only on English ones. We describe two approaches. The first (WikiBoC-CLCM) represents the news stories using Wikipedia concepts alone. The second (Hybrid-WikiBoC) combines the WikiBoC-CLCM classifier with the state-of-the-art approach based on the bag-of-words model together with machine translation techniques (BoW-MT). To evaluate the proposed approaches, we present a dataset of news stories written in English and Spanish, extracted from several online newspapers and news agencies such as Reuters and Europa Press. The results show that the purely concept-based WikiBoC-CLCM approach offers the highest classification performance, achieving increases of up to 55.07% over the state-of-the-art BoW-MT approach. The Hybrid-WikiBoC approach also outperforms the BoW-MT model, achieving performance increases of up to 2.34%. We conclude that leveraging Wikipedia knowledge is of great advantage in tasks of cross-language classification of textual news stories.
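The idea of a concept-based cross-language classifier can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which Wikipedia annotator or learning algorithm is used, so the toy lexicon `CONCEPTS`, the concept identifiers, and the nearest-centroid classifier with cosine similarity are all assumptions made for illustration. What the sketch shows is why a language-independent concept space lets a model trained on English stories label Spanish ones without translation.

```python
import math
from collections import Counter

# Hypothetical toy lexicon: surface forms in each language mapped to shared,
# language-independent concept IDs (stand-ins for Wikipedia concepts).
CONCEPTS = {
    "en": {"football": "C_FOOTBALL", "goal": "C_GOAL",
           "election": "C_ELECTION", "parliament": "C_PARLIAMENT",
           "vote": "C_VOTE"},
    "es": {"fútbol": "C_FOOTBALL", "gol": "C_GOAL",
           "elecciones": "C_ELECTION", "parlamento": "C_PARLIAMENT",
           "voto": "C_VOTE"},
}


def to_concepts(text, lang):
    """Represent a text as a bag (Counter) of concept IDs."""
    lexicon = CONCEPTS[lang]
    return Counter(lexicon[t] for t in text.lower().split() if t in lexicon)


def cosine(a, b):
    """Cosine similarity between two sparse concept vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(labelled_docs, lang):
    """Build one summed concept vector (centroid) per class."""
    centroids = {}
    for text, label in labelled_docs:
        centroids.setdefault(label, Counter()).update(to_concepts(text, lang))
    return centroids


def classify(text, lang, centroids):
    """Assign the class whose centroid is most similar to the text."""
    vec = to_concepts(text, lang)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Train on English stories only.
english_train = [
    ("the football team scored a late goal", "sports"),
    ("parliament called an election after the vote", "politics"),
]
model = train(english_train, "en")

# Classify Spanish stories with the English-trained model.
print(classify("el parlamento aprobó el voto", "es", model))         # politics
print(classify("un gol decidió el partido de fútbol", "es", model))  # sports
```

Because both languages are projected onto the same concept IDs, no machine translation step is needed at classification time, in contrast to a BoW-MT style pipeline where the Spanish text would first be translated into English words.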
