Topic identification from news blog in Spanish language

Q4 Computer Science
Lizbeth Pacheco-Guevara, R. Reátegui, P. Valdiviezo-Diaz
DOI: 10.33936/isrtic.v6i1.4514 (https://doi.org/10.33936/isrtic.v6i1.4514)
Journal: Revista de Informatica Teorica e Aplicada
Volume/issue: 54 1
Publication date: 2022-05-27
Publication type: Journal Article
Citations: 0

Abstract

A large amount of news now exists in digital format and needs to be classified or labeled automatically according to its content. Latent Dirichlet Allocation (LDA) is an unsupervised technique that automatically creates topics based on the words in documents. The present work applies LDA to analyze and extract topics from digital news in Spanish. A total of 198 digital news items were collected from a university news blog. The data were pre-processed and represented in vector space, and k values (the number of topics) were selected based on a coherence metric. A TF-IDF matrix combined with unigrams and bigrams produced topics with a variety of terms related to university activities such as study programs, research, innovation projects, and social responsibility. Furthermore, manual validation showed that the terms in the topics correspond to hashtags written by the communication professionals.
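The vector-space representation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state which tools or tokenizer the authors used, so everything here is an assumption: the function names `ngrams` and `tfidf_matrix` are hypothetical, tokenization is a plain whitespace split, the IDF uses a common smoothed form, and the three sample sentences are invented for illustration rather than taken from the corpus.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=2):
    """Return unigrams plus space-joined n-grams up to n_max for one document."""
    terms = []
    for n in range(1, n_max + 1):
        terms += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return terms


def tfidf_matrix(docs, n_max=2):
    """Build a dense TF-IDF matrix (rows: documents, columns: sorted vocabulary).

    Assumption: raw term counts times a smoothed IDF, ln((1 + N) / (1 + df)) + 1,
    with no row normalization. The paper may have used a different weighting.
    """
    doc_terms = [Counter(ngrams(d.lower().split(), n_max)) for d in docs]
    vocab = sorted(set().union(*doc_terms))
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(t for counts in doc_terms for t in counts)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    matrix = [[counts[t] * idf[t] for t in vocab] for counts in doc_terms]
    return vocab, matrix


# Invented sample sentences, not items from the 198-document corpus.
docs = [
    "proyecto de investigación universitaria",
    "proyecto de innovación social",
    "oferta académica de la universidad",
]
vocab, matrix = tfidf_matrix(docs)
```

A matrix of this kind would then be passed to an LDA implementation, with the number of topics k chosen by comparing coherence scores across candidate values; that step needs a topic-modeling library and is not shown here.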
Source journal
Revista de Informatica Teorica e Aplicada
Subject area: Computer Science, Computer Science (all)
CiteScore: 0.90
Self-citation rate: 0.00%
Articles published: 14