Evaluating Topic Modeling Pre-processing Pipelines for Portuguese Texts

Antônio Pereira De Souza Júnior, Pablo Cecilio, Felipe Viegas, Washington Cunha, E. Albergaria, L. Rocha
Published in: Proceedings of the Brazilian Symposium on Multimedia and the Web
Publication date: 2022-11-07
DOI: 10.1145/3539637.3557052 (https://doi.org/10.1145/3539637.3557052)
Citations: 3

Abstract

Topic Modeling (TM) is among the most widely used approaches for extracting and organizing information from large amounts of data. These approaches aim to find semantic topics in textual documents (e.g., product reviews, tweets). Despite their good results on English texts, we do not observe the same semantic quality when they are applied to Portuguese texts, which are more verbose and present varied and complex verb conjugations and many homonyms, among other particularities. This work intends to fill this gap by exploring and evaluating different Topic Modeling pre-processing pipelines for Portuguese texts, that is, sequences of tasks that need to be performed before the TM strategies are applied. More specifically, we evaluate different pre-processing pipeline configurations using different semantic data representations to overcome the challenges faced by TM strategies on Portuguese text. In our experimental evaluation, considering two datasets collected from Twitter and Reddit related to Brazilian political discussion, we show that our proposed extended pre-processing pipeline, especially when considering semantic representations, can achieve significant gains in effectiveness compared to the TM approaches originally proposed for English texts (up to 9x better).
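The abstract does not list the individual steps of the evaluated pipelines, so the following is only a minimal sketch of what a configurable pre-processing pipeline for Portuguese social-media text might look like. The chosen steps (lowercasing, URL and mention/hashtag removal, stopword filtering, optional accent folding), the function names, and the tiny stopword list are all illustrative assumptions, not the authors' method.

```python
import re
import unicodedata

# Tiny illustrative Portuguese stopword list (an assumption; real lists are far longer).
PT_STOPWORDS = {
    "a", "o", "as", "os", "de", "do", "da", "em", "e",
    "que", "para", "com", "um", "uma", "no", "na", "é",
}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"[@#]\w+")   # @user mentions and #hashtags
TOKEN_RE = re.compile(r"[^\W\d_]+")   # runs of Unicode letters only


def strip_accents(text):
    """Remove combining marks, e.g. 'eleição' becomes 'eleicao'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def preprocess(text, remove_stopwords=True, fold_accents=False):
    """Turn one raw document into a list of tokens.

    Each flag switches one hypothetical pipeline step on or off, so that
    different pipeline configurations can be compared.
    """
    # Compose characters first so accented letters match the token pattern.
    text = unicodedata.normalize("NFC", text).lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    tokens = TOKEN_RE.findall(text)
    if remove_stopwords:
        tokens = [t for t in tokens if t not in PT_STOPWORDS]
    if fold_accents:
        tokens = [strip_accents(t) for t in tokens]
    return tokens


if __name__ == "__main__":
    tweet = "O presidente falou sobre a eleição em https://t.co/abc @fulano #voto"
    print(preprocess(tweet))
    print(preprocess(tweet, fold_accents=True))
```

Exposing each step as a flag makes it straightforward to run the same topic model over several pipeline configurations and compare the resulting topics, which is the kind of comparison the abstract describes.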