Evaluating Topic Modeling Pre-processing Pipelines for Portuguese Texts

Proceedings of the Brazilian Symposium on Multimedia and the Web. Pub Date: 2022-11-07. DOI: 10.1145/3539637.3557052

Antônio Pereira De Souza Júnior, Pablo Cecilio, Felipe Viegas, Washington Cunha, E. Albergaria, L. Rocha

Citations: 3

Abstract

Topic Modeling (TM) is among the most widely used approaches for extracting and organizing information from large amounts of data. These approaches aim to find semantic topics in textual documents (e.g., product reviews, tweets). Despite their good results on English texts, we do not observe the same semantic quality when they are applied to Portuguese texts, which are more verbose and present varied and complex verb conjugations and many homonyms, among other particularities. This work intends to fill this gap by exploring and evaluating different Topic Modeling pre-processing pipelines for Portuguese texts, that is, sequences of tasks that need to be performed before the TM strategies are applied. More specifically, we evaluate different pre-processing pipeline configurations using different semantic data representations to overcome the challenges faced by TM strategies on Portuguese text. In our experimental evaluation, considering two datasets collected from Twitter and Reddit related to Brazilian political discussion, we show that our proposed extended pre-processing pipeline, especially when semantic representations are considered, can achieve significant gains in effectiveness compared to the TM approaches originally proposed for English texts (up to 9x better).
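
To make the idea of a pre-processing pipeline as a sequence of tasks run before topic modeling more concrete, the following is a minimal sketch in Python. The abstract does not list the individual steps, so every step here (lowercasing, URL and mention removal, accent stripping, tokenization, stopword filtering), every function name, and the tiny stopword list are illustrative assumptions, not the authors' pipeline.

```python
import re
import unicodedata
from typing import List

# Tiny illustrative stopword list (an assumption, not the list used in the paper).
# Entries are stored without accents because accents are stripped before filtering.
STOPWORDS = {"a", "o", "as", "os", "de", "do", "da", "e", "que", "em", "para", "um", "uma", "nao"}


def lowercase(text: str) -> str:
    """Normalize case so that 'Reforma' and 'reforma' count as the same term."""
    return text.lower()


def strip_urls_and_mentions(text: str) -> str:
    """Drop URLs and @mentions, which are common in tweets and Reddit posts."""
    return re.sub(r"https?://\S+|@\w+", " ", text)


def strip_accents(text: str) -> str:
    """Remove diacritics (e.g. 'não' -> 'nao') by dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def tokenize(text: str) -> List[str]:
    """Split into alphabetic tokens; assumes text is already lowercased and unaccented."""
    return re.findall(r"[a-z]+", text)


def remove_stopwords(tokens: List[str]) -> List[str]:
    """Filter out very frequent function words that carry little topical meaning."""
    return [t for t in tokens if t not in STOPWORDS]


def preprocess(text: str) -> List[str]:
    """Apply the string-level steps in order, then tokenize and filter."""
    for step in (lowercase, strip_urls_and_mentions, strip_accents):
        text = step(text)
    return remove_stopwords(tokenize(text))


if __name__ == "__main__":
    print(preprocess("O Congresso não votou a reforma! https://example.com @usuario"))
```

Keeping each step as a separate function makes it straightforward to compare pipeline configurations by adding, removing, or reordering steps, which is the kind of comparison the abstract describes. Steps that address verb conjugations (such as lemmatization) or semantic representations would need external language resources and are left out of this sketch.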
