Machine Translation Vs. Multilingual Dictionaries Assessing Two Strategies for the Topic Modeling of Multilingual Text Collections

IF 3.7 | Zone 1 (Literature) | Q1 Communication

Communication Methods and Measures Pub Date : 2021-08-17 DOI:10.1080/19312458.2021.1955845

D. Maier, C. Baden, Daniela Stoltenberg, Maya De Vries-Kedem, A. Waldherr

Volume 16, Issue 1, pages 19-38 | Journal Article

Citations: 16

Abstract

The goal of this paper is to evaluate two methods for the topic modeling of multilingual document collections: (1) machine translation (MT), and (2) the coding of semantic concepts using a multilingual dictionary (MD) prior to topic modeling. We empirically assess the consequences of these approaches based on both a quantitative comparison of models and a qualitative validation of each method’s potentials and weaknesses. Our case study uses two text collections (of tweets and news articles) in three languages (English, Hebrew, Arabic), covering the ongoing local conflicts between Israeli authorities, settlers, and Palestinian Bedouins in the West Bank. We find that both methods produce a large share of equivalent topics, especially in the context of fairly homogeneous news discourse, yet show limited but systematic differences when applied to highly heterogeneous social media discourse. While the MD model delivers a more nuanced picture of conflict-related topics, it misses several more peripheral topics, especially those unrelated to the dictionary’s focus, which are picked up by the MT model. Our study is a first step toward instrument validation, indicating that both methods yield valid, comparable results, while method-specific differences remain.
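To make the dictionary-based (MD) strategy concrete, the following is a minimal sketch of coding multilingual text into shared concept codes before any topic model is fitted. It is not the authors' instrument: the dictionary entries, the concept codes, and the function names `code_concepts` and `concept_counts` are hypothetical, and whitespace tokenization stands in for whatever preprocessing a real pipeline would use.

```python
from collections import Counter

# Hypothetical toy dictionary: surface terms per language -> shared concept code.
# These entries are illustrative assumptions, not the dictionary used in the paper.
CONCEPT_DICTIONARY = {
    "en": {"demolition": "C_DEMOLITION", "settlers": "C_SETTLER", "village": "C_VILLAGE"},
    "he": {"הריסה": "C_DEMOLITION", "מתנחלים": "C_SETTLER", "כפר": "C_VILLAGE"},
    "ar": {"هدم": "C_DEMOLITION", "مستوطنون": "C_SETTLER", "قرية": "C_VILLAGE"},
}


def code_concepts(text: str, lang: str) -> list[str]:
    """Replace each known token with its concept code; drop unknown tokens."""
    lookup = CONCEPT_DICTIONARY[lang]
    return [lookup[token] for token in text.lower().split() if token in lookup]


def concept_counts(documents: list[tuple[str, str]]) -> list[Counter]:
    """Turn (lang, text) pairs into language-independent bag-of-concepts counts."""
    return [Counter(code_concepts(text, lang)) for lang, text in documents]


docs = [
    ("en", "Settlers protest village demolition"),
    ("he", "הריסה של כפר"),
    ("ar", "هدم قرية"),
]
bags = concept_counts(docs)
```

The resulting concept counts share one vocabulary across languages, so they could be passed to any topic model as a single corpus. Text that falls outside the dictionary's focus is coded to nothing in this sketch, which is in line with the abstract's observation that the MD model misses peripheral topics.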

Source journal

Communication Methods and Measures (Communication)

CiteScore: 21.10

Self-citation rate: 1.80%

About the journal: Communication Methods and Measures aims to bring developments in both qualitative and quantitative research methodology to the attention of communication scholars, and serves as a platform for researchers across the field to discuss and disseminate methodological tools and approaches. It seeks to improve research design and analysis practices, and to introduce new methods of measurement that are valuable to communication scientists or that enhance existing methods. The journal encourages submissions on methods for enhancing research design and theory testing, using either quantitative or qualitative approaches, and is open to articles exploring epistemological questions relevant to communication research methodologies. It also welcomes well-written manuscripts that demonstrate the use of methods, and articles that highlight the advantages of lesser-known or newer methods over those traditionally used in communication.