Understanding Narratives from Demographic Survey Data: a Comparative Study with Multiple Neural Topic Models

Xiao Xu, Gert Stulp, Antal van den Bosch, Anne Gauthier
Proceedings of the Fifth Workshop on Natural Language Processing and Computational Social Science (NLP+CSS). DOI: 10.18653/v1/2022.nlpcss-1.4 (https://doi.org/10.18653/v1/2022.nlpcss-1.4)

Abstract

Fertility intentions as verbalized in surveys are a poor predictor of actual fertility outcomes, that is, the number of children people have. This can partly be explained by the uncertainty people have in their intentions. Such uncertainties are hard to capture through traditional survey questions, although open-ended questions can be used to gain insight into people’s subjective narratives of the future that determine their intentions. Answers to such open-ended questions can be analyzed with Natural Language Processing techniques. Traditional topic models (e.g., LSA and LDA), however, often fail to do so because they rely on word co-occurrences, which are often rare in short survey responses. The aim of this study was to apply and evaluate topic models on demographic survey data. We applied neural topic models (e.g., BERTopic, CombinedTM) based on language models to responses from Dutch women on their fertility plans, and compared the topics and their coherence scores from each model to expert judgments. Our results show that neural models produce topics more in line with human interpretation than LDA does. However, the coherence score only partly reflected this, depending on the corpus used for its calculation. This research is important because, first, it helps us develop more informed strategies for model selection and evaluation when applying topic modeling to survey data; and second, it shows that the field of demography has much to gain from adopting NLP methods.
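The point that a coherence score depends on the corpus used to calculate it can be illustrated with a minimal sketch. It assumes an NPMI-style coherence based on document-level co-occurrence; the abstract does not say which coherence measure the study used, and the function name `npmi_coherence` and both toy corpora are hypothetical, made up for illustration only.

```python
import math
from itertools import combinations


def npmi_coherence(topic_words, documents):
    """Average NPMI over all word pairs of one topic.

    Assumption for illustration: two words co-occur when both appear
    in the same document (whitespace tokenization, lowercased).
    """
    if len(topic_words) < 2:
        raise ValueError("need at least two topic words")
    doc_sets = [set(doc.lower().split()) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain every word in `words`.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            # The words never co-occur: NPMI takes its minimum value.
            scores.append(-1.0)
        elif p12 == 1:
            # Both words occur in every document: the pair carries no
            # information, and the normalizer would be zero.
            scores.append(0.0)
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)


# Hypothetical topic and corpora (not data from the study).
topic = ["job", "house"]

# Short survey-style answers: the topic words never share a response.
survey = ["job first", "house too small", "no partner yet", "job uncertain"]

# Longer reference texts: the same words do co-occur.
reference = [
    "a stable job and a house come first",
    "we bought a house after my job became permanent",
    "my partner is not ready",
    "children are expensive",
]

print("coherence on survey corpus:   ", npmi_coherence(topic, survey))
print("coherence on reference corpus:", npmi_coherence(topic, reference))
```

In this toy setting the same topic scores at the bottom of the scale on the short responses and much higher on the longer reference texts, which is one way the choice of corpus can change how well a coherence score tracks human judgment.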