Graph-based Topic Extraction Using Centroid Distance of Phrase Embeddings on Healthy Aging Open-ended Survey Questions

2020 International Conference on Data Mining Workshops (ICDMW) · Pub Date: 2020-11-01 · DOI: 10.1109/ICDMW51313.2020.00088

D. Kosmajac, Kirstie Smith, Vlado Keselj, S. Kirkland



Abstract

Open-ended questions are an important part of research surveys, but they are hard to process: manual processing is labour-intensive, and automation requires NLP methods because free text has no standardized structure. To address this, we present a solution for topic discovery and analysis of open-ended survey items. We use a graph-based representation of the text, which adds structure and makes manipulation and keyphrase retrieval easier. Additionally, we use pre-trained fastText aligned word vectors to cluster similar phrases even when they are written in different languages. The goal is to produce representative topic words and phrases that a domain expert can easily interpret. We compare the method with traditional LDA and two state-of-the-art algorithms, BTM and WNTM. In similar experimental settings, the resulting topic keyphrases are more intuitive to domain experts than those obtained by the reference topic models.
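The abstract does not spell out how phrases are clustered, so the following is only a minimal sketch of the general idea of comparing phrases by the centroid of their word embeddings. It assumes a phrase is represented by the mean of its word vectors and that phrases are grouped greedily when the cosine distance between centroids falls below a threshold. The 3-dimensional `TOY_VECTORS` table, the function names, and the `threshold` value are all hypothetical; real aligned fastText vectors are 300-dimensional and would be loaded from the published files.

```python
import math

# Hypothetical 3-d stand-ins for aligned word vectors (English and French).
# Real aligned fastText vectors would be loaded from file instead.
TOY_VECTORS = {
    "good": [0.90, 0.10, 0.00],
    "health": [0.80, 0.20, 0.10],
    "bonne": [0.88, 0.12, 0.02],
    "santé": [0.79, 0.22, 0.08],
    "family": [0.10, 0.90, 0.10],
    "support": [0.20, 0.80, 0.20],
    "famille": [0.12, 0.88, 0.10],
}


def phrase_centroid(phrase, vectors):
    """Return the mean vector of the phrase's known words, or None if none are known."""
    words = [w for w in phrase.lower().split() if w in vectors]
    if not words:
        return None
    dim = len(vectors[words[0]])
    return [sum(vectors[w][i] for w in words) / len(words) for i in range(dim)]


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def cluster_phrases(phrases, vectors, threshold=0.1):
    """Greedily assign each phrase to the first cluster whose seed centroid is close enough."""
    clusters = []  # each cluster is a list of (phrase, centroid) pairs
    for phrase in phrases:
        centroid = phrase_centroid(phrase, vectors)
        if centroid is None:
            continue  # skip phrases with no known words
        for cluster in clusters:
            seed_centroid = cluster[0][1]
            if cosine_distance(centroid, seed_centroid) <= threshold:
                cluster.append((phrase, centroid))
                break
        else:
            clusters.append([(phrase, centroid)])
    return [[p for p, _ in cluster] for cluster in clusters]


groups = cluster_phrases(
    ["good health", "bonne santé", "family support", "famille"], TOY_VECTORS
)
print(groups)
```

Because the vectors are assumed to live in one shared, aligned space, an English phrase and its French counterpart end up with nearby centroids and fall into the same group. The greedy single-pass grouping is a simplification chosen for brevity, not a description of the paper's procedure.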
