Human-In-The-Loop Topic Modelling: Assessing topic labelling and genre-topic relations with a movie plot summary corpus

The Human Position in an Artificial World: Creativity, Ethics and AI in Knowledge Organization
Pub Date: 2020-03-31
DOI: 10.5771/9783956505508-181

P. Matthews


Citations: 4

Abstract

A much-used but not yet mainstream text analysis approach, topic modelling allows the identification of lexical themes in a document collection. Measured against principles for interpretable AI and sociotechnical design, its speed and ability to discover structure are definite strengths, but challenges remain in how results can be interpreted, whether by analysts, domain experts, or potential end users. Automated coherence and labelling measures go some of the way toward bridging the understanding and trust gap, and user empowerment through visualisation and design intervention is starting to show how the remaining ground might be made up. This study applies topic modelling to a corpus of Wikipedia movie summaries to illustrate both the challenges and the potential. Naive users found topic labelling easy in only a quarter of cases, and difficulty increased markedly with 100 topics compared to 50. While automated measures suggested 88 topics, the number manageable by users was closer to 50. Comparing the unsupervised topic model with the movie genre labels indicated that the two might work well together: topics could complement genres, match content across genres, and highlight within-genre variability. It is suggested that unsupervised models might work better for creativity and discovery than semi-supervised versions.
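The genre-topic comparison mentioned in the abstract can be illustrated with a minimal sketch. It is not the author's method: the function names `genre_topic_table` and `topic_shares`, the use of each document's single highest-weighted topic, and the toy data are all assumptions made for illustration only. The idea is simply to cross-tabulate genre labels against dominant topics so that within-genre variability becomes visible.

```python
from collections import Counter, defaultdict


def genre_topic_table(docs):
    """Count, for each genre, how many documents have each dominant topic.

    `docs` is an iterable of (genre, topic_weights) pairs, where
    topic_weights is a list of per-topic proportions for one document.
    (Hypothetical input format, chosen for this sketch.)
    """
    table = defaultdict(Counter)
    for genre, weights in docs:
        # Dominant topic = index of the largest weight (an assumption;
        # a fuller analysis might keep the whole distribution).
        dominant = max(range(len(weights)), key=weights.__getitem__)
        table[genre][dominant] += 1
    return table


def topic_shares(table):
    """Convert per-genre topic counts into within-genre proportions."""
    shares = {}
    for genre, counts in table.items():
        total = sum(counts.values())
        shares[genre] = {topic: n / total for topic, n in counts.items()}
    return shares


# Toy, made-up data (not from the study): three topics, two genres.
toy_docs = [
    ("comedy", [0.7, 0.2, 0.1]),
    ("comedy", [0.1, 0.8, 0.1]),
    ("horror", [0.1, 0.1, 0.8]),
    ("horror", [0.2, 0.1, 0.7]),
]

table = genre_topic_table(toy_docs)
shares = topic_shares(table)
print(shares)
```

In this toy data the "comedy" documents split across two dominant topics while the "horror" documents share one, which is the kind of within-genre variability the abstract refers to.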

