Human-In-The-Loop Topic Modelling: Assessing topic labelling and genre-topic relations with a movie plot summary corpus
P. Matthews
In: The Human Position in an Artificial World: Creativity, Ethics and AI in Knowledge Organization
Published: 2020-03-31
DOI: 10.5771/9783956505508-181 (https://doi.org/10.5771/9783956505508-181)
A much-used but not yet mainstream text analysis approach, topic modelling allows the identification of lexical themes in a document collection. Measured against principles for interpretable AI and sociotechnical design, it has definite strengths in its speed and ability to discover structure, but challenges remain in how results can be interpreted, whether by analysts, domain experts, or potential end users. Automated coherence and labelling measures go some way toward bridging the understanding and trust gap, and user empowerment through visualisation and design intervention is starting to show how the remaining ground might be made up. This study applies topic modelling to a corpus of Wikipedia movie plot summaries to illustrate both the challenges and the potential. Topic labelling for naive users was found to be easy in only a quarter of cases, and difficulty increased markedly with 100 topics compared to 50. While automated measures suggested 88 topics, the number manageable by users was closer to 50. The unsupervised topic model was compared with the movie genre labels, and the comparison indicated that the two might work well together: topics can complement genres, match content across genres, and highlight within-genre variability. It is suggested that unsupervised models might work better for creativity and discovery than semi-supervised versions.
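The genre-topic comparison described above can be illustrated with a minimal sketch. It assumes each summary has already been assigned a single dominant topic by some topic model. The toy data, the function names, and the use of Shannon entropy as a measure of within-genre variability are all illustrative assumptions, not the method used in the study.

```python
from collections import Counter, defaultdict
from math import log2


def genre_topic_table(docs):
    """Count how often each dominant topic occurs within each genre."""
    table = defaultdict(Counter)
    for genre, topic in docs:
        table[genre][topic] += 1
    return table


def topic_entropy(counts):
    """Shannon entropy (in bits) of one genre's topic distribution.

    0 means every summary in the genre shares one topic; higher values
    mean the genre is spread over more topics.
    """
    total = sum(counts.values())
    return sum((n / total) * log2(total / n) for n in counts.values())


# Hypothetical toy data: (genre label, dominant topic id) per movie summary.
toy_docs = [
    ("western", 3), ("western", 3), ("western", 3), ("western", 3),
    ("drama", 1), ("drama", 3), ("drama", 5), ("drama", 7),
    ("comedy", 1), ("comedy", 1), ("comedy", 5), ("comedy", 5),
]

table = genre_topic_table(toy_docs)

# Within-genre variability: one entropy value per genre.
entropies = {genre: topic_entropy(counts) for genre, counts in table.items()}

# Cross-genre matches: topics that appear under more than one genre.
topic_genres = defaultdict(set)
for genre, counts in table.items():
    for topic in counts:
        topic_genres[topic].add(genre)
shared_topics = {
    topic: sorted(genres)
    for topic, genres in topic_genres.items()
    if len(genres) > 1
}

for genre in sorted(entropies):
    print(f"{genre}: {entropies[genre]:.2f} bits over {len(table[genre])} topics")
print("topics shared across genres:", shared_topics)
```

In this toy setup, a genre with low entropy is one the topic model largely reproduces, while a high-entropy genre is one where topics add detail beyond the genre label. Topics shared across genres point to content that genre labels alone would keep apart.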