And, Action! Towards Leveraging Multimodal Patterns for Storytelling and Content Analysis

Natalie Parde
{"title":",行动!利用多模态模式进行故事叙述和内容分析","authors":"Natalie Parde","doi":"10.1145/3422839.3423060","DOIUrl":null,"url":null,"abstract":"Humans perform intelligent tasks by productively leveraging relevant information from numerous sensory and experiential inputs, and recent scientific and hardware advances have made it increasingly possible for machines to attempt this as well. However, improved resource availability does not automatically give rise to humanlike performance in complex tasks [1]. In this talk, I discuss recent work towards three tasks that benefit from an elegant synthesis of linguistic and visual input: visual storytelling, visual question answering (VQA), and affective content analysis. I focus primarily on visual storytelling, a burgeoning task with the goal of generating coherent, sensible narratives for sequences of input images [2]. I analyze recent work in this area, and then introduce a novel visual storytelling approach that employs a hierarchical context-based network, with a co-attention mechanism that jointly attends to patterns in visual (image) and linguistic (description) input. Following this, I describe ongoing work in VQA, another inherently multimodal task with the goal of producing accurate, sensible answers to questions about images. I explore a formulation in which the VQA model generates unconstrained, free-form text, providing preliminary evidence that harnessing the linguistic patterns latent in language models results in competitive task performance [3]. Finally, I introduce some intriguing new work that investigates the utility of linguistic patterns in a task that is not inherently multimodal: analyzing the affective content of images. I close by suggesting some exciting future directions for each of these tasks as they pertain to multimodal media analysis.","PeriodicalId":270338,"journal":{"name":"Proceedings of the 2nd International Workshop on AI for Smart TV Content Production, Access and Delivery","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-10-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"And, Action! Towards Leveraging Multimodal Patterns for Storytelling and Content Analysis\",\"authors\":\"Natalie Parde\",\"doi\":\"10.1145/3422839.3423060\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Humans perform intelligent tasks by productively leveraging relevant information from numerous sensory and experiential inputs, and recent scientific and hardware advances have made it increasingly possible for machines to attempt this as well. However, improved resource availability does not automatically give rise to humanlike performance in complex tasks [1]. In this talk, I discuss recent work towards three tasks that benefit from an elegant synthesis of linguistic and visual input: visual storytelling, visual question answering (VQA), and affective content analysis. I focus primarily on visual storytelling, a burgeoning task with the goal of generating coherent, sensible narratives for sequences of input images [2]. I analyze recent work in this area, and then introduce a novel visual storytelling approach that employs a hierarchical context-based network, with a co-attention mechanism that jointly attends to patterns in visual (image) and linguistic (description) input. Following this, I describe ongoing work in VQA, another inherently multimodal task with the goal of producing accurate, sensible answers to questions about images. 
I explore a formulation in which the VQA model generates unconstrained, free-form text, providing preliminary evidence that harnessing the linguistic patterns latent in language models results in competitive task performance [3]. Finally, I introduce some intriguing new work that investigates the utility of linguistic patterns in a task that is not inherently multimodal: analyzing the affective content of images. I close by suggesting some exciting future directions for each of these tasks as they pertain to multimodal media analysis.\",\"PeriodicalId\":270338,\"journal\":{\"name\":\"Proceedings of the 2nd International Workshop on AI for Smart TV Content Production, Access and Delivery\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-10-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2nd International Workshop on AI for Smart TV Content Production, Access and Delivery\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3422839.3423060\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2nd International Workshop on AI for Smart TV Content Production, Access and Delivery","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3422839.3423060","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1

Abstract

Humans perform intelligent tasks by productively leveraging relevant information from numerous sensory and experiential inputs, and recent scientific and hardware advances have made it increasingly possible for machines to attempt this as well. However, improved resource availability does not automatically give rise to humanlike performance in complex tasks [1]. In this talk, I discuss recent work towards three tasks that benefit from an elegant synthesis of linguistic and visual input: visual storytelling, visual question answering (VQA), and affective content analysis. I focus primarily on visual storytelling, a burgeoning task with the goal of generating coherent, sensible narratives for sequences of input images [2]. I analyze recent work in this area, and then introduce a novel visual storytelling approach that employs a hierarchical context-based network, with a co-attention mechanism that jointly attends to patterns in visual (image) and linguistic (description) input. Following this, I describe ongoing work in VQA, another inherently multimodal task with the goal of producing accurate, sensible answers to questions about images. I explore a formulation in which the VQA model generates unconstrained, free-form text, providing preliminary evidence that harnessing the linguistic patterns latent in language models results in competitive task performance [3]. Finally, I introduce some intriguing new work that investigates the utility of linguistic patterns in a task that is not inherently multimodal: analyzing the affective content of images. I close by suggesting some exciting future directions for each of these tasks as they pertain to multimodal media analysis.
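To make the co-attention idea above concrete, here is a minimal sketch of a parallel co-attention layer in the spirit of Lu et al. (2016). It is not the hierarchical context-based network from the talk; the dimensions, module names, and the final attention-pooling step are illustrative assumptions only.

```python
# A minimal sketch of parallel co-attention over image regions and description tokens.
# NOT the hierarchical context-based network from the talk; all sizes and names are
# illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CoAttention(nn.Module):
    """Jointly attends over image-region features and description-token features."""

    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.affinity = nn.Linear(dim, dim, bias=False)   # relates tokens to regions
        self.img_proj = nn.Linear(dim, hidden, bias=False)
        self.txt_proj = nn.Linear(dim, hidden, bias=False)
        self.img_score = nn.Linear(hidden, 1, bias=False)
        self.txt_score = nn.Linear(hidden, 1, bias=False)

    def forward(self, img: torch.Tensor, txt: torch.Tensor):
        # img: (B, R, dim) region features; txt: (B, T, dim) token features.
        # Affinity matrix C compares every token with every region: (B, T, R).
        C = torch.tanh(self.affinity(txt) @ img.transpose(1, 2))

        # Each modality is transformed, then mixed with the other modality via C.
        H_img = torch.tanh(self.img_proj(img) + C.transpose(1, 2) @ self.txt_proj(txt))  # (B, R, hidden)
        H_txt = torch.tanh(self.txt_proj(txt) + C @ self.img_proj(img))                  # (B, T, hidden)

        # Attention weights over regions and over tokens.
        a_img = F.softmax(self.img_score(H_img), dim=1)   # (B, R, 1)
        a_txt = F.softmax(self.txt_score(H_txt), dim=1)   # (B, T, 1)

        # Attended summary vector for each modality.
        v_hat = (a_img * img).sum(dim=1)   # (B, dim)
        q_hat = (a_txt * txt).sum(dim=1)   # (B, dim)
        return v_hat, q_hat


if __name__ == "__main__":
    layer = CoAttention(dim=512)
    regions = torch.randn(2, 36, 512)   # e.g., 36 detected regions per image
    tokens = torch.randn(2, 20, 512)    # e.g., 20 encoded description tokens
    v_hat, q_hat = layer(regions, tokens)
    print(v_hat.shape, q_hat.shape)     # torch.Size([2, 512]) torch.Size([2, 512])
```

In a storytelling setting, attended summaries such as v_hat and q_hat could then feed a hierarchical decoder that emits one story sentence per image in the input sequence.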
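The VQA formulation can be contrasted with classification-style VQA, which scores a fixed list of candidate answers: in the sketch below, an autoregressive decoder generates an unconstrained answer over the full vocabulary, which is what lets linguistic patterns learned by language models be exploited. The fused image+question vector, vocabulary size, special tokens, and greedy decoding loop are assumptions for illustration, not the model from [3].

```python
# A minimal sketch of free-form answer generation for VQA. The fusion of image and
# question features is abstracted into a single context vector; vocabulary size,
# special tokens, and greedy decoding are illustrative assumptions.
import torch
import torch.nn as nn

VOCAB = 1000     # assumed token vocabulary size
BOS, EOS = 1, 2  # assumed special token ids


class FreeFormAnswerDecoder(nn.Module):
    """Generates an unconstrained answer token-by-token from a fused context."""

    def __init__(self, ctx_dim: int = 512, emb_dim: int = 256, hid_dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, emb_dim)
        self.init_h = nn.Linear(ctx_dim, hid_dim)   # fused features -> initial decoder state
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, VOCAB)        # distribution over the full vocabulary,
                                                    # not over a fixed answer set

    @torch.no_grad()
    def generate(self, context: torch.Tensor, max_len: int = 10) -> torch.Tensor:
        # context: (B, ctx_dim) fused image+question representation.
        h = torch.tanh(self.init_h(context)).unsqueeze(0)             # (1, B, hid_dim)
        token = torch.full((context.size(0), 1), BOS, dtype=torch.long)
        answer = []
        for _ in range(max_len):
            emb = self.embed(token)                 # (B, 1, emb_dim)
            out, h = self.rnn(emb, h)
            token = self.out(out).argmax(dim=-1)    # greedy next token, (B, 1)
            answer.append(token)
            if (token == EOS).all():
                break
        return torch.cat(answer, dim=1)             # (B, <=max_len) answer token ids


if __name__ == "__main__":
    decoder = FreeFormAnswerDecoder()
    fused = torch.randn(2, 512)   # stand-in for a fused image+question vector
    print(decoder.generate(fused).shape)
```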