{"title":"通过外观的共同参考:视觉上基于事件的共同参考分辨率","authors":"Liming Wang, Shengyu Feng, Xudong Lin, Manling Li, Heng Ji, Shih-Fu Chang","doi":"10.18653/v1/2021.crac-1.14","DOIUrl":null,"url":null,"abstract":"Event coreference resolution is critical to understand events in the growing number of online news with multiple modalities including text, video, speech, etc. However, the events and entities depicting in different modalities may not be perfectly aligned and can be difficult to annotate, which makes the task especially challenging with little supervision available. To address the above issues, we propose a supervised model based on attention mechanism and an unsupervised model based on statistical machine translation, capable of learning the relative importance of modalities for event coreference resolution. Experiments on a video multimedia event dataset show that our multimodal models outperform text-only systems in event coreference resolution tasks. A careful analysis reveals that the performance gain of the multimodal model especially under unsupervised settings comes from better learning of visually salient events.","PeriodicalId":447425,"journal":{"name":"Proceedings of the Fourth Workshop on Computational Models of Reference, Anaphora and Coreference","volume":"69 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Coreference by Appearance: Visually Grounded Event Coreference Resolution\",\"authors\":\"Liming Wang, Shengyu Feng, Xudong Lin, Manling Li, Heng Ji, Shih-Fu Chang\",\"doi\":\"10.18653/v1/2021.crac-1.14\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Event coreference resolution is critical to understand events in the growing number of online news with multiple modalities including text, video, speech, etc. However, the events and entities depicting in different modalities may not be perfectly aligned and can be difficult to annotate, which makes the task especially challenging with little supervision available. To address the above issues, we propose a supervised model based on attention mechanism and an unsupervised model based on statistical machine translation, capable of learning the relative importance of modalities for event coreference resolution. Experiments on a video multimedia event dataset show that our multimodal models outperform text-only systems in event coreference resolution tasks. 
A careful analysis reveals that the performance gain of the multimodal model especially under unsupervised settings comes from better learning of visually salient events.\",\"PeriodicalId\":447425,\"journal\":{\"name\":\"Proceedings of the Fourth Workshop on Computational Models of Reference, Anaphora and Coreference\",\"volume\":\"69 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"1900-01-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the Fourth Workshop on Computational Models of Reference, Anaphora and Coreference\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.18653/v1/2021.crac-1.14\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the Fourth Workshop on Computational Models of Reference, Anaphora and Coreference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.18653/v1/2021.crac-1.14","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Coreference by Appearance: Visually Grounded Event Coreference Resolution
Event coreference resolution is critical to understanding events in the growing number of online news articles that combine multiple modalities, including text, video, and speech. However, the events and entities depicted in different modalities may not be perfectly aligned and can be difficult to annotate, which makes the task especially challenging when little supervision is available. To address these issues, we propose a supervised model based on an attention mechanism and an unsupervised model based on statistical machine translation, both capable of learning the relative importance of modalities for event coreference resolution. Experiments on a video multimedia event dataset show that our multimodal models outperform text-only systems on event coreference resolution tasks. A careful analysis reveals that the performance gain of the multimodal model, especially under unsupervised settings, comes from better learning of visually salient events.
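To make the idea of "learning the relative importance of modalities" concrete, below is a minimal illustrative sketch, not the authors' implementation: it assumes each event mention comes with a text embedding and a video-segment embedding, and shows one way a learned attention weight over modalities could feed a pairwise coreference score. All class names, dimensions, and the cosine-similarity scoring choice are hypothetical.

```python
# Illustrative sketch only -- NOT the paper's model. Assumes precomputed
# text and video embeddings per event mention; all names/dims hypothetical.
import torch
import torch.nn as nn


class MultimodalMentionScorer(nn.Module):
    def __init__(self, text_dim=768, video_dim=512, hidden_dim=256):
        super().__init__()
        # Project both modalities into a shared space.
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.video_proj = nn.Linear(video_dim, hidden_dim)
        # Scalar attention logit per modality view.
        self.modality_attn = nn.Linear(hidden_dim, 1)

    def encode(self, text_emb, video_emb):
        # Stack the projected modality views: (batch, 2, hidden_dim).
        views = torch.stack(
            [self.text_proj(text_emb), self.video_proj(video_emb)], dim=1
        )
        # Softmax over modalities learns their relative importance.
        weights = torch.softmax(self.modality_attn(views), dim=1)  # (batch, 2, 1)
        return (weights * views).sum(dim=1)  # fused mention, (batch, hidden_dim)

    def forward(self, m1_text, m1_video, m2_text, m2_video):
        # Cosine similarity of fused mention representations as a coreference
        # score; a real system would add a trained pairwise classifier.
        z1 = self.encode(m1_text, m1_video)
        z2 = self.encode(m2_text, m2_video)
        return torch.cosine_similarity(z1, z2, dim=-1)


if __name__ == "__main__":
    scorer = MultimodalMentionScorer()
    t1, v1 = torch.randn(1, 768), torch.randn(1, 512)
    t2, v2 = torch.randn(1, 768), torch.randn(1, 512)
    print(scorer(t1, v1, t2, v2))  # coreference score in [-1, 1]
```

Under this kind of setup, the learned modality weights would indicate how much the visual channel contributes to each mention's representation, which is one plausible way to probe the abstract's finding that the gains come from visually salient events.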