Grammar induction from visual, speech and text

IF 4.6 | Zone 2: Computer Science | JCR Q1: COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Artificial Intelligence Pub Date : 2025-02-12 DOI:10.1016/j.artint.2025.104306

Yu Zhao, Hao Fei, Shengqiong Wu, Meishan Zhang, Min Zhang, Tat-seng Chua

Artificial Intelligence, volume 341, Article 104306. Full text: https://www.sciencedirect.com/science/article/pii/S0004370225000256

Citations: 0

Abstract

Grammar Induction (GI) seeks to uncover the underlying grammatical rules and linguistic patterns of a language, which makes it a pivotal research topic within Artificial Intelligence (AI). Although research in GI has predominantly focused on text or other single modalities, we show that GI can benefit significantly from rich heterogeneous signals such as text, vision, and acoustics, since features from distinct modalities play complementary roles. Building on this intuition, this work introduces a novel unsupervised visual-audio-text grammar induction task (named VAT-GI), which induces constituent grammar trees from parallel image, text, and speech inputs. Because grammar exists natively beyond text, we argue that text need not be the predominant modality in grammar induction. We therefore further introduce a textless setting of VAT-GI, in which the task relies solely on visual and auditory inputs. To approach the task, we propose a visual-audio-text inside-outside recursive autoencoder (VaTiora) framework, which leverages rich modality-specific and complementary features for effective grammar parsing. We also construct a more challenging benchmark dataset to assess the generalization ability of VAT-GI systems. Experiments on two benchmark datasets demonstrate that the proposed VaTiora system incorporates the various multimodal signals more effectively and achieves new state-of-the-art performance on VAT-GI. Further in-depth analyses provide a deeper understanding of the VAT-GI task and of how VaTiora advances it. Our code and data: https://github.com/LLLogen/VAT-GI/.
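As a rough illustration of what inducing a constituent tree involves, the sketch below runs a CKY-style chart search for the highest-scoring binary bracketing of a sentence, given a score for each span. It is not the VaTiora model and makes no claim about the authors' implementation: the function name `best_binary_tree` and the hand-set `span_score` table are hypothetical, standing in for scores that a trained model would derive from text, visual, and speech features.

```python
from typing import Dict, Tuple


def best_binary_tree(
    n: int, span_score: Dict[Tuple[int, int], float]
) -> Tuple[float, object]:
    """Return (score, tree) for the best binary tree over tokens 0..n-1.

    span_score maps a half-open span (i, j) to a score; spans that are
    missing from the table score 0. Leaves are token indices and internal
    nodes are (left, right) pairs. The scores are illustrative only.
    """
    best = {}
    # Width-1 spans are single tokens: score 0, tree is the token index.
    for i in range(n):
        best[(i, i + 1)] = (0.0, i)
    # Fill the chart bottom-up, from narrow spans to the full sentence.
    for width in range(2, n + 1):
        for i in range(0, n - width + 1):
            j = i + width
            candidates = []
            # Try every split point k and combine the two best sub-trees.
            for k in range(i + 1, j):
                left_score, left_tree = best[(i, k)]
                right_score, right_tree = best[(k, j)]
                candidates.append(
                    (left_score + right_score, (left_tree, right_tree))
                )
            total, tree = max(candidates, key=lambda c: c[0])
            best[(i, j)] = (total + span_score.get((i, j), 0.0), tree)
    return best[(0, n)]


# Four tokens; the hypothetical scores favour grouping tokens 0-1 and 2-3.
score, tree = best_binary_tree(4, {(0, 2): 2.0, (2, 4): 1.0, (1, 3): 0.5})
print(score, tree)
```

In an unsupervised setting, the span scores are not supervised by gold trees; they would have to come from a training signal such as the reconstruction objective of an autoencoder.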




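Source journal

Artificial Intelligence (Engineering & Technology; Computer Science: Artificial Intelligence)

CiteScore: 11.20

Self-citation rate: 1.40%

Articles published: 118

Review time: 8 months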

Journal description: The Journal of Artificial Intelligence (AIJ) welcomes papers covering a broad spectrum of AI topics, including cognition, automated reasoning, computer vision, machine learning, and more. Papers should demonstrate advancements in AI and propose innovative approaches to AI problems. Additionally, the journal accepts papers describing AI applications, focusing on how new methods enhance performance rather than reiterating conventional approaches. In addition to regular papers, AIJ also accepts Research Notes, Research Field Reviews, Position Papers, Book Reviews, and summary papers on AI challenges and competitions.