CycleGAN*：利用改进的对抗神经网络进行多模态数据的人工智能协作学习

IEEE transactions on artificial intelligence Pub Date : 2024-07-23 DOI:10.1109/TAI.2024.3432856

Yibo He;Kah Phooi Seng;Li Minn Ang

{"title":"CycleGAN*：利用改进的对抗神经网络进行多模态数据的人工智能协作学习","authors":"Yibo He;Kah Phooi Seng;Li Minn Ang","doi":"10.1109/TAI.2024.3432856","DOIUrl":null,"url":null,"abstract":"With the widespread adoption of generative adversarial networks (GANs) for sample generation, this article aims to enhance adversarial neural networks to facilitate collaborative artificial intelligence (AI) learning which has been specifically tailored to handle datasets containing multimodalities. Currently, a significant portion of the literature is dedicated to sample generation using GANs, with the objective of enhancing the detection performance of machine learning (ML) classifiers through the incorporation of these generated data into the original training set via adversarial training. The quality of the generated adversarial samples is contingent upon the sufficiency of training data samples. However, in the multimodal domain, the scarcity of multimodal data poses a challenge due to resource constraints. In this article, we address this challenge by proposing a new multimodal dataset generation approach based on the classical audio–visual speech recognition (AVSR) task, utilizing CycleGAN, DiscoGAN, and StyleGAN2 for exploration and performance comparison. AVSR experiments are conducted using the LRS2 and LRS3 corpora. Our experiments reveal that CycleGAN, DiscoGAN, and StyleGAN2 do not effectively address the low-data state problem in AVSR classification. Consequently, we introduce an enhanced model, CycleGAN*, based on the original CycleGAN, which efficiently learns the original dataset features and generates high-quality multimodal data. Experimental results demonstrate that the multimodal datasets generated by our proposed CycleGAN* exhibit significant improvement in word error rate (WER), indicating reduced errors. Notably, the images produced by CycleGAN* exhibit a marked enhancement in overall visual clarity, indicative of its superior generative capabilities. Furthermore, in contrast to traditional approaches, we underscore the significance of collaborative learning. We implement co-training with diverse multimodal data to facilitate information sharing and complementary learning across modalities. This collaborative approach enhances the model’s capability to integrate heterogeneous information, thereby boosting its performance in multimodal environments.","PeriodicalId":73305,"journal":{"name":"IEEE transactions on artificial intelligence","volume":"5 11","pages":"5616-5629"},"PeriodicalIF":0.0000,"publicationDate":"2024-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"CycleGAN*: Collaborative AI Learning With Improved Adversarial Neural Networks for Multimodalities Data\",\"authors\":\"Yibo He;Kah Phooi Seng;Li Minn Ang\",\"doi\":\"10.1109/TAI.2024.3432856\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"With the widespread adoption of generative adversarial networks (GANs) for sample generation, this article aims to enhance adversarial neural networks to facilitate collaborative artificial intelligence (AI) learning which has been specifically tailored to handle datasets containing multimodalities. Currently, a significant portion of the literature is dedicated to sample generation using GANs, with the objective of enhancing the detection performance of machine learning (ML) classifiers through the incorporation of these generated data into the original training set via adversarial training. The quality of the generated adversarial samples is contingent upon the sufficiency of training data samples. However, in the multimodal domain, the scarcity of multimodal data poses a challenge due to resource constraints. In this article, we address this challenge by proposing a new multimodal dataset generation approach based on the classical audio–visual speech recognition (AVSR) task, utilizing CycleGAN, DiscoGAN, and StyleGAN2 for exploration and performance comparison. AVSR experiments are conducted using the LRS2 and LRS3 corpora. Our experiments reveal that CycleGAN, DiscoGAN, and StyleGAN2 do not effectively address the low-data state problem in AVSR classification. Consequently, we introduce an enhanced model, CycleGAN*, based on the original CycleGAN, which efficiently learns the original dataset features and generates high-quality multimodal data. Experimental results demonstrate that the multimodal datasets generated by our proposed CycleGAN* exhibit significant improvement in word error rate (WER), indicating reduced errors. Notably, the images produced by CycleGAN* exhibit a marked enhancement in overall visual clarity, indicative of its superior generative capabilities. Furthermore, in contrast to traditional approaches, we underscore the significance of collaborative learning. We implement co-training with diverse multimodal data to facilitate information sharing and complementary learning across modalities. This collaborative approach enhances the model’s capability to integrate heterogeneous information, thereby boosting its performance in multimodal environments.\",\"PeriodicalId\":73305,\"journal\":{\"name\":\"IEEE transactions on artificial intelligence\",\"volume\":\"5 11\",\"pages\":\"5616-5629\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-07-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE transactions on artificial intelligence\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10607911/\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on artificial intelligence","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10607911/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

随着生成式对抗网络（GAN）在样本生成方面的广泛应用，本文旨在增强对抗神经网络，以促进协作式人工智能（AI）学习，这种学习是专门为处理包含多模态的数据集而量身定制的。目前，有相当一部分文献致力于使用 GAN 生成样本，目的是通过对抗训练将这些生成的数据纳入原始训练集，从而提高机器学习（ML）分类器的检测性能。生成的对抗样本的质量取决于训练数据样本是否充足。然而，在多模态领域，由于资源限制，多模态数据的稀缺性带来了挑战。在本文中，我们提出了一种基于经典视听语音识别（AVSR）任务的新的多模态数据集生成方法，利用 CycleGAN、DiscoGAN 和 StyleGAN2 进行探索和性能比较，从而应对这一挑战。AVSR 实验使用 LRS2 和 LRS3 语料库进行。实验结果表明，CycleGAN、DiscoGAN 和 StyleGAN2 无法有效解决 AVSR 分类中的低数据状态问题。因此，我们在原始 CycleGAN 的基础上引入了一个增强模型 CycleGAN*，它能有效地学习原始数据集特征并生成高质量的多模态数据。实验结果表明，由我们提出的 CycleGAN* 生成的多模态数据集在字错误率（WER）方面有显著改善，表明错误减少。值得注意的是，CycleGAN* 生成的图像在整体视觉清晰度上有明显提高，这表明它具有卓越的生成能力。此外，与传统方法相比，我们强调了协作学习的重要性。我们利用多样化的多模态数据实施协同训练，以促进信息共享和跨模态互补学习。这种协作方法增强了模型整合异构信息的能力，从而提高了模型在多模态环境中的性能。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

CycleGAN*: Collaborative AI Learning With Improved Adversarial Neural Networks for Multimodalities Data

With the widespread adoption of generative adversarial networks (GANs) for sample generation, this article aims to enhance adversarial neural networks to facilitate collaborative artificial intelligence (AI) learning which has been specifically tailored to handle datasets containing multimodalities. Currently, a significant portion of the literature is dedicated to sample generation using GANs, with the objective of enhancing the detection performance of machine learning (ML) classifiers through the incorporation of these generated data into the original training set via adversarial training. The quality of the generated adversarial samples is contingent upon the sufficiency of training data samples. However, in the multimodal domain, the scarcity of multimodal data poses a challenge due to resource constraints. In this article, we address this challenge by proposing a new multimodal dataset generation approach based on the classical audio–visual speech recognition (AVSR) task, utilizing CycleGAN, DiscoGAN, and StyleGAN2 for exploration and performance comparison. AVSR experiments are conducted using the LRS2 and LRS3 corpora. Our experiments reveal that CycleGAN, DiscoGAN, and StyleGAN2 do not effectively address the low-data state problem in AVSR classification. Consequently, we introduce an enhanced model, CycleGAN*, based on the original CycleGAN, which efficiently learns the original dataset features and generates high-quality multimodal data. Experimental results demonstrate that the multimodal datasets generated by our proposed CycleGAN* exhibit significant improvement in word error rate (WER), indicating reduced errors. Notably, the images produced by CycleGAN* exhibit a marked enhancement in overall visual clarity, indicative of its superior generative capabilities. Furthermore, in contrast to traditional approaches, we underscore the significance of collaborative learning. We implement co-training with diverse multimodal data to facilitate information sharing and complementary learning across modalities. This collaborative approach enhances the model’s capability to integrate heterogeneous information, thereby boosting its performance in multimodal environments.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

IEEE transactions on artificial intelligence

CiteScore

7.70

自引率

0.00%

发文量