CLIP2TF: Multimodal video–text retrieval for adolescent education

IF 3.7 | Zone 2 (Engineering & Technology) | Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE
Xiaoning Sun, Tao Fan, Hongxu Li, Guozhong Wang, Peien Ge, Xiwu Shang
{"title":"CLIP2TF:面向青少年教育的多模态视频-文本检索","authors":"Xiaoning Sun,&nbsp;Tao Fan,&nbsp;Hongxu Li,&nbsp;Guozhong Wang,&nbsp;Peien Ge,&nbsp;Xiwu Shang","doi":"10.1016/j.displa.2024.102801","DOIUrl":null,"url":null,"abstract":"<div><p>With the rapid advancement of artificial intelligence technology, particularly within the sphere of adolescent education, a continual emergence of new challenges and opportunities is observed. The current educational system increasingly requires the automation of teaching activities detection and evaluation, offering fresh perspectives for enhancing the quality of adolescent education. Although large-scale models are receiving significant attention in educational research, their high demand for computational resources and limitations in specific applications constrain their widespread use in analyzing educational video content, especially when handling multimodal data. Current multimodal contrastive learning methods, which integrate video, audio, and text information, have achieved certain successes in video–text retrieval tasks. However, these methods typically employ simpler weighted fusion strategies and fail to avoid noise and information redundancy. Therefore, our study proposes a novel network framework, CLIP2TF, which includes an efficient audio–visual fusion encoder. It aims to dynamically interact and integrate visual and audio features, further enhancing the visual features that may be missing or insufficient in specific teaching scenarios while effectively reducing redundant information transfer during the modality fusion process. Through ablation experiments on the MSRVTT and MSVD datasets, we first demonstrate the effectiveness of CLIP2TF in video–text retrieval tasks. Subsequent tests on teaching video datasets further proves the applicability of the proposed method. This research not only showcases the potential of artificial intelligence in the automated assessment of teaching quality but also provides new directions for research in related fields studies.</p></div>","PeriodicalId":50570,"journal":{"name":"Displays","volume":"84 ","pages":"Article 102801"},"PeriodicalIF":3.7000,"publicationDate":"2024-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"CLIP2TF:Multimodal video–text retrieval for adolescent education\",\"authors\":\"Xiaoning Sun,&nbsp;Tao Fan,&nbsp;Hongxu Li,&nbsp;Guozhong Wang,&nbsp;Peien Ge,&nbsp;Xiwu Shang\",\"doi\":\"10.1016/j.displa.2024.102801\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>With the rapid advancement of artificial intelligence technology, particularly within the sphere of adolescent education, a continual emergence of new challenges and opportunities is observed. The current educational system increasingly requires the automation of teaching activities detection and evaluation, offering fresh perspectives for enhancing the quality of adolescent education. Although large-scale models are receiving significant attention in educational research, their high demand for computational resources and limitations in specific applications constrain their widespread use in analyzing educational video content, especially when handling multimodal data. Current multimodal contrastive learning methods, which integrate video, audio, and text information, have achieved certain successes in video–text retrieval tasks. 
However, these methods typically employ simpler weighted fusion strategies and fail to avoid noise and information redundancy. Therefore, our study proposes a novel network framework, CLIP2TF, which includes an efficient audio–visual fusion encoder. It aims to dynamically interact and integrate visual and audio features, further enhancing the visual features that may be missing or insufficient in specific teaching scenarios while effectively reducing redundant information transfer during the modality fusion process. Through ablation experiments on the MSRVTT and MSVD datasets, we first demonstrate the effectiveness of CLIP2TF in video–text retrieval tasks. Subsequent tests on teaching video datasets further proves the applicability of the proposed method. This research not only showcases the potential of artificial intelligence in the automated assessment of teaching quality but also provides new directions for research in related fields studies.</p></div>\",\"PeriodicalId\":50570,\"journal\":{\"name\":\"Displays\",\"volume\":\"84 \",\"pages\":\"Article 102801\"},\"PeriodicalIF\":3.7000,\"publicationDate\":\"2024-07-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Displays\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0141938224001653\",\"RegionNum\":2,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Displays","FirstCategoryId":"5","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0141938224001653","RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
Citations: 0

Abstract

With the rapid advancement of artificial intelligence technology, particularly in adolescent education, new challenges and opportunities continue to emerge. The current educational system increasingly requires automated detection and evaluation of teaching activities, offering fresh perspectives for enhancing the quality of adolescent education. Although large-scale models are receiving significant attention in educational research, their high demand for computational resources and their limitations in specific applications constrain their widespread use for analyzing educational video content, especially when handling multimodal data. Current multimodal contrastive learning methods, which integrate video, audio, and text information, have achieved some success in video–text retrieval tasks. However, these methods typically rely on simple weighted fusion strategies and fail to avoid noise and information redundancy. Our study therefore proposes a novel network framework, CLIP2TF, which includes an efficient audio–visual fusion encoder. It dynamically interacts and integrates visual and audio features, enhancing visual features that may be missing or insufficient in specific teaching scenarios while effectively reducing redundant information transfer during modality fusion. Through ablation experiments on the MSRVTT and MSVD datasets, we first demonstrate the effectiveness of CLIP2TF in video–text retrieval tasks. Subsequent tests on teaching video datasets further prove the applicability of the proposed method. This research not only showcases the potential of artificial intelligence in the automated assessment of teaching quality but also provides new directions for research in related fields.
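The abstract describes cross-modal fusion and contrastive retrieval only at a high level, so the sketch below illustrates one plausible way a gated cross-attention audio–visual fusion module and CLIP-style video–text similarity could be wired together in PyTorch. It is an illustrative assumption, not the actual CLIP2TF architecture: the module names, the 512-dimensional features, the gating mechanism, and the mean-pooling over frames are all placeholders introduced here.

```python
# Illustrative sketch only: gated cross-attention audio-visual fusion plus
# CLIP-style contrastive video-text similarity. All design choices here are
# assumptions for exposition, not the published CLIP2TF model.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioVisualFusion(nn.Module):
    """Fuse frame-level visual features with audio features via cross-attention.

    Visual tokens attend to audio tokens, so audio can compensate for missing or
    weak visual cues, while a gating step limits how much (possibly redundant)
    audio content is injected back into the visual stream.
    """

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # visual: (B, T_v, D) frame features; audio: (B, T_a, D) audio features
        attended, _ = self.cross_attn(query=visual, key=audio, value=audio)
        gate = self.gate(torch.cat([visual, attended], dim=-1))  # values in [0, 1]
        return self.norm(visual + gate * attended)               # gated residual fusion


def clip_style_similarity(video_emb: torch.Tensor, text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Cosine-similarity logits used for contrastive video-text retrieval."""
    v = F.normalize(video_emb, dim=-1)   # (B, D)
    t = F.normalize(text_emb, dim=-1)    # (B, D)
    return v @ t.t() / temperature       # (B, B) video-to-text logits


if __name__ == "__main__":
    B, T_v, T_a, D = 4, 12, 20, 512
    fusion = AudioVisualFusion(dim=D)
    visual = torch.randn(B, T_v, D)                 # stand-in for CLIP frame features
    audio = torch.randn(B, T_a, D)                  # stand-in for audio encoder output
    fused = fusion(visual, audio)                   # (4, 12, 512)
    video_emb = fused.mean(dim=1)                   # simple temporal pooling
    text_emb = torch.randn(B, D)                    # stand-in for text encoder output
    logits = clip_style_similarity(video_emb, text_emb)
    # Symmetric InfoNCE loss over the in-batch video-text pairs
    labels = torch.arange(B)
    loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
    print(fused.shape, logits.shape, loss.item())
```

The gating step is one plausible reading of "reducing redundant information transfer": audio content only enters the visual stream to the degree the gate opens, so uninformative or duplicated audio contributes little to the fused representation.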

Source journal: Displays (Engineering & Technology — Electrical & Electronic Engineering)
CiteScore: 4.60
Self-citation rate: 25.60%
Articles per year: 138
Review time: 92 days
Journal description: Displays is the international journal covering the research and development of display technology, its effective presentation and perception of information, and applications and systems including the display-human interface. Technical papers on practical developments in display technology provide an effective channel to promote greater understanding and cross-fertilization across the diverse disciplines of the displays community. Original research papers solving ergonomics issues at the display-human interface advance the effective presentation of information. Tutorial papers covering fundamentals, intended for display technology and human-factors engineers new to the field, will also occasionally be featured.