A Bilingual, Open World Video Text Dataset and Real-Time Video Text Spotting With Contrastive Learning

IF 8.3 | Zone 1, Engineering & Technology | Q1 ENGINEERING, ELECTRICAL & ELECTRONIC

IEEE Transactions on Circuits and Systems for Video Technology Pub Date : 2024-09-04 DOI:10.1109/TCSVT.2024.3454331

Weijia Wu;Zhuang Li;Yuanqiang Cai;Hong Zhou;Mike Zheng Shou

Volume 35, Issue 1, pp. 534-546. Full text: https://ieeexplore.ieee.org/document/10664465/

Citations: 0

Abstract

A Bilingual, Open World Video Text Dataset and Real-Time Video Text Spotting With Contrastive Learning

Most existing video text spotting benchmarks evaluate a single language and a single scenario with limited data. In this work, we introduce a large-scale, Bilingual, Open World Video text benchmark dataset (BOVText). BOVText has four features. Firstly, we provide 2,021 videos with more than 1,750,000 frames, 25 times larger than the largest existing dataset with incidental text in videos. Secondly, our dataset covers 32 open scenarios, including many virtual scenarios, e.g., Life Vlog, Driving, Movie, and Game. Thirdly, abundant text type annotations (i.e., title, caption, or scene text) are provided for the different representational meanings in the video. Fourthly, BOVText provides bilingual text annotations to promote cross-cultural life and communication. In addition, we propose a real-time, end-to-end video text spotter with Contrastive Learning of Semantic and Visual Representation (CoText), which has two advantages: 1) With a lightweight architecture, CoText simultaneously addresses three tasks (i.e., text detection, tracking, and recognition) in a real-time, end-to-end trainable framework. 2) CoText tracks texts by comprehending them and relating them to each other with visual and semantic representations. Extensive experiments show the superiority of our method. In particular, CoText achieves a video text spotting $\mathrm{ID_{F1}}$ of 71.7% at 32.3 FPS on ICDAR2015video, improvements of 10.2% and 23.3 FPS over the previous best method. The dataset and code of CoText can be found at: Dataset and CoText, respectively.
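The abstract reports an $\mathrm{ID_{F1}}$ score without defining it. Assuming it follows the identity F1 measure commonly used in multi-object tracking evaluation, it is the harmonic mean of identity precision and identity recall, computed from identity-level true positives (IDTP), false positives (IDFP), and false negatives (IDFN). The sketch below illustrates that formula only. The function name and the counts are hypothetical, and the exact matching rules used for spotting on ICDAR2015video (for example, whether a match also requires a correct transcription) are not stated in this abstract.

```python
def id_f1(idtp: int, idfp: int, idfn: int) -> float:
    """Identity F1 under the standard tracking definition (assumed here).

    IDF1 = 2 * IDTP / (2 * IDTP + IDFP + IDFN)
    """
    denominator = 2 * idtp + idfp + idfn
    if denominator == 0:
        # No predictions and no ground truth: define the score as 0.0.
        return 0.0
    return 2 * idtp / denominator


def id_precision(idtp: int, idfp: int) -> float:
    """Fraction of predicted detections assigned to the correct identity."""
    total = idtp + idfp
    return idtp / total if total else 0.0


def id_recall(idtp: int, idfn: int) -> float:
    """Fraction of ground-truth detections recovered with the correct identity."""
    total = idtp + idfn
    return idtp / total if total else 0.0


if __name__ == "__main__":
    # Made-up counts for illustration only; they are not taken from the paper.
    idtp, idfp, idfn = 80, 20, 20
    print(f"IDP  = {id_precision(idtp, idfp):.3f}")
    print(f"IDR  = {id_recall(idtp, idfn):.3f}")
    print(f"IDF1 = {id_f1(idtp, idfp, idfn):.3f}")
```

Because the score depends on identity-consistent matches across the whole video, a tracker that detects every text instance but frequently swaps identities is penalized even when its per-frame detection accuracy is high.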

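Source journal

IEEE Transactions on Circuits and Systems for Video Technology (Engineering & Technology - Engineering: Electrical & Electronic)

CiteScore: 13.80

Self-citation rate: 27.40%

Articles published: 660

Review time: 5 months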

Journal description: The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.