A Bilingual, Open World Video Text Dataset and Real-Time Video Text Spotting With Contrastive Learning
Authors: Weijia Wu; Zhuang Li; Yuanqiang Cai; Hong Zhou; Mike Zheng Shou
Journal: IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 1, pp. 534-546 (JCR Q1, Engineering, Electrical & Electronic; impact factor 8.3)
Published: 2024-09-04
DOI: 10.1109/TCSVT.2024.3454331
URL: https://ieeexplore.ieee.org/document/10664465/
Most existing video text spotting benchmarks focus on evaluating a single language and scenario with limited data. In this work, we introduce a large-scale, Bilingual, Open World Video text benchmark dataset (BOVText). BOVText has four features. Firstly, we provide 2,021 videos with more than 1,750,000 frames, 25 times larger than the existing largest dataset with incidental text in videos. Secondly, our dataset covers 32 open scenarios, including many virtual scenarios, e.g., Life Vlog, Driving, Movie, and Game. Thirdly, abundant text type annotations (i.e., title, caption, or scene text) are provided for the different representational meanings in the video. Fourthly, BOVText provides bilingual text annotations to promote life and communication across cultures. Besides, we propose a real-time end-to-end video text spotter with Contrastive Learning of Semantic and Visual Representation (CoText), which offers two advantages: 1) With a lightweight architecture, CoText simultaneously addresses three tasks (i.e., text detection, tracking, and recognition) in a real-time end-to-end trainable framework. 2) CoText tracks texts by comprehending them and relating them to each other with visual and semantic representations. Extensive experiments show the superiority of our method. In particular, CoText achieves a video text spotting $\mathrm{ID_{F1}}$ of 71.7% at 32.3 FPS on ICDAR2015video, improvements of 10.2% and 23.3 FPS over the previous best method. The dataset and the code of CoText can be found at the links Dataset and CoText, respectively.
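For readers unfamiliar with the $\mathrm{ID_{F1}}$ metric quoted above, the following is a minimal sketch of the standard identity F1 definition used in multi-object tracking, $\mathrm{ID_{F1}} = 2\,\mathrm{IDTP} / (2\,\mathrm{IDTP} + \mathrm{IDFP} + \mathrm{IDFN})$. The function name `id_f1` and the counts in the usage line are hypothetical and chosen only for illustration; they are not taken from the paper, and the paper's evaluation protocol for spotting (which also involves recognition) may add matching criteria not shown here.

```python
def id_f1(idtp: int, idfp: int, idfn: int) -> float:
    """Identity F1 from identity-level counts.

    idtp: detections matched to the correct trajectory identity
    idfp: predicted detections with no correct identity match
    idfn: ground-truth detections with no correct identity match
    """
    denom = 2 * idtp + idfp + idfn
    if denom == 0:
        # No ground truth and no predictions: define the score as 0.0.
        return 0.0
    return 2 * idtp / denom


# Hypothetical counts (not from the paper), picked so the score is 0.717.
score = id_f1(idtp=717, idfp=283, idfn=283)
print(f"IDF1 = {score:.3f}")
```

Because the metric is computed over identity-consistent matches, a tracker that detects every text instance but frequently switches identities still scores low, which is why it is a common headline number for tracking-based spotting.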
About the journal:
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.