A preliminary study of challenges in extracting purity videos from the AV Speech Benchmark
Haoran Yan, Huijun Lu, Dunbo Cai, Tao Hang, Ling Qian
2022 7th International Conference on Multimedia and Image Processing · DOI: 10.1145/3517077.3517091 · Published 2022-01-14
Recently reported deep audio-visual models have shown promising results on the cocktail party problem and are attracting new studies. Audio-visual datasets are an important basis for this work. Here we investigate the AVSpeech dataset [1], a popular dataset launched by the Google team for training deep audio-visual models for multi-talker speech separation. Our goal is to derive a special kind of video, which we call purity video, from the dataset: a purity video is a clip whose consecutive frames all show the face of the same person. A natural question is how to extract as many purity videos as possible from the AVSpeech dataset. This paper presents the tools and methods we used, the problems we encountered, and the purity videos we obtained. Our main contributions are as follows: 1) we propose a solution for extracting a derived subset of the AVSpeech dataset that is of high quality and larger than the publicly available training sets; 2) we implemented this solution, performed experiments on the AVSpeech dataset, and obtained insightful results; 3) we evaluated the solution on VTData, a dataset we labeled manually. Experiments show that our solution is effective and robust. We hope this work helps the community exploit the AVSpeech dataset for other video understanding tasks.
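The abstract defines a purity video but does not reproduce the extraction pipeline here. The sketch below shows one plausible form such a segmenter could take, assuming face detection plus embedding comparison using the open-source face_recognition and OpenCV packages; the function name purity_segments, the 0.6 distance threshold, and the once-per-second sampling rate are illustrative assumptions, not the authors' actual method.

```python
# A minimal sketch of one plausible purity-video extraction loop, NOT the
# authors' pipeline. Assumes the `face_recognition` and `opencv-python`
# packages; threshold and sampling rate are illustrative.
import cv2
import face_recognition

def purity_segments(video_path, max_face_distance=0.6, sample_every=1.0):
    """Yield (start_sec, end_sec) spans where exactly one face is visible
    and its embedding stays close to the segment's first face."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    step = max(1, int(fps * sample_every))  # sample ~once per `sample_every` seconds
    start, ref_encoding, frame_idx = None, None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % step == 0:
            t = frame_idx / fps
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            locations = face_recognition.face_locations(rgb)
            encodings = face_recognition.face_encodings(rgb, locations)
            same_person = False
            if len(encodings) == 1:  # purity requires exactly one visible face
                if ref_encoding is None:
                    same_person = True
                    ref_encoding = encodings[0]
                else:
                    dist = face_recognition.face_distance([ref_encoding], encodings[0])[0]
                    same_person = dist <= max_face_distance
            if same_person:
                if start is None:
                    start = t
            elif start is not None:
                # Segment broken: no face, multiple faces, or a new identity.
                yield (start, t)
                start, ref_encoding = None, None
        frame_idx += 1
    if start is not None:
        yield (start, frame_idx / fps)
    cap.release()
```

For example, list(purity_segments("clip.mp4")) would return candidate (start, end) spans in seconds; in practice a minimum-duration filter would likely be applied before a span is kept as a purity video.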