{"title":"基于支持向量机的音频信号语音与非语音分割","authors":"T. Danisman, A. Alpkocak","doi":"10.1109/SIU.2007.4298688","DOIUrl":null,"url":null,"abstract":"In this study, we have presented a speech vs nonspeech segmentation of audio signals extracted from video. We have used 4330 seconds of audio signal extracted from \"Lost\" TV series for training. Our training set is automatically builded by using timestamp information exists in subtitles. After that, silence areas within those speech areas are discarded with a further study. Then, standard deviation of MFCC feature vectors of size 20 have been obtained. Finally, Support Vector Machines (SVM) is used with one-vs-all method for the classification. We have used 7545 seconds of audio signal from \"Lost\" and \"How I Met Your Mother\" TV Series. We achieved an overall accuracy of 87.77% for speech vs non-speech segmentation and 90.33% recall value for non-speech classes.","PeriodicalId":315147,"journal":{"name":"2007 IEEE 15th Signal Processing and Communications Applications","volume":"160 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2007-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":"{\"title\":\"Speech vs Nonspeech Segmentation of Audio Signals Using Support Vector Machines\",\"authors\":\"T. Danisman, A. Alpkocak\",\"doi\":\"10.1109/SIU.2007.4298688\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this study, we have presented a speech vs nonspeech segmentation of audio signals extracted from video. We have used 4330 seconds of audio signal extracted from \\\"Lost\\\" TV series for training. Our training set is automatically builded by using timestamp information exists in subtitles. After that, silence areas within those speech areas are discarded with a further study. Then, standard deviation of MFCC feature vectors of size 20 have been obtained. Finally, Support Vector Machines (SVM) is used with one-vs-all method for the classification. We have used 7545 seconds of audio signal from \\\"Lost\\\" and \\\"How I Met Your Mother\\\" TV Series. We achieved an overall accuracy of 87.77% for speech vs non-speech segmentation and 90.33% recall value for non-speech classes.\",\"PeriodicalId\":315147,\"journal\":{\"name\":\"2007 IEEE 15th Signal Processing and Communications Applications\",\"volume\":\"160 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2007-06-11\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"4\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2007 IEEE 15th Signal Processing and Communications Applications\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SIU.2007.4298688\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2007 IEEE 15th Signal Processing and Communications Applications","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SIU.2007.4298688","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Speech vs Nonspeech Segmentation of Audio Signals Using Support Vector Machines
In this study, we present speech vs. non-speech segmentation of audio signals extracted from video. For training, we used 4330 seconds of audio extracted from the "Lost" TV series. The training set is built automatically from the timestamp information in the subtitles. Silence regions within these speech segments are then discarded in a further processing step. Next, the standard deviation of 20-dimensional MFCC feature vectors is computed. Finally, a Support Vector Machine (SVM) with the one-vs-all method is used for classification. For testing, we used 7545 seconds of audio from the "Lost" and "How I Met Your Mother" TV series. We achieved an overall accuracy of 87.77% for speech vs. non-speech segmentation and a recall of 90.33% for the non-speech classes.
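The sketch below illustrates the pipeline described in the abstract, assuming librosa for MFCC extraction and scikit-learn for the SVM (the paper does not name its tools); the window length, sampling rate, kernel choice, and file names are illustrative assumptions, not the authors' settings.

    import numpy as np
    import librosa                      # assumed MFCC library; not specified in the paper
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import SVC

    def window_features(path, win_sec=1.0, sr=16000):
        """Standard deviation of 20 MFCC coefficients per fixed-length window."""
        y, sr = librosa.load(path, sr=sr)
        hop = int(win_sec * sr)
        feats = []
        for start in range(0, len(y) - hop + 1, hop):
            mfcc = librosa.feature.mfcc(y=y[start:start + hop], sr=sr, n_mfcc=20)
            feats.append(mfcc.std(axis=1))          # one 20-dim vector per window
        return np.array(feats)

    # Usage (hypothetical file names and labels):
    # X_train = window_features("lost_training.wav")
    # y_train = np.loadtxt("labels.txt")             # 1 = speech, 0 = non-speech,
    #                                                #   derived from subtitle timestamps
    # clf = OneVsRestClassifier(SVC(kernel="rbf"))   # one-vs-all SVM classification
    # clf.fit(X_train, y_train)
    # pred = clf.predict(window_features("test_episode.wav"))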