Automatic speaker analysis 2.0: Hearing the bigger picture

Björn Schuller
{"title":"自动扬声器分析2.0:听到更大的画面","authors":"Björn Schuller","doi":"10.1109/SPED.2017.7990449","DOIUrl":null,"url":null,"abstract":"Automatic Speaker Analysis has largely focused on single aspects of a speaker such as her ID, gender, emotion, personality, or health state. This broadly ignores the interdependency of all the different states and traits impacting on the one single voice production mechanism available to a human speaker. In other words, sometimes we may sound depressed, but we simply have a flu, and hardly find the energy to put more vocal effort into our articulation and sound production. Recently, this lack gave rise to an increasingly holistic speaker analysis — assessing the ‘larger picture’ in one pass such as by multi-target learning. However, for a robust assessment, this requires large amount of speech and language resources labelled in rich ways to train such interdependency, and architectures able to cope with multi-target learning of massive amounts of speech data. In this light, this contribution will discuss efficient mechanisms such as large socialmedia pre-scanning with dynamic cooperative crowd-sourcing for rapid data collection, cross-task-labelling of these data in a wider range of attributes to reach ‘big & rich’ speech data, and efficient multi-target end-to-end and end-to-evolution deep learning paradigms to learn an accordingly rich representation of diverse target tasks in efficient ways. The ultimate goal behind is to enable machines to hear the ‘entire’ person and her condition and whereabouts behind the voice and words — rather than aiming at a single aspect blind to the overall individual and its state, thus leading to the next level of Automatic Speaker Analysis.","PeriodicalId":345314,"journal":{"name":"2017 International Conference on Speech Technology and Human-Computer Dialogue (SpeD)","volume":"88 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Automatic speaker analysis 2.0: Hearing the bigger picture\",\"authors\":\"Björn Schuller\",\"doi\":\"10.1109/SPED.2017.7990449\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Automatic Speaker Analysis has largely focused on single aspects of a speaker such as her ID, gender, emotion, personality, or health state. This broadly ignores the interdependency of all the different states and traits impacting on the one single voice production mechanism available to a human speaker. In other words, sometimes we may sound depressed, but we simply have a flu, and hardly find the energy to put more vocal effort into our articulation and sound production. Recently, this lack gave rise to an increasingly holistic speaker analysis — assessing the ‘larger picture’ in one pass such as by multi-target learning. However, for a robust assessment, this requires large amount of speech and language resources labelled in rich ways to train such interdependency, and architectures able to cope with multi-target learning of massive amounts of speech data. 
In this light, this contribution will discuss efficient mechanisms such as large socialmedia pre-scanning with dynamic cooperative crowd-sourcing for rapid data collection, cross-task-labelling of these data in a wider range of attributes to reach ‘big & rich’ speech data, and efficient multi-target end-to-end and end-to-evolution deep learning paradigms to learn an accordingly rich representation of diverse target tasks in efficient ways. The ultimate goal behind is to enable machines to hear the ‘entire’ person and her condition and whereabouts behind the voice and words — rather than aiming at a single aspect blind to the overall individual and its state, thus leading to the next level of Automatic Speaker Analysis.\",\"PeriodicalId\":345314,\"journal\":{\"name\":\"2017 International Conference on Speech Technology and Human-Computer Dialogue (SpeD)\",\"volume\":\"88 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-07-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2017 International Conference on Speech Technology and Human-Computer Dialogue (SpeD)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SPED.2017.7990449\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 International Conference on Speech Technology and Human-Computer Dialogue (SpeD)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SPED.2017.7990449","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Abstract

Automatic Speaker Analysis has largely focused on single aspects of a speaker, such as her identity, gender, emotion, personality, or health state. This broadly ignores the interdependency of all the different states and traits that impact the one single voice production mechanism available to a human speaker. In other words, we may sometimes sound depressed when we simply have the flu and can hardly find the energy to put more vocal effort into our articulation and sound production. Recently, this gap has given rise to increasingly holistic speaker analysis, which assesses the ‘larger picture’ in one pass, for instance by multi-target learning. For a robust assessment, however, this requires large amounts of speech and language resources labelled in rich ways to train such interdependency, as well as architectures able to cope with multi-target learning on massive amounts of speech data. In this light, this contribution discusses efficient mechanisms such as large-scale social-media pre-scanning with dynamic cooperative crowd-sourcing for rapid data collection, cross-task labelling of these data across a wider range of attributes to reach ‘big & rich’ speech data, and efficient multi-target end-to-end and end-to-evolution deep learning paradigms to learn an accordingly rich representation of diverse target tasks. The ultimate goal is to enable machines to hear the ‘entire’ person, including her condition and whereabouts, behind the voice and words, rather than aiming at a single aspect while blind to the overall individual and her state, thus leading to the next level of Automatic Speaker Analysis.
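
To make the multi-target idea concrete, the sketch below shows a minimal shared-encoder network in PyTorch: an end-to-end front end over the raw waveform feeds several task-specific heads, so a single model predicts emotion, gender, and health state in one pass, and the summed loss lets the shared encoder pick up interdependencies between the targets. The architecture, task set, and hyper-parameters are illustrative assumptions for this sketch, not the configuration presented in the talk.

```python
# A minimal multi-target speaker-analysis sketch (assumed setup, not the
# paper's actual model): one shared encoder, several attribute heads.
import torch
import torch.nn as nn

class MultiTargetSpeakerNet(nn.Module):
    def __init__(self, n_emotions=4, n_health_states=2):
        super().__init__()
        # Shared front end: 1-D convolutions over the raw waveform,
        # mirroring the 'end-to-end' idea of learning features from audio.
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=160, stride=80),  # ~10 ms hops at 16 kHz
            nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=8, stride=4),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                       # utterance-level summary
        )
        # One lightweight head per target; all share the encoder above.
        self.heads = nn.ModuleDict({
            "emotion": nn.Linear(64, n_emotions),
            "gender": nn.Linear(64, 2),
            "health": nn.Linear(64, n_health_states),
        })

    def forward(self, waveform):                 # waveform: (batch, 1, samples)
        z = self.encoder(waveform).squeeze(-1)   # shared embedding: (batch, 64)
        return {task: head(z) for task, head in self.heads.items()}

model = MultiTargetSpeakerNet()
audio = torch.randn(8, 1, 16000)                 # one second of 16 kHz audio
outputs = model(audio)

# Joint training: sum the per-task losses so gradients from every target
# shape the same shared representation (dummy labels used here).
targets = {k: torch.randint(0, v.shape[-1], (8,)) for k, v in outputs.items()}
loss = sum(nn.functional.cross_entropy(outputs[k], targets[k]) for k in outputs)
loss.backward()
```

In the same spirit, cross-task labelling could be approximated by letting heads trained on one corpus produce pseudo-labels for the missing attributes of another, so that each corpus ends up annotated across the full task set; the sketch above would serve as the labelling model under that assumption.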