R. Cole, D. Novick, M. Fanty, Pieter J. E. Vermeulen, S. Sutton, D. Burnett, J. Schalkwyk
{"title":"美国人口普查语音应答问卷的原型","authors":"R. Cole, D. Novick, M. Fanty, Pieter J. E. Vermeulen, S. Sutton, D. Burnett, J. Schalkwyk","doi":"10.21437/ICSLP.1994-173","DOIUrl":null,"url":null,"abstract":"APROTOTYPEVOICE-RESPONSEQUESTIONNAIREFORTHEU.S.CENSUSRonald Cole, David G. Novick, Mark Fanty,Pieter Vermeulen, Stephen Sutton, Dan Burnett and Johan SchalkwykCenter for Sp oken Language UnderstandingOregon Graduate Institute of Science and Technology20000 N.W. Walker Road, P.O. Box 91000, Portland, OR 97291-1000, USAABSTRACTThis pap er describ es a study conducted to determine thefeasibilityof using a sp okenquestionnaireto collect infor-mationfortheYear2000CensusinUSA.Tore nethedialogueandtotrainrecognizers,wecollectedcom-pleteproto colsfromover4000callers.Fortheresp onseslab eled(ab outhalf ),over99p ercentoftheanswerscon-tain the desired information.The recognizers trained so farrangeinp erformancefrom75p ercentcorrectonyearofbirth to over 99 p ercent for maritalstatus.We develop eda prototyp esystemthat engagesthecallersina dialoguetoobtainthedesiredinformation,reviewsrecognizedinformationatthe endof thecall,andasksthecallertoidentify the resp onse categories that are incorrect.1.INTRODUCTIONWehavconductedastudytodeterminethefeasibilityof usingan automatedsp okenquestionnaireto collectin-formationfortheYear2000CensusinUnitedStatesofAmerica.Thegoalthestudywastodevelopandevaluate a telephone questionnaire that automaticall y cap-turesandrecognizesthefollowinginformation:(1)fullname, (2) sex, (3) birth date, (4) marital status (now mar-ried, widowed, divorced, separated, never married|cho oseone),(5)Hispanicorigin(yesorno);ifHispanic:Mexi-can, Mexican-American, Chicano, Puerto Rican, Cuban orother (sp eci y), (6) race:White, Black or Negro, AmericanIndian(sp ecifytrib e),Eskimo,Aleut,Chinese,Japanese,Filipino ,AsianIndian,Hawaiian,Samoan,Korean,Gua-manian, Vietnamese or other (sp ecify).After preliminaryrounds of data collectionto re ne theselectionandwordingof the system prompts,a large,re-gionallydiversedatacollectione ortresultedinapproxi-mately4000calls.Thispap erdescrib esthee ectivenessof theproto colinelicitingthedesiredinformationanditdescrib es the sp oken language system that resulted.2.SYSTEM2.1.RecognitionSignal Pro cessing.The caller's resp onse is transmitted overthe digital phone line as a 8 kHz mu-law enco ded digital sig-nal.A seventhorderPerceptualLinearPredictive(PLP)analysis [1] is p erformed every 6 msec using a 10 msec win-dow.Phonetic Classi cation.Each 6 msec frame of the signalisclassi edphoneticall ybyathreelaerneuralnetwork.To achieve maximum p erformance, a separate vo cabulary-dep endentnetworkistrainedforeachresp onsecategory,using a phoneme set particular to the exp ected pronuncia-tionsof words in that resp onsecategory.This consistsofthesubsetofstandardphonemeswhicho ccurinvo-cabulary, plus any additional context-dep endent phonemeswhichweredeemednecessary(e.g.[tw]forthe[t]in\\twenty\" and \\telve\").The background noise and silenceare mo deled by a sp ecial phoneme [.pau].For each frame of sp eech, the neural network is providedwith 70 inputs, which consists of eight PLP co e\u000ecients andtwovoicingoutputsfromtheframetob eclassi edandaveraged oer each of the following regions b efore and afterthe frame to b e classi ed:6 to 18 msec, 36 to 48 msec and72 to 84 msec.The two inputs that estimate voicing for each frame areprovidedby a separate three-layerneuralnetwork trainedonvoicedandoicelesssp eechframesfromtendi erentlanguages.Althoughthe voicingclassi 
eristrainedwiththe same PLP features describ edab ove, exp eriments haeshown that includin gthese features improves classi catio np erformance.The outputs of the network fall in the range (0,1) b ecauseof the sigmoid transfer function, and, ideally,approximatethea posterioriprobability of that phoneme given the input[2]. These values are divided by the prior probabili ty of thephoneme in the training set [3].Training the Classi ers.rainingthe neural network re-quiredphoneticall ysegmenteddata.Weusedasemi-automaticpro cedurethatinvolvedhandtranscriptionatthe word levelof ab out a quarter of the corpus and auto-maticgenerationof\\forced\"phoneticalignmentthesetranscriptionsusing a classi er trained on a di erent task.Anewclassi erwasthentrainedonautomaticallyaligned census data and used to realign it. The pro cess wasrep eated a couple of times until p erformance asymptoted.Anequalnumb eroftrainingsamples(approximately1000) was used for each phoneme class.As a consequence,rarephonemesweresampledmore nelythancommonphonemes.Trainingexamplesforbackgroundnoiseandsilencewerechosensuchthatatleasthalfo ccurclosetophonemeb oundaries.Thisbalancingwas needed to trainfor prop er discriminati on b etween the background class andunvoiced closures.Theneuralnetworkas trainedusingbackpropagationProcedings of. ICSLP-94, Sept.19941IEEE 1994","PeriodicalId":90685,"journal":{"name":"Proceedings : ICSLP. International Conference on Spoken Language Processing","volume":"83 1","pages":"683-686"},"PeriodicalIF":0.0000,"publicationDate":"1994-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"25","resultStr":"{\"title\":\"A prototype voice-response questionnaire for the u.s. census\",\"authors\":\"R. Cole, D. Novick, M. Fanty, Pieter J. E. Vermeulen, S. Sutton, D. Burnett, J. Schalkwyk\",\"doi\":\"10.21437/ICSLP.1994-173\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"APROTOTYPEVOICE-RESPONSEQUESTIONNAIREFORTHEU.S.CENSUSRonald Cole, David G. Novick, Mark Fanty,Pieter Vermeulen, Stephen Sutton, Dan Burnett and Johan SchalkwykCenter for Sp oken Language UnderstandingOregon Graduate Institute of Science and Technology20000 N.W. Walker Road, P.O. 
Box 91000, Portland, OR 97291-1000, USAABSTRACTThis pap er describ es a study conducted to determine thefeasibilityof using a sp okenquestionnaireto collect infor-mationfortheYear2000CensusinUSA.Tore nethedialogueandtotrainrecognizers,wecollectedcom-pleteproto colsfromover4000callers.Fortheresp onseslab eled(ab outhalf ),over99p ercentoftheanswerscon-tain the desired information.The recognizers trained so farrangeinp erformancefrom75p ercentcorrectonyearofbirth to over 99 p ercent for maritalstatus.We develop eda prototyp esystemthat engagesthecallersina dialoguetoobtainthedesiredinformation,reviewsrecognizedinformationatthe endof thecall,andasksthecallertoidentify the resp onse categories that are incorrect.1.INTRODUCTIONWehavconductedastudytodeterminethefeasibilityof usingan automatedsp okenquestionnaireto collectin-formationfortheYear2000CensusinUnitedStatesofAmerica.Thegoalthestudywastodevelopandevaluate a telephone questionnaire that automaticall y cap-turesandrecognizesthefollowinginformation:(1)fullname, (2) sex, (3) birth date, (4) marital status (now mar-ried, widowed, divorced, separated, never married|cho oseone),(5)Hispanicorigin(yesorno);ifHispanic:Mexi-can, Mexican-American, Chicano, Puerto Rican, Cuban orother (sp eci y), (6) race:White, Black or Negro, AmericanIndian(sp ecifytrib e),Eskimo,Aleut,Chinese,Japanese,Filipino ,AsianIndian,Hawaiian,Samoan,Korean,Gua-manian, Vietnamese or other (sp ecify).After preliminaryrounds of data collectionto re ne theselectionandwordingof the system prompts,a large,re-gionallydiversedatacollectione ortresultedinapproxi-mately4000calls.Thispap erdescrib esthee ectivenessof theproto colinelicitingthedesiredinformationanditdescrib es the sp oken language system that resulted.2.SYSTEM2.1.RecognitionSignal Pro cessing.The caller's resp onse is transmitted overthe digital phone line as a 8 kHz mu-law enco ded digital sig-nal.A seventhorderPerceptualLinearPredictive(PLP)analysis [1] is p erformed every 6 msec using a 10 msec win-dow.Phonetic Classi cation.Each 6 msec frame of the signalisclassi edphoneticall ybyathreelaerneuralnetwork.To achieve maximum p erformance, a separate vo cabulary-dep endentnetworkistrainedforeachresp onsecategory,using a phoneme set particular to the exp ected pronuncia-tionsof words in that resp onsecategory.This consistsofthesubsetofstandardphonemeswhicho ccurinvo-cabulary, plus any additional context-dep endent phonemeswhichweredeemednecessary(e.g.[tw]forthe[t]in\\\\twenty\\\" and \\\\telve\\\").The background noise and silenceare mo deled by a sp ecial phoneme [.pau].For each frame of sp eech, the neural network is providedwith 70 inputs, which consists of eight PLP co e\\u000ecients andtwovoicingoutputsfromtheframetob eclassi edandaveraged oer each of the following regions b efore and afterthe frame to b e classi ed:6 to 18 msec, 36 to 48 msec and72 to 84 msec.The two inputs that estimate voicing for each frame areprovidedby a separate three-layerneuralnetwork trainedonvoicedandoicelesssp eechframesfromtendi erentlanguages.Althoughthe voicingclassi eristrainedwiththe same PLP features describ edab ove, exp eriments haeshown that includin gthese features improves classi catio np erformance.The outputs of the network fall in the range (0,1) b ecauseof the sigmoid transfer function, and, ideally,approximatethea posterioriprobability of that phoneme given the input[2]. 
These values are divided by the prior probabili ty of thephoneme in the training set [3].Training the Classi ers.rainingthe neural network re-quiredphoneticall ysegmenteddata.Weusedasemi-automaticpro cedurethatinvolvedhandtranscriptionatthe word levelof ab out a quarter of the corpus and auto-maticgenerationof\\\\forced\\\"phoneticalignmentthesetranscriptionsusing a classi er trained on a di erent task.Anewclassi erwasthentrainedonautomaticallyaligned census data and used to realign it. The pro cess wasrep eated a couple of times until p erformance asymptoted.Anequalnumb eroftrainingsamples(approximately1000) was used for each phoneme class.As a consequence,rarephonemesweresampledmore nelythancommonphonemes.Trainingexamplesforbackgroundnoiseandsilencewerechosensuchthatatleasthalfo ccurclosetophonemeb oundaries.Thisbalancingwas needed to trainfor prop er discriminati on b etween the background class andunvoiced closures.Theneuralnetworkas trainedusingbackpropagationProcedings of. ICSLP-94, Sept.19941IEEE 1994\",\"PeriodicalId\":90685,\"journal\":{\"name\":\"Proceedings : ICSLP. International Conference on Spoken Language Processing\",\"volume\":\"83 1\",\"pages\":\"683-686\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"1994-09-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"25\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings : ICSLP. International Conference on Spoken Language Processing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.21437/ICSLP.1994-173\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings : ICSLP. International Conference on Spoken Language Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.21437/ICSLP.1994-173","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 25
Abstract
A prototype voice-response questionnaire for the U.S. census
Ronald Cole, David G. Novick, Mark Fanty, Pieter Vermeulen, Stephen Sutton, Dan Burnett and Johan Schalkwyk
Center for Spoken Language Understanding, Oregon Graduate Institute of Science and Technology, 20000 N.W. Walker Road, P.O. Box 91000, Portland, OR 97291-1000, USA

ABSTRACT
This paper describes a study conducted to determine the feasibility of using a spoken questionnaire to collect information for the Year 2000 Census in the USA. To refine the dialogue and to train recognizers, we collected complete protocols from over 4000 callers. For the responses labeled (about half), over 99 percent of the answers contain the desired information. The recognizers trained so far range in performance from 75 percent correct on year of birth to over 99 percent for marital status. We developed a prototype system that engages the callers in a dialogue to obtain the desired information, reviews the recognized information at the end of the call, and asks the caller to identify the response categories that are incorrect.

1. INTRODUCTION
We have conducted a study to determine the feasibility of using an automated spoken questionnaire to collect information for the Year 2000 Census in the United States of America. The goal of the study was to develop and evaluate a telephone questionnaire that automatically captures and recognizes the following information: (1) full name, (2) sex, (3) birth date, (4) marital status (now married, widowed, divorced, separated, never married; choose one), (5) Hispanic origin (yes or no); if Hispanic: Mexican, Mexican-American, Chicano, Puerto Rican, Cuban or other (specify), (6) race: White, Black or Negro, American Indian (specify tribe), Eskimo, Aleut, Chinese, Japanese, Filipino, Asian Indian, Hawaiian, Samoan, Korean, Guamanian, Vietnamese or other (specify).
After preliminary rounds of data collection to refine the selection and wording of the system prompts, a large, regionally diverse data collection effort resulted in approximately 4000 calls. This paper describes the effectiveness of the protocol in eliciting the desired information, and it describes the spoken language system that resulted.

2. SYSTEM
2.1. Recognition
Signal Processing. The caller's response is transmitted over the digital phone line as an 8 kHz mu-law encoded digital signal. A seventh-order Perceptual Linear Predictive (PLP) analysis [1] is performed every 6 msec using a 10 msec window.
Phonetic Classification. Each 6 msec frame of the signal is classified phonetically by a three-layer neural network. To achieve maximum performance, a separate vocabulary-dependent network is trained for each response category, using a phoneme set particular to the expected pronunciations of words in that response category. This consists of the subset of standard phonemes which occur in the vocabulary, plus any additional context-dependent phonemes which were deemed necessary (e.g. [tw] for the [t] in "twenty" and "twelve"). The background noise and silence are modeled by a special phoneme [.pau].
For each frame of speech, the neural network is provided with 70 inputs: the eight PLP coefficients and two voicing outputs from the frame to be classified, and the same features averaged over each of the following regions before and after that frame: 6 to 18 msec, 36 to 48 msec and 72 to 84 msec.
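To make this input layout concrete, here is a minimal sketch in Python (not the authors' code). The storage of the features as one 10-dimensional row per 6 msec frame (8 PLP coefficients plus 2 voicing values) and the mapping of the 6-18, 36-48 and 72-84 msec regions to frame offsets 1-3, 6-8 and 12-14 are assumptions made for illustration.

```python
import numpy as np

FRAME_STEP_MS = 6                      # one feature row per 6 msec frame (assumed layout)
REGIONS_MS = [(6, 18), (36, 48), (72, 84)]

def context_input(features: np.ndarray, t: int) -> np.ndarray:
    """Assemble the 7 * 10 = 70-dimensional input for the frame at index t.

    `features` is assumed to have shape (num_frames, 10): 8 PLP coefficients
    plus 2 voicing values per 6 msec frame.
    """
    parts = [features[t]]                                 # the frame to be classified
    for lo_ms, hi_ms in REGIONS_MS:
        lo, hi = lo_ms // FRAME_STEP_MS, hi_ms // FRAME_STEP_MS
        for sign in (-1, +1):                             # region before, region after
            idx = np.arange(t + sign * lo, t + sign * (hi + 1), sign)
            idx = np.clip(idx, 0, len(features) - 1)      # clamp at utterance edges
            parts.append(features[idx].mean(axis=0))      # average the region's frames
    return np.concatenate(parts)
```

Under these assumptions, `context_input(plp_and_voicing, t)` returns the 70 values fed to the vocabulary-dependent network for frame `t`; the exact windows and edge handling in the deployed system may differ.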
The two inputs that estimate voicing for each frame are provided by a separate three-layer neural network trained on voiced and voiceless speech frames from ten different languages. Although the voicing classifier is trained with the same PLP features described above, experiments have shown that including these features improves classification performance.
The outputs of the network fall in the range (0, 1) because of the sigmoid transfer function and, ideally, approximate the a posteriori probability of that phoneme given the input [2]. These values are divided by the prior probability of the phoneme in the training set [3].
Training the Classifiers. Training the neural network required phonetically segmented data. We used a semi-automatic procedure that involved hand transcription at the word level of about a quarter of the corpus and automatic generation of "forced" phonetic alignments of these transcriptions using a classifier trained on a different task. A new classifier was then trained on the automatically aligned census data and used to realign it. The process was repeated a couple of times until performance asymptoted.
An equal number of training samples (approximately 1000) was used for each phoneme class. As a consequence, rare phonemes were sampled more finely than common phonemes. Training examples for background noise and silence were chosen such that at least half occur close to phoneme boundaries. This balancing was needed to train for proper discrimination between the background class and unvoiced closures. The neural network was trained using backpropagation.
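As a concrete illustration of the class balancing just described, here is a minimal sketch in Python (not the authors' code). The `aligned_frames` structure, the per-class quota of 1000 and the random-seed handling are assumptions made for illustration.

```python
import random
from collections import defaultdict

def balanced_sample(aligned_frames, per_class=1000, seed=0):
    """Select roughly the same number of training frames per phoneme class.

    `aligned_frames` is a hypothetical list of (feature_vector, phoneme_label)
    pairs produced by the forced-alignment step described above.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for features, label in aligned_frames:
        by_label[label].append(features)

    training_set = []
    for label, frames in by_label.items():
        rng.shuffle(frames)
        # Every class contributes at most `per_class` frames, so rare phonemes
        # are sampled more densely, relative to their frequency, than common ones.
        training_set.extend((f, label) for f in frames[:per_class])

    rng.shuffle(training_set)
    return training_set
```

The paper's additional constraint that at least half of the background and silence examples lie close to phoneme boundaries would require the alignment's boundary times, which this sketch does not model.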