An effect of noise on mental health indicator using voice

Masakazu Higuchi, Shuji Shinohara, M. Nakamura, S. Mitsuyoshi, S. Tokuno, Y. Omiya, Naoki Hagiwara, Takeshi Takano
{"title":"噪声对语音心理健康指标的影响","authors":"Masakazu Higuchi, Shuji Shinohara, M. Nakamura, S. Mitsuyoshi, S. Tokuno, Y. Omiya, Naoki Hagiwara, Takeshi Takano","doi":"10.1109/ICIIBMS.2017.8279690","DOIUrl":null,"url":null,"abstract":"In stressful modern society, mental health care is one of important issues. The authors have been developing methods to assess mental health status by voice. Analysis using voice has benefits such as, noninvasive, not necessary any specialized device, easy use, and remote-able monitoring. We focused on the pattern of voice frequency during in daily life telephone calls, and developed the Mind Monitoring System (MIMOSYS) which is the smartphone application to monitor the mental health status by voice during telephone calls. MIMOSYS uses voice emotion recognition technology (ST: Sensibility Technology) and outputs “Vitality” which is the indicator to denote the health status immediately after the telephone call and “Mental Activity” which is the indicator to denote the mid-to long-term health status. Higher vitality and Mental Activity values represent a better mental health status. We expect that the user can avoid behavior mental condition due to inducing behavior change, for example depression state, by monitoring mental health status daily using MIMOSYS. When using MIMOSYS, it is desirable to avoid noise as much as possible during telephone calls because empirically at least 7 utterances or more are appropriate for calculating the vitality and it is difficult to correctly detect utterances if noise is contained in the voice. However, environmental sounds will be included when talking in a hands-free manner, and it may cause analyzed results of incorrect mental health status because of unreliable vitality. In this study, we investigate the impact of various noises on the mental health status output by our voice analysis method. We used the sound corpus CENSREC-1-C provided by Speech Resources Consortium in the experiment. This corpus consists of two kinds of data, the simulated data by the noise-addition and the recording data in real environments. One voice data is a numeric string vocalized with several intervals and includes nine or ten utterances. The simulated data includes eight kinds of noise, such as Subway, Babble, Car, Exhibition, Restaurant, Street, Airport and Station. In each noisy environment, noises at SNRs from 20dB to −5dB every 5 dB increments are artificially added to clean voice data without noise. The number of speaker is 104 in this data set. The real environmental data includes two real-noisy environments, such as the university restaurant and the vicinity of highway. In each real environment, there are two SNR conditions, the lower and higher SNR conditions. Furthermore, voice data was recorded with close microphone and remote microphone synchronously in real environments. The number of speaker is 10 in this data set, Voice analysis was performed for both voice data sets. We used only “vitality” for this research because the data have only one time point data. As a result for the simulated data, the mean of vitality values for voice data at SNR of 20dB was lower than it for clean voice data in each environment, and the means of vitality values for voice data at noise levels noisier than 20dB tended to show higher values in five environments (Car, Restaurant, Street, Airport and Station). 
As a result for the real environmental data, the mean of vitality values for voice data recorded with remote microphone was higher than it with close microphone in any combination of two noisy environments and two SNR conditions. On the other hand, utterances detected in the voice analysis were almost correctly detected for clean voice data and voice data at SNR of 20dB in all environments in the simulated data, but the detection accuracies at noise levels noisier than 20dB were poor markedly. The similar trend was shown in the real environmental data, that is, the detection accuracy of utterances for voice data recorded with remote microphone was poor markedly in any combination of two noisy environments and two SNR conditions. This means that a noise part is incorrectly recognized as an utterance and the voice analysis is performed including the false utterance. Therefore, it is possible that the mental health status cannot be calculated correctly under noisier environments. From the above results, the noisy environment at SNR of about 20dB is the limit of the analysis under some noise. Moreover, changes of emotions calculated by ST were also investigated in the simulated data. As a result, the means of joy and sorrow components of voice data at SNR of 20dB were higher than these means of clean voice data in all noisy environments, and the mean of anger component of voice data at SNR of 20dB was lower than that mean of clean voice data in all noisy environments. This indicates that not only the speech detection but also the noise itself affects the analysis result of the mental health status. In the future, we will increase the accuracy of utterance detection under noise environment and verify that the accuracy of mental health status analysis will improve with an appropriate noise filter.","PeriodicalId":122969,"journal":{"name":"2017 International Conference on Intelligent Informatics and Biomedical Sciences (ICIIBMS)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"An effect of noise on mental health indicator using voice\",\"authors\":\"Masakazu Higuchi, Shuji Shinohara, M. Nakamura, S. Mitsuyoshi, S. Tokuno, Y. Omiya, Naoki Hagiwara, Takeshi Takano\",\"doi\":\"10.1109/ICIIBMS.2017.8279690\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In stressful modern society, mental health care is one of important issues. The authors have been developing methods to assess mental health status by voice. Analysis using voice has benefits such as, noninvasive, not necessary any specialized device, easy use, and remote-able monitoring. We focused on the pattern of voice frequency during in daily life telephone calls, and developed the Mind Monitoring System (MIMOSYS) which is the smartphone application to monitor the mental health status by voice during telephone calls. MIMOSYS uses voice emotion recognition technology (ST: Sensibility Technology) and outputs “Vitality” which is the indicator to denote the health status immediately after the telephone call and “Mental Activity” which is the indicator to denote the mid-to long-term health status. Higher vitality and Mental Activity values represent a better mental health status. We expect that the user can avoid behavior mental condition due to inducing behavior change, for example depression state, by monitoring mental health status daily using MIMOSYS. 
When using MIMOSYS, it is desirable to avoid noise as much as possible during telephone calls because empirically at least 7 utterances or more are appropriate for calculating the vitality and it is difficult to correctly detect utterances if noise is contained in the voice. However, environmental sounds will be included when talking in a hands-free manner, and it may cause analyzed results of incorrect mental health status because of unreliable vitality. In this study, we investigate the impact of various noises on the mental health status output by our voice analysis method. We used the sound corpus CENSREC-1-C provided by Speech Resources Consortium in the experiment. This corpus consists of two kinds of data, the simulated data by the noise-addition and the recording data in real environments. One voice data is a numeric string vocalized with several intervals and includes nine or ten utterances. The simulated data includes eight kinds of noise, such as Subway, Babble, Car, Exhibition, Restaurant, Street, Airport and Station. In each noisy environment, noises at SNRs from 20dB to −5dB every 5 dB increments are artificially added to clean voice data without noise. The number of speaker is 104 in this data set. The real environmental data includes two real-noisy environments, such as the university restaurant and the vicinity of highway. In each real environment, there are two SNR conditions, the lower and higher SNR conditions. Furthermore, voice data was recorded with close microphone and remote microphone synchronously in real environments. The number of speaker is 10 in this data set, Voice analysis was performed for both voice data sets. We used only “vitality” for this research because the data have only one time point data. As a result for the simulated data, the mean of vitality values for voice data at SNR of 20dB was lower than it for clean voice data in each environment, and the means of vitality values for voice data at noise levels noisier than 20dB tended to show higher values in five environments (Car, Restaurant, Street, Airport and Station). As a result for the real environmental data, the mean of vitality values for voice data recorded with remote microphone was higher than it with close microphone in any combination of two noisy environments and two SNR conditions. On the other hand, utterances detected in the voice analysis were almost correctly detected for clean voice data and voice data at SNR of 20dB in all environments in the simulated data, but the detection accuracies at noise levels noisier than 20dB were poor markedly. The similar trend was shown in the real environmental data, that is, the detection accuracy of utterances for voice data recorded with remote microphone was poor markedly in any combination of two noisy environments and two SNR conditions. This means that a noise part is incorrectly recognized as an utterance and the voice analysis is performed including the false utterance. Therefore, it is possible that the mental health status cannot be calculated correctly under noisier environments. From the above results, the noisy environment at SNR of about 20dB is the limit of the analysis under some noise. Moreover, changes of emotions calculated by ST were also investigated in the simulated data. 
As a result, the means of joy and sorrow components of voice data at SNR of 20dB were higher than these means of clean voice data in all noisy environments, and the mean of anger component of voice data at SNR of 20dB was lower than that mean of clean voice data in all noisy environments. This indicates that not only the speech detection but also the noise itself affects the analysis result of the mental health status. In the future, we will increase the accuracy of utterance detection under noise environment and verify that the accuracy of mental health status analysis will improve with an appropriate noise filter.\",\"PeriodicalId\":122969,\"journal\":{\"name\":\"2017 International Conference on Intelligent Informatics and Biomedical Sciences (ICIIBMS)\",\"volume\":\"15 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-11-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2017 International Conference on Intelligent Informatics and Biomedical Sciences (ICIIBMS)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICIIBMS.2017.8279690\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 International Conference on Intelligent Informatics and Biomedical Sciences (ICIIBMS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICIIBMS.2017.8279690","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Cited by: 3

Abstract

In today's stressful society, mental health care is an important issue. The authors have been developing methods to assess mental health status from the voice. Voice-based analysis has several benefits: it is noninvasive, requires no specialized device, is easy to use, and enables remote monitoring. We focused on voice frequency patterns during everyday telephone calls and developed the Mind Monitoring System (MIMOSYS), a smartphone application that monitors mental health status from the voice during calls. MIMOSYS uses voice emotion recognition technology (ST: Sensibility Technology) and outputs two indicators: "Vitality", which reflects the health status immediately after a call, and "Mental Activity", which reflects the mid- to long-term health status. Higher Vitality and Mental Activity values represent better mental health. By monitoring mental health daily with MIMOSYS, we expect users to induce behavior change and thereby avoid adverse mental conditions such as a depressive state.

When using MIMOSYS, it is desirable to avoid noise as much as possible during calls: empirically, at least seven utterances are needed to calculate Vitality, and utterances are difficult to detect correctly when the voice contains noise. However, environmental sounds are inevitably captured when talking hands-free, and the resulting unreliable Vitality may yield an incorrect assessment of mental health status. In this study, we investigate the impact of various noises on the mental health status output by our voice analysis method.

We used the speech corpus CENSREC-1-C, provided by the Speech Resources Consortium. The corpus consists of two kinds of data: simulated data created by noise addition and recordings made in real environments. Each voice sample is a numeric string spoken with several pauses and contains nine or ten utterances. The simulated data cover eight noise types (Subway, Babble, Car, Exhibition, Restaurant, Street, Airport, and Station); for each type, noise is artificially added to clean, noise-free speech at SNRs from 20 dB down to −5 dB in 5 dB steps, and this data set contains 104 speakers. The real environmental data cover two noisy environments, a university restaurant and the vicinity of a highway, each under a lower and a higher SNR condition; speech was recorded synchronously with a close microphone and a remote microphone, and this data set contains 10 speakers. Voice analysis was performed on both data sets. Because each sample provides only a single time point, we used only Vitality in this study.
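For reference, the simulated conditions above correspond to a standard SNR-controlled mixing step: the noise is scaled so that the ratio of speech power to noise power matches the target SNR before the two signals are summed. The sketch below is a generic illustration of that procedure, not the corpus's official tool (CENSREC-1-C ships its noisy data pre-mixed); the file names and the use of numpy/soundfile are assumptions.

```python
# Minimal sketch of SNR-controlled noise addition (illustrative; hypothetical file names).
import numpy as np
import soundfile as sf

def add_noise_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix `noise` into `clean` so that the mixture has the requested SNR in dB."""
    # Repeat or trim the noise to match the length of the clean utterance.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]

    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Choose gain so that 10*log10(clean_power / (gain**2 * noise_power)) == snr_db.
    gain = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + gain * noise

if __name__ == "__main__":
    clean, fs = sf.read("clean_utterance.wav")   # hypothetical input files
    noise, _ = sf.read("station_noise.wav")
    # SNRs used in the simulated data: 20 dB down to -5 dB in 5 dB steps.
    for snr in range(20, -10, -5):
        noisy = add_noise_at_snr(clean, noise, snr)
        sf.write(f"noisy_snr{snr}dB.wav", noisy, fs)
```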
For the simulated data, the mean Vitality of speech at an SNR of 20 dB was lower than that of clean speech in every environment, whereas the mean Vitality at noise levels noisier than 20 dB tended to be higher in five environments (Car, Restaurant, Street, Airport, and Station). For the real environmental data, the mean Vitality of speech recorded with the remote microphone was higher than that recorded with the close microphone in every combination of the two noisy environments and the two SNR conditions.

On the other hand, in the simulated data, utterances were detected almost correctly for clean speech and for speech at an SNR of 20 dB in all environments, but detection accuracy deteriorated markedly at noise levels noisier than 20 dB. The real environmental data showed a similar trend: detection accuracy for speech recorded with the remote microphone was markedly poor in every combination of the two noisy environments and the two SNR conditions. In other words, noise segments are incorrectly recognized as utterances, and the voice analysis is then performed on these false utterances, so the mental health status may not be calculated correctly in noisier environments. These results suggest that an SNR of about 20 dB is the limit below which the analysis no longer works reliably under noise.

We also investigated, for the simulated data, how the emotion components calculated by ST change with noise. In all noisy environments, the mean joy and sorrow components at an SNR of 20 dB were higher than those of clean speech, while the mean anger component at an SNR of 20 dB was lower than that of clean speech. This indicates that not only the utterance detection but also the noise itself affects the analysis of mental health status. In future work, we will improve the accuracy of utterance detection in noisy environments and verify whether an appropriate noise filter improves the accuracy of the mental health status analysis.
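The paper does not detail the utterance-detection algorithm inside the voice analysis, but the failure mode it reports (noise segments counted as utterances) can be illustrated with a generic short-time-energy detector: once broadband noise raises the frame energy above a threshold calibrated on clean speech, noise-only regions are reported as speech. The framing parameters and threshold heuristic below are assumptions for illustration only, not MIMOSYS's actual detector.

```python
# Generic short-time-energy utterance detector (illustrative only).
import numpy as np

def detect_utterances(signal: np.ndarray, fs: int,
                      frame_ms: float = 25.0, hop_ms: float = 10.0,
                      threshold_db: float = -35.0, min_frames: int = 10):
    """Return (start, end) sample indices of segments whose frame energy stays
    above `threshold_db` (relative to the signal's peak) for at least
    `min_frames` consecutive frames."""
    frame = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    peak = np.max(np.abs(signal)) + 1e-12

    # Frame-wise log energy, normalized to the signal's peak amplitude.
    energies_db = []
    for start in range(0, len(signal) - frame + 1, hop):
        chunk = signal[start:start + frame] / peak
        energies_db.append(10 * np.log10(np.mean(chunk ** 2) + 1e-12))

    # Group consecutive above-threshold frames into utterance segments.
    active = np.array(energies_db) > threshold_db
    segments, run_start = [], None
    for i, is_active in enumerate(active):
        if is_active and run_start is None:
            run_start = i
        elif not is_active and run_start is not None:
            if i - run_start >= min_frames:
                segments.append((run_start * hop, i * hop + frame))
            run_start = None
    if run_start is not None and len(active) - run_start >= min_frames:
        segments.append((run_start * hop, len(signal)))
    return segments
```

Under the corpus's noisier conditions (SNRs below about 20 dB), the noise floor itself clears such an energy threshold, which is consistent with the reported drop in detection accuracy and the resulting unreliable Vitality.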