An effect of noise on mental health indicator using voice
Masakazu Higuchi, Shuji Shinohara, M. Nakamura, S. Mitsuyoshi, S. Tokuno, Y. Omiya, Naoki Hagiwara, Takeshi Takano
2017 International Conference on Intelligent Informatics and Biomedical Sciences (ICIIBMS), November 2017. DOI: 10.1109/ICIIBMS.2017.8279690
Citations: 3
Abstract
In today's stressful society, mental health care is an important issue. The authors have been developing methods to assess mental health status from the voice. Voice-based analysis has several benefits: it is noninvasive, requires no specialized device, is easy to use, and enables remote monitoring. We focused on patterns of voice frequency during everyday telephone calls and developed the Mind Monitoring System (MIMOSYS), a smartphone application that monitors mental health status from the voice during telephone calls. MIMOSYS uses voice emotion recognition technology (ST: Sensibility Technology) and outputs “Vitality”, an indicator of health status immediately after a telephone call, and “Mental Activity”, an indicator of mid- to long-term health status. Higher Vitality and Mental Activity values represent better mental health. We expect that by monitoring their mental health status daily with MIMOSYS, users can be prompted to change their behavior and thereby avoid adverse mental conditions such as depressive states.

When using MIMOSYS, it is desirable to avoid noise as much as possible during telephone calls: empirically, at least seven utterances are needed to calculate Vitality, and utterances are difficult to detect correctly when the voice contains noise. However, environmental sounds are inevitably captured during hands-free calls, which can make Vitality unreliable and lead to an incorrect assessment of mental health status. In this study, we investigate the impact of various noises on the mental health status output by our voice analysis method.

We used the speech corpus CENSREC-1-C, provided by the Speech Resources Consortium, in the experiment. This corpus consists of two kinds of data: simulated data produced by noise addition, and recordings made in real environments. Each voice sample is a numeric string vocalized with several pauses and contains nine or ten utterances. The simulated data covers eight kinds of noise: Subway, Babble, Car, Exhibition, Restaurant, Street, Airport, and Station. In each noisy environment, noise is artificially added to clean (noise-free) voice data at SNRs from 20 dB down to −5 dB in 5 dB steps. This data set contains 104 speakers. The real environmental data covers two noisy environments, a university restaurant and the vicinity of a highway, each under two SNR conditions (lower and higher). Furthermore, the voice data in the real environments was recorded synchronously with a close-talking microphone and a remote microphone. This data set contains 10 speakers.

Voice analysis was performed on both data sets. We used only Vitality in this study because each data set contains only a single time point. For the simulated data, the mean Vitality for voice data at an SNR of 20 dB was lower than that for clean voice data in every environment, while the mean Vitality at SNRs below 20 dB (noisier conditions) tended to be higher in five environments (Car, Restaurant, Street, Airport, and Station). For the real environmental data, the mean Vitality for voice data recorded with the remote microphone was higher than that recorded with the close microphone in every combination of the two noisy environments and two SNR conditions.
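As context for the noise-addition procedure described above, the following is a minimal sketch of how a noise recording is typically mixed into clean speech at a target SNR. The actual mixing for CENSREC-1-C was performed by the corpus providers; the function add_noise_at_snr below is a hypothetical helper, not part of the corpus tools, and only illustrates the standard power-ratio scaling.

    import numpy as np

    def add_noise_at_snr(clean, noise, snr_db):
        # Illustrative only; the real CENSREC-1-C mixing was done by the
        # corpus providers. SNR(dB) = 10*log10(P_speech / P_noise), so the
        # noise is scaled to hit the target power ratio.
        if len(noise) < len(clean):
            # Tile the noise to cover the whole clean signal.
            noise = np.tile(noise, len(clean) // len(noise) + 1)
        noise = noise[:len(clean)].astype(np.float64)
        clean = clean.astype(np.float64)

        p_clean = np.mean(clean ** 2)
        p_noise = np.mean(noise ** 2)
        # Choose scale so that p_clean / (scale**2 * p_noise) == 10**(snr_db/10).
        scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
        return clean + scale * noise

    # The SNR grid used in the simulated data: 20 dB down to -5 dB in 5 dB steps.
    snr_grid = list(range(20, -10, -5))  # [20, 15, 10, 5, 0, -5]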
On the other hand, utterances were detected almost correctly for the clean voice data and for voice data at an SNR of 20 dB in all simulated environments, but detection accuracy was markedly poor at SNRs below 20 dB. A similar trend appeared in the real environmental data: utterance detection accuracy for voice data recorded with the remote microphone was markedly poor in every combination of the two noisy environments and two SNR conditions. This means that noise segments are incorrectly recognized as utterances, and the voice analysis then includes these false utterances. Consequently, the mental health status may not be calculated correctly in noisier environments. From these results, an SNR of about 20 dB appears to be the limit for analysis under noise.

Moreover, changes in the emotion components calculated by ST were investigated in the simulated data. The mean joy and sorrow components for voice data at an SNR of 20 dB were higher than those for clean voice data in all noisy environments, while the mean anger component at an SNR of 20 dB was lower than that for clean voice data in all noisy environments. This indicates that not only the degraded speech detection but also the noise itself affects the analyzed mental health status. In future work, we will improve the accuracy of utterance detection in noisy environments and verify that an appropriate noise filter improves the accuracy of the mental health status analysis.
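The failure mode reported here, noise segments being misrecognized as utterances, can be illustrated with a toy energy-threshold detector. The abstract does not specify how MIMOSYS detects utterances, so the sketch below is an assumption-laden stand-in: a fixed energy threshold works at high SNR but fires on the noise floor once the SNR drops, inflating the utterance count.

    import numpy as np

    def detect_utterances(signal, sr, frame_ms=25, hop_ms=10, threshold_db=-35.0):
        # Toy energy-based detector (not the MIMOSYS algorithm). Frames whose
        # short-time energy, relative to the signal peak, exceeds the threshold
        # are marked as speech; under heavy noise the noise floor itself crosses
        # the threshold, producing false utterances.
        frame = int(sr * frame_ms / 1000)
        hop = int(sr * hop_ms / 1000)
        peak = np.max(np.abs(signal)) + 1e-12

        flags = []
        for start in range(0, len(signal) - frame, hop):
            chunk = signal[start:start + frame] / peak
            energy_db = 10.0 * np.log10(np.mean(chunk ** 2) + 1e-12)
            flags.append(energy_db > threshold_db)

        # Count contiguous runs of speech frames as utterances.
        utterances, prev = 0, False
        for f in flags:
            if f and not prev:
                utterances += 1
            prev = f
        return utterances

Simply raising the threshold does not repair such a detector, because at low SNR the energy distributions of speech and noise overlap; this is consistent with the authors' plan to apply a noise filter before utterance detection.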