Guest editorial: Music perception and cognition in music technology
Authors: Zijin Li, Stephen McAdams
Journal: Cognitive Computation and Systems (publication date 2022-06-30)
DOI: 10.1049/ccs2.12066
URL: https://onlinelibrary.wiley.com/doi/10.1049/ccs2.12066
Citations: 0
Abstract
Interest in music technology, a multi-disciplinary research area, has grown remarkably in the past few years. The field involves digital signal processing, acoustics, mechanics, computer science, electronic engineering, artificial intelligence, psychophysiology, cognitive neuroscience, and music performance, theory and analysis. Among these sub-domains, Music Perception and Cognition are important parts of Computational Musicology, because Musiking is a whole activity that runs from the music noumenon to its being perceived and cognised by human beings. In addition to the computation of the basic elements of music itself, such as rhythm, pitch, timbre, harmony and structure, the perception of music by the human ear and the creative cognitive process deserve more attention from researchers, because they serve as a bridge between the humanities and technology.
Music perception is present in almost every music-related activity, such as composing, playing, improvising, performing, teaching and learning. It is so comprehensive that it draws on a range of disciplines, including cognitive musicology, musical timbre perception, music emotion, acoustics, audio-based music signal processing, music interaction, cognitive modelling and music information retrieval.
This special issue aims to bring together researchers from the humanities and technology who work on music technology in areas such as music performance art, creativity, computer science, experimental psychology, and cognitive science. It is composed of 10 outstanding contributions covering auditory attention selection behaviours, emotional music generation, instrument and performance skills recognition, perception and musical elements, music educational robots, affective computing, music-related social behaviour, and cross-cultural music datasets.
Li et al. studied the automatic recognition of traditional Chinese musical instrument audio. In the instrument type identification experiment, the Mel-spectrum is used as input and an 8-layer convolutional neural network is trained; this configuration achieves 99.3% accuracy. The performance skills recognition experiments were conducted at the single-instrument level and at the same-kind-instruments level, where the regularity of the same playing technique across different instruments can be exploited. With a similar training configuration, the recognition accuracy for the four kinds of instruments is 95.7% for blowing instruments, 82.2% for plucked string instruments, 88.3% for string instruments, and 97.5% for percussion instruments.
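As a rough illustration of what a Mel-spectrum input involves, the sketch below places band edges at equal spacing on the mel scale. The conversion formula (the common 2595 * log10(1 + f/700) variant), the band count and the frequency range are assumptions chosen for illustration; the editorial does not state which settings Li et al. used.

```python
import math


def hz_to_mel(freq_hz):
    """Convert Hz to mels using the 2595 * log10(1 + f / 700) variant (assumed)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(n_bands, f_min_hz, f_max_hz):
    """Return n_bands + 2 edge frequencies in Hz, equally spaced in mels."""
    lo, hi = hz_to_mel(f_min_hz), hz_to_mel(f_max_hz)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_bands + 2)]


# Hypothetical settings: 8 bands between 0 Hz and 8 kHz.
edges = mel_band_edges(n_bands=8, f_min_hz=0.0, f_max_hz=8000.0)
# Width in Hz of each interval between neighbouring edges.
widths = [high - low for low, high in zip(edges, edges[1:])]
```

Because the edges are evenly spaced in mels, the intervals in `widths` grow with frequency, which gives finer resolution at low frequencies than at high ones.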
Wang et al. used a cross-cultural approach to explore the correlations between perception and musical elements by comparing music emotion recognition models. Participants were asked to rate valence, tension arousal and energy arousal on labelled nine-point analogical-categorical scales for four types of classical music: Chinese ensemble, Chinese solo, Western ensemble and Western solo. Fifteen musical elements in five categories (timbre, rhythm, articulation, dynamics and register) were annotated through manual evaluation or an automatic algorithm. The results showed that tempo, rhythm complexity and articulation are culturally universal, whereas musical elements related to timbre, register and dynamic features are culturally specific.
Du et al. proposed a multi-scale ASA model based on the binary Logit model by referencing the information value and saliency-driven factors of the listener's attention behaviour. The verification experiment showed that the proposed ASA model effectively predicted features of human selective auditory attention. Compared with earlier auditory attention studies and traditional attention models, the proposed ASA model has cognitive characteristics that correspond more closely to the authentic auditory attention process, and it can be applied to practical HMS optimisation. Furthermore, with the proposed ASA model, auditory attention behaviour can be predicted before the task. This will help researchers analyse listeners' behaviours and evaluate ergonomics in ‘cocktail party effect’ environments.
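For readers unfamiliar with the binary Logit model that this work builds on, the following minimal sketch shows its general form: a weighted sum of predictors is passed through the logistic function to give the probability of one of two outcomes. The predictor names, weights and bias are hypothetical values for illustration and are not taken from Du et al.; the multi-scale aspect of their model is not represented.

```python
import math


def logit_attention_probability(features, weights, bias=0.0):
    """Binary logit: P = 1 / (1 + exp(-(bias + sum_i w_i * x_i)))."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical predictors for one sound source: [information value, saliency].
# The weights and bias are made up; in practice they would be fitted to data.
p_attend = logit_attention_probability(
    features=[0.8, 0.3], weights=[2.0, 1.5], bias=-1.0
)
```

With these made-up numbers the linear score is 1.05, so `p_attend` is roughly 0.74; a score of zero would give exactly 0.5.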
Ma et al. proposed an emotional music generation model that considers structural features of the music along with its emotional label. Specifically, the emotional labels and music structure features are embedded as the conditional input, a conditional generative GRU model generates music in an auto-regressive manner, and a perceptual loss is optimised together with the cross-entropy loss during training. Both the subjective and the objective experiments indicate that the model can generate emotional music that corresponds to the specified emotion and music structures.
Jiang et al. analysed the mechanism of sound production in terms of the coupling between the edge tone and the vibration of the air column in the tube. Numerical simulations showed that the oscillation frequency of the edge tone increases with the jet velocity and jumps to a higher stage at certain values, and that the dominant modes can be altered by varying the impinging jet angle. Experiments on a musical pipe model further demonstrated that the tonal quality of the flue pipe depends on changes in the oscillation frequency of the edge tone. The acoustic response of the flue pipe shows greater amplitude and higher dominant frequencies as the jet velocity increases. Given these properties, a flutist can obtain subtle variations in the perceived tonal quality by adjusting the blowing velocity during the attack transient.
Li et al. presented the design and development of a virtual fretless Chinese stringed instrument app, taking the Duxianqin as an example. The digital simulation of fretless musical instruments consists of simulating the continuous pitch processing of the strings and simulating the sound produced by plucking the strings. Focussing on mechanics and wave theory, they obtain the quantitative relationship between the frequency of a string and its deformation and elongation, and use physical acoustic theory to quantitatively reproduce the way the instrument is played.
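The kind of relationship described above can be illustrated with the textbook ideal-string model, in which the fundamental frequency is f = sqrt(T / mu) / (2 L) and stretching the string raises its tension according to Hooke's law. This is a sketch under those simplifying assumptions, not the derivation used by Li et al.; the function names and all numerical values are hypothetical, and the small change in length and linear density caused by the stretch is ignored.

```python
import math


def string_frequency(length_m, tension_n, linear_density_kg_per_m):
    """Fundamental of an ideal string: f = sqrt(T / mu) / (2 * L)."""
    return math.sqrt(tension_n / linear_density_kg_per_m) / (2.0 * length_m)


def tension_after_stretch(rest_tension_n, youngs_modulus_pa, area_m2,
                          rest_length_m, elongation_m):
    """Hooke's law for a wire: T = T0 + E * A * (delta_L / L0)."""
    return rest_tension_n + youngs_modulus_pa * area_m2 * elongation_m / rest_length_m


# Hypothetical string: 1 m long, 100 N tension, 0.01 kg/m linear density.
open_pitch_hz = string_frequency(1.0, 100.0, 0.01)

# Stretching the same string by 5 mm (assumed E = 2e11 Pa, A = 1e-7 m^2)
# doubles the tension in this example, raising the pitch by a factor sqrt(2).
stretched_tension_n = tension_after_stretch(100.0, 2.0e11, 1.0e-7, 1.0, 0.005)
bent_pitch_hz = string_frequency(1.0, stretched_tension_n, 0.01)
```

Because frequency scales with the square root of tension, a continuous change in elongation gives a continuous change in pitch, which is the behaviour a fretless instrument simulation has to capture.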
Zhang et al. proposed an optimisation method, based on Iterative Adaptive Inverse Filtering (IAIF), that automatically determines the linear prediction analysis order of the vocal tract according to the specific voicing situation. They aim to obtain a more accurate glottal wave from a speech or singing voice signal in a non-invasive way. Compared with existing methods that use a fixed, empirically chosen order, the proposed method achieves up to an 8.41% improvement in the correlation coefficient with the real glottal wave.
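To make the notions of "linear prediction analysis order" and "correlation coefficient" concrete, the sketch below fits a linear predictor of a chosen order with the standard autocorrelation method (Levinson-Durbin recursion) and defines a Pearson correlation. It is a generic building block under stated assumptions, not IAIF itself and not the order-selection rule proposed by Zhang et al.; the test signal is a synthetic decaying exponential rather than real voice data.

```python
import math


def autocorrelation(signal, max_lag):
    """Autocorrelation sums r[0..max_lag] of a finite signal."""
    n = len(signal)
    return [sum(signal[i] * signal[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations.

    Returns (a, err) with a[0] = 1, where the predictor is
    x[n] ~ -sum_{k=1..order} a[k] * x[n - k] and err is the residual energy.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a, err


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y))
    sx = math.sqrt(sum((p - mx) ** 2 for p in x))
    sy = math.sqrt(sum((q - my) ** 2 for q in y))
    return cov / (sx * sy)


# Synthetic first-order signal: x[n] = 0.9 * x[n - 1].
signal = [0.9 ** n for n in range(200)]
r = autocorrelation(signal, max_lag=4)
coeffs, _ = levinson_durbin(r, 1)
errors = {p: levinson_durbin(r, p)[1] for p in (1, 2, 3, 4)}
```

For this first-order signal `coeffs[1]` is close to -0.9 and the residual energy in `errors` barely changes beyond order 1, which illustrates why the appropriate order depends on the signal being analysed.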
Chen et al. constructed the first extensive labelled Music Video (MV) dataset, Next-MV, consisting of 6000 30-s MV fragments annotated with five music style labels and four cultural labels. They also propose a Next-Net framework to study the correlation between music style and visual style. The experimental accuracy reached 71.1%, and in a cross-cultural experiment the accuracy of the general fusion model lies between that of the model trained within-dataset and that of the model trained cross-dataset. This shows that culture has a significant influence on the correlation between music and visual style.
Zhang et al. proposed a pipeline for a perceptual survey designed to explore how different musical elements influence people's perception of ‘Chinese style’ in music. Participants with various backgrounds were presented with categorised music excerpts performed on the Erhu or the violin and then gave ‘Chinese style’ ratings. Statistical analysis indicates that, in general, music content contributes more than the instrument, that musicians are more sensitive than non-musicians to both music content and instrument, and that musicians' responses are more concentrated. A supplementary automatic music classification experiment is compared with the survey results to discuss the authors' choice of stimuli and the similarities between computer audition and human perception.
Chen et al. derived a new research model from the environmental psychology model in the literature and designed an empirical experiment to examine changes in consumers' non-behavioural shopping outcomes under different conditions. Specifically, they built a virtual shopping website and chose the Mid-Autumn Festival as the experimental scenario, using a questionnaire to measure the differences in the dependent variables produced by the different treatments. The results show that background music contributes to a more positive shopping experience regardless of its theme.
Xie et al. proposed an evaluation method for the aesthetic categories of Chinese traditional music, established a dataset of 500 clips in five aesthetic categories, and analysed how the different aesthetic categories are distributed in the emotional dimension space. They also extracted the corresponding acoustical features and tested the accuracy of different classifiers for aesthetic classification on this dataset; the highest classification accuracy, 65.37%, was achieved with logistic regression.
Wang et al. proposed a subjective user study on the hardness of drum sound, taking the Bass Drum as an example. They studied the impact of different audio effects on the perceived hardness of the Bass Drum. The results show that appropriate low-frequency excitation processing weakens, and appropriate high-frequency excitation processing increases, the perceived hardness of the Bass Drum, and that the change is clearly perceptible. Properly raising the base frequency of the Bass Drum, or changing its sound envelope to create a faster ‘attack’, can also increase the perceived hardness, but the effect is less pronounced. Furthermore, changing the frequency and changing the envelope affect each other, and their interaction is also the main reason for changes in the perceived hardness of the Bass Drum.
All the papers selected for this Special Issue show the importance of music perception to the improvement of music technology. Most of the papers contain real-world validation with experimental data, and most of them present and demonstrate innovative system designs and processing solutions. Meanwhile, there are still many challenges in this field that require future research attention. Future research can help music technology realise its potential, extend its applications and accelerate market adoption.
We would like to express our gratitude and congratulations to all the authors of the selected papers in this Special Issue of IET Music Perception and Cognition in Music Technology for their contributions of great value in terms of quality and innovation. We also thank all the reviewers for their contribution to the selection and improvement process of the publications in this Special Issue. Our hope is that this Special Issue will stimulate researchers in both industry and academia to undertake further research in this challenging field. We are also grateful to the IET Cognitive Computation and Systems Editor-in-Chief and the Editorial office for their support throughout the editorial process.