Multimodal Fusion for Segment Classification in Folk Music

2021 IEEE 18th India Council International Conference (INDICON), Pub Date: 2021-12-19, DOI: 10.1109/INDICON52576.2021.9691751

Aravind Krishnan, Amal Vincent, Geevar Jos, R. Rajan


Citation count: 2

Abstract

A folk music segment classification system is proposed that uses multimodal fusion of acoustic features, textual information and a duration-based feature on a Thiruvathirakali music corpus. Acoustic features are learned from musical texture features (MTF) using a long short-term memory (LSTM) model. A term frequency-inverse document frequency (TF-IDF) model is employed to derive text-based features from transcription data. For multimodal fusion, early integration of the LSTM-derived features, the TF-IDF features and the duration feature is employed. The LSTM model is further optimised through frame fusion in the temporal domain, which is seen to increase classification efficiency by 13 percent and to reduce computational expense tenfold. With frame fusion, the LSTM model reports an overall precision, recall and F1 measure of 0.53, 0.52 and 0.51 respectively, outperforming a baseline SVM classifier. The classification efficiency is seen to improve by 15 percent (absolute) with the addition of each multimodal component. With complete multimodal fusion, the metrics improve to 0.83, 0.78 and 0.80 respectively.
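The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not define frame fusion, so the sketch assumes it means averaging every `k` consecutive feature frames before they reach the LSTM. It also assumes a plain TF-IDF weighting (term count over document length, times `log(N / df)`) and treats early integration as concatenating the per-segment vectors. The function names `fuse_frames`, `tfidf_vectors` and `early_fusion`, the placeholder tokens and the numeric inputs are all hypothetical, and the acoustic vector merely stands in for an LSTM-derived embedding.

```python
import math
from collections import Counter


def fuse_frames(frames, k):
    """Average every k consecutive frames (an ASSUMED reading of 'frame fusion').

    A final group with fewer than k frames is averaged over its own length.
    """
    fused = []
    for start in range(0, len(frames), k):
        group = frames[start:start + k]
        dim = len(group[0])
        fused.append([sum(f[d] for f in group) / len(group) for d in range(dim)])
    return fused


def tfidf_vectors(docs):
    """Plain TF-IDF: tf = count / document length, idf = log(N / df)."""
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(tokenised)
    # Document frequency: number of documents containing each token.
    df = Counter(tok for toks in tokenised for tok in set(toks))
    vectors = []
    for toks in tokenised:
        if not toks:  # empty transcription -> all-zero vector
            vectors.append([0.0] * len(vocab))
            continue
        counts = Counter(toks)
        vectors.append(
            [(counts[t] / len(toks)) * math.log(n_docs / df[t]) for t in vocab]
        )
    return vocab, vectors


def early_fusion(acoustic, text, duration):
    """Early integration: concatenate all per-segment features into one vector."""
    return list(acoustic) + list(text) + [float(duration)]


if __name__ == "__main__":
    # 20 toy frames of 2 dimensions, fused in groups of 10 -> 2 frames.
    frames = [[float(i), float(2 * i)] for i in range(20)]
    fused = fuse_frames(frames, k=10)

    # Placeholder token strings, not real transcription data.
    vocab, text_vecs = tfidf_vectors(["ta ta dhim", "ta thom thom"])

    # Hypothetical 2-dimensional stand-in for an LSTM-derived embedding.
    segment = early_fusion([0.1, 0.2], text_vecs[0], duration=12.5)
    print(len(fused), vocab, len(segment))
```

Under this assumed reading, fusing ten frames into one shortens the sequence the LSTM has to process by a factor of ten. The fused vector returned by `early_fusion` would then be passed to a single downstream classifier, which is what distinguishes early integration from combining the outputs of separate per-modality classifiers.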
