HumourHindiNet: Humour detection in Hindi web series using word embedding and convolutional neural network

IF 1.8 4区 计算机科学 Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Akshi Kumar, Abhishek Mallik, Sanjay Kumar
{"title":"HumourHindiNet: Humour detection in Hindi web series using word embedding and convolutional neural network","authors":"Akshi Kumar, Abhishek Mallik, Sanjay Kumar","doi":"10.1145/3661306","DOIUrl":null,"url":null,"abstract":"<p>Humour is a crucial aspect of human speech, and it is, therefore, imperative to create a system that can offer such detection. While data regarding humour in English speech is plentiful, the same cannot be said for a low-resource language like Hindi. Through this paper, we introduce two multimodal datasets for humour detection in the Hindi web series. The dataset was collected from over 500 minutes of conversations amongst the characters of the Hindi web series \\(Kota-Factory\\) and \\(Panchayat\\). Each dialogue is manually annotated as Humour or Non-Humour. Along with presenting a new Hindi language-based Humour detection dataset, we propose an improved framework for detecting humour in Hindi conversations. We start by preprocessing both datasets to obtain uniformity across the dialogues and datasets. The processed dialogues are then passed through the Skip-gram model for generating Hindi word embedding. The generated Hindi word embedding is then passed onto three convolutional neural network (CNN) architectures simultaneously, each having a different filter size for feature extraction. The extracted features are then passed through stacked Long Short-Term Memory (LSTM) layers for further processing and finally classifying the dialogues as Humour or Non-Humour. We conduct intensive experiments on both proposed Hindi datasets and evaluate several standard performance metrics. The performance of our proposed framework was also compared with several baselines and contemporary algorithms for Humour detection. The results demonstrate the effectiveness of our dataset to be used as a standard dataset for Humour detection in the Hindi web series. The proposed model yields an accuracy of 91.79 and 87.32 while an F1 score of 91.64 and 87.04 in percentage for the \\(Kota-Factory\\) and \\(Panchayat\\) datasets, respectively.</p>","PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":"7 1","pages":""},"PeriodicalIF":1.8000,"publicationDate":"2024-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Asian and Low-Resource Language Information Processing","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3661306","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0

Abstract

Humour is a crucial aspect of human speech, and it is, therefore, imperative to create a system that can offer such detection. While data regarding humour in English speech is plentiful, the same cannot be said for a low-resource language like Hindi. Through this paper, we introduce two multimodal datasets for humour detection in the Hindi web series. The dataset was collected from over 500 minutes of conversations amongst the characters of the Hindi web series \(Kota-Factory\) and \(Panchayat\). Each dialogue is manually annotated as Humour or Non-Humour. Along with presenting a new Hindi language-based Humour detection dataset, we propose an improved framework for detecting humour in Hindi conversations. We start by preprocessing both datasets to obtain uniformity across the dialogues and datasets. The processed dialogues are then passed through the Skip-gram model for generating Hindi word embedding. The generated Hindi word embedding is then passed onto three convolutional neural network (CNN) architectures simultaneously, each having a different filter size for feature extraction. The extracted features are then passed through stacked Long Short-Term Memory (LSTM) layers for further processing and finally classifying the dialogues as Humour or Non-Humour. We conduct intensive experiments on both proposed Hindi datasets and evaluate several standard performance metrics. The performance of our proposed framework was also compared with several baselines and contemporary algorithms for Humour detection. The results demonstrate the effectiveness of our dataset to be used as a standard dataset for Humour detection in the Hindi web series. The proposed model yields an accuracy of 91.79 and 87.32 while an F1 score of 91.64 and 87.04 in percentage for the \(Kota-Factory\) and \(Panchayat\) datasets, respectively.

HumourHindiNet:利用单词嵌入和卷积神经网络检测印地语网络剧中的幽默内容
幽默是人类语言的一个重要方面,因此,创建一个能够提供幽默检测的系统势在必行。虽然有关英语语音中幽默的数据非常丰富,但像印地语这样的低资源语言却不尽相同。通过本文,我们介绍了两个用于检测印地语网络系列中幽默的多模态数据集。该数据集是从印地语网络系列剧(Kota-Factory)和(Panchayat)角色之间超过 500 分钟的对话中收集的。每段对话都被人工标注为 "幽默 "或 "非幽默"。在提出基于印地语的新幽默检测数据集的同时,我们还提出了一个改进的框架,用于检测印地语对话中的幽默。我们首先对两个数据集进行预处理,以获得对话和数据集的一致性。然后,将处理过的对话通过 Skip-gram 模型生成印地语单词嵌入。生成的印地语单词嵌入会同时传递给三个卷积神经网络(CNN)架构,每个架构都有不同的滤波器大小,用于提取特征。提取的特征随后通过堆叠的长短期记忆(LSTM)层进行进一步处理,并最终将对话分类为 "幽默"(Humour)或 "非幽默"(Non-Humour)。我们在两个拟议的印地语数据集上进行了深入实验,并评估了多个标准性能指标。我们提出的框架的性能还与若干基准和当代幽默检测算法进行了比较。结果表明,我们的数据集可以有效地用作印地语网络系列中幽默检测的标准数据集。所提出的模型在(Kota-Factory)和(Panchayat)数据集上的准确率分别为91.79和87.32,F1得分分别为91.64和87.04。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
CiteScore
3.60
自引率
15.00%
发文量
241
期刊介绍: The ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP) publishes high quality original archival papers and technical notes in the areas of computation and processing of information in Asian languages, low-resource languages of Africa, Australasia, Oceania and the Americas, as well as related disciplines. The subject areas covered by TALLIP include, but are not limited to: -Computational Linguistics: including computational phonology, computational morphology, computational syntax (e.g. parsing), computational semantics, computational pragmatics, etc. -Linguistic Resources: including computational lexicography, terminology, electronic dictionaries, cross-lingual dictionaries, electronic thesauri, etc. -Hardware and software algorithms and tools for Asian or low-resource language processing, e.g., handwritten character recognition. -Information Understanding: including text understanding, speech understanding, character recognition, discourse processing, dialogue systems, etc. -Machine Translation involving Asian or low-resource languages. -Information Retrieval: including natural language processing (NLP) for concept-based indexing, natural language query interfaces, semantic relevance judgments, etc. -Information Extraction and Filtering: including automatic abstraction, user profiling, etc. -Speech processing: including text-to-speech synthesis and automatic speech recognition. -Multimedia Asian Information Processing: including speech, image, video, image/text translation, etc. -Cross-lingual information processing involving Asian or low-resource languages. -Papers that deal in theory, systems design, evaluation and applications in the aforesaid subjects are appropriate for TALLIP. Emphasis will be placed on the originality and the practical significance of the reported research.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信