A dataset for classifying phrases and sentences into statements, questions, or exclamations based on sound pitch
Ayub Othman Abdulrahman, Shanga Ismail Othman, Gazo Badran Yasin, Meer Salam Ali
Data in Brief, Volume 61, Article 111826 (published 2025-06-24)
DOI: 10.1016/j.dib.2025.111826
URL: https://www.sciencedirect.com/science/article/pii/S2352340925005530
Citations: 0
Abstract
Speech is the most fundamental and sophisticated channel of human communication, and breakthroughs in Natural Language Processing (NLP) have substantially raised the quality of human-computer interaction. In particular, a new wave of deep learning methods has significantly advanced speech recognition by capturing fine-grained acoustic cues, including pitch, an acoustic feature that can be a critical ingredient in understanding communicative intent. Pitch variation is particularly important for prosodic classification tasks (i.e., distinguishing statements, questions, and exclamations), which is crucial in tonal and low-resource languages such as Kurdish, where intonation carries significant semantic information. This paper presents the Statements, Questions, or Exclamations Based on Sound Pitch (SQEBSP) dataset, which contains 12,660 professionally recorded speech audio clips from 431 native Kurdish speakers residing in the Kurdistan Region of Iraq.
Each speaker articulated 10 new phrases in each of the three prosodic categories: statements, questions, and exclamations. All utterances were digitized at 16 kHz and then manually checked for correctness with respect to pitch-based classification. The dataset is balanced across the three classes, with about 4,200 samples per class, and includes metadata such as speaker gender, age group, and sentence identifiers.
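The pitch cue that separates the three classes can be probed with a simple fundamental-frequency (F0) estimator. The sketch below is a minimal autocorrelation-based estimator in NumPy, assuming 16 kHz mono audio as in the dataset; it is an illustration, not the authors' method, and `estimate_pitch` is a hypothetical helper name.

```python
import numpy as np

def estimate_pitch(frame, sr=16000, fmin=80.0, fmax=400.0):
    """Estimate F0 of a mono frame via autocorrelation.

    Hypothetical helper for illustration; fmin/fmax bound the
    search to a typical adult speech range.
    """
    frame = frame - frame.mean()
    # Autocorrelation for non-negative lags only.
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sr / fmax)  # shortest plausible pitch period
    lag_max = int(sr / fmin)  # longest plausible pitch period
    lag = lag_min + np.argmax(corr[lag_min:lag_max])
    return sr / lag

# Synthetic check: a 50 ms, 200 Hz tone sampled at 16 kHz.
sr = 16000
t = np.arange(int(0.05 * sr)) / sr
tone = np.sin(2 * np.pi * 200.0 * t)
f0 = estimate_pitch(tone, sr)  # close to 200 Hz
```

A rising F0 contour toward the end of an utterance is a common correlate of questions, which is the kind of pattern a classifier trained on this dataset would exploit.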
The original audio files, alongside derived resources such as Mel-Frequency Cepstral Coefficients (MFCCs) and waveform visualizations, can be found on Mendeley Data. The dataset offers significant value for formulating and testing pitch-based speech classification algorithms, furthers work on pronunciation modelling for languages lacking sufficient resources, and aids in developing dialect-sensitive speech technologies.
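MFCC features like those distributed with the dataset can be recomputed from the raw 16 kHz audio. Below is a minimal NumPy sketch of the standard MFCC pipeline (framing, Hamming window, power spectrum, triangular mel filterbank, log, DCT-II); the parameter choices (25 ms frames, 10 ms hop, 26 mel bands, 13 coefficients) are common defaults assumed here, not values confirmed by the paper.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512,
         n_mels=26, n_ceps=13):
    """Minimal MFCC extractor (illustrative; parameters are common
    defaults, not taken from the SQEBSP paper)."""
    # Slice the signal into overlapping frames and window each one.
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)

    # Periodogram estimate of the power spectrum per frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # Triangular filters spaced evenly on the mel scale, 0 Hz to Nyquist.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, mid):
            fbank[m - 1, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):
            fbank[m - 1, k] = (hi - k) / max(hi - mid, 1)

    # Log mel energies, then DCT-II to decorrelate them.
    log_mel = np.log(power @ fbank.T + 1e-10)
    n = np.arange(n_mels)
    basis = np.cos(np.pi * np.arange(n_ceps)[:, None] * (n + 0.5) / n_mels)
    return log_mel @ basis.T

# One second of synthetic 16 kHz audio yields a (frames x 13) matrix.
sig = np.sin(2 * np.pi * 200.0 * np.arange(16000) / 16000.0)
coeffs = mfcc(sig)
```

In practice a library such as librosa or python_speech_features would be used instead; this sketch only makes the published feature files reproducible in principle.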
Journal overview:
Data in Brief provides a way for researchers to easily share and reuse each other's datasets by publishing data articles that:
- Thoroughly describe your data, facilitating reproducibility.
- Make your data, which is often buried in supplementary material, easier to find.
- Increase traffic towards associated research articles and data, leading to more citations.
- Open up doors for new collaborations.
Because you never know what data will be useful to someone else, Data in Brief welcomes submissions that describe data from all research areas.