Data in Brief, Pub Date: 2025-06-24, DOI: 10.1016/j.dib.2025.111826
Ayub Othman Abdulrahman, Shanga Ismail Othman, Gazo Badran Yasin, Meer Salam Ali
{"title":"A dataset for classifying phrases and sentences into statements, questions, or exclamations based on sound pitch","authors":"Ayub Othman Abdulrahman, Shanga Ismail Othman, Gazo Badran Yasin, Meer Salam Ali","doi":"10.1016/j.dib.2025.111826","DOIUrl":"10.1016/j.dib.2025.111826","url":null,"abstract":"<div><div>Speech is the most fundamental and sophisticated channel of human communication, and breakthroughs in Natural Language Processing (NLP) have substantially raised the quality of human-computer interaction. In particular, a new wave of deep learning methods has significantly advanced human speech recognition by capturing fine-grained acoustic cues including pitch, an acoustic feature that can be critical to understanding communicative intent. Pitch variation is particularly important for prosodic classification tasks (i.e., distinguishing statements, questions, and exclamations), which are crucial in tonal and low-resource languages such as Kurdish, where intonation carries significant semantic information. This paper presents the Statements, Questions, or Exclamations Based on Sound Pitch (SQEBSP) dataset, which contains 12,660 professionally recorded speech audio clips from 431 native Kurdish speakers who reside in the Kurdistan Region of Iraq.</div><div>Each speaker articulated 10 new phrases for each of the three prosodic categories: statements, questions, and exclamations. All utterances were digitized at 16 kHz and then manually checked for correct pitch-based classification. The dataset contains equal representation of all three classes, about 4200 samples per class, and metadata such as speaker gender, age group, and sentence identifiers.</div><div>The original audio files, alongside resources like Mel-Frequency Cepstral Coefficients (MFCCs) and waveform visualizations, can be found on Mendeley Data. 
The dataset offers significant advantages for formulating and testing pitch-based speech classification algorithms and furthers work on pronunciation modelling for under-resourced languages. It also aids in developing speech technologies sensitive to dialects.</div></div>","PeriodicalId":10973,"journal":{"name":"Data in Brief","volume":"61 ","pages":"Article 111826"},"PeriodicalIF":1.0,"publicationDate":"2025-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144501820","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data in Brief, Pub Date: 2025-06-24, DOI: 10.1016/j.dib.2025.111829
Yang Wang, Mengru Zhao, Zhe Wang, Xiaohong Luo, Chengfei Wang, Baoyuan Guo
{"title":"Whole genome sequence data of Comamonas sediminis FS4_11, a fumonisin B1-transforming bacterium, using hybrid nanopore-illumina sequencing","authors":"Yang Wang , Mengru Zhao , Zhe Wang , Xiaohong Luo , Chengfei Wang , Baoyuan Guo","doi":"10.1016/j.dib.2025.111829","DOIUrl":"10.1016/j.dib.2025.111829","url":null,"abstract":"<div><div>The genome of Comamonas sediminis FS4_11, a bacterial strain with mycotoxin fumonisin B1 (FB1) transformation capability, was sequenced using Oxford Nanopore Technologies (ONT) and Illumina platforms. The final assembly generated a circular chromosome of 5,148,490 bp with a mean G+C content of 63.74%, representing a contiguous genomic structure. Genome annotation predicted 4565 protein-coding sequences (CDSs), 82 transfer RNAs (tRNAs), 18 ribosomal RNAs (rRNAs; 6 each of 5S, 16S, and 23S rRNA), 1 transfer-messenger RNA (tmRNA), and 8 pseudogenes and other non-coding RNAs. Functional annotation identified 939 potential virulence factors, two putative AdeF-related antibiotic resistance genes, 1486 potential pathogen-host interaction proteins, and a candidate carboxylesterase for FB1 transformation. This dataset primarily aids in identifying potential FB1 detoxification enzyme genes and assessing strain biosafety. It also offers significant reuse potential for comparative genomics and understanding bacterial evolution.</div></div>","PeriodicalId":10973,"journal":{"name":"Data in Brief","volume":"61 ","pages":"Article 111829"},"PeriodicalIF":1.0,"publicationDate":"2025-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144518481","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dataset on energy consumption in buildings within tropical climate based on design aspects of courtyards","authors":"Abdulbasit Almhafdy , Ashjan Al-Mutairi , Asma Al-Shargabi , Amal Al-Shargabi","doi":"10.1016/j.dib.2025.111834","DOIUrl":"10.1016/j.dib.2025.111834","url":null,"abstract":"<div><div>Sustainability and energy efficiency have become fundamental objectives for modern society. Green roofs and facades are increasingly recognized as innovative and sustainable strategies to improve the energy performance of buildings. This paper introduces a dataset on building thermal performance and energy consumption in a tropical climate as a function of the design features of adjacent enclosed outdoor courtyards with three architectural shapes: U, L, and O. The core data were collected in a public building in Kuala Lumpur, Malaysia, and then expanded using simulation. The measured raw data are temperatures; the remaining data are simulated and/or calculated. The dataset includes detailed design features of courtyards such as plan aspect ratio, number of floors, and orientation. Measurement instruments were calibrated against real-world measurements to ensure accuracy and reliability. The simulated data were tested and validated against the statistical properties of the raw data using the Pearson correlation coefficient, which had a value of 0.882. The dataset includes a total of 8,685 records across the different courtyard shapes. This dataset captures intricate relationships between architectural design parameters and energy consumption, making it a valuable resource for architects, engineers, and researchers interested in optimizing building designs for improved energy efficiency. 
It also allows in-depth analysis and potential reuse in studies related to sustainable architecture and urban planning.</div></div>","PeriodicalId":10973,"journal":{"name":"Data in Brief","volume":"61 ","pages":"Article 111834"},"PeriodicalIF":1.0,"publicationDate":"2025-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144490683","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data in Brief, Pub Date: 2025-06-24, DOI: 10.1016/j.dib.2025.111828
Diego Miranda, Carlos Escobedo, Dayana Palma, Rene Noel, Adrián Fernández, Cristian Cechinel, Jaime Godoy, Roberto Munoz
{"title":"A multimodal experimental dataset on agile software development team interactions","authors":"Diego Miranda , Carlos Escobedo , Dayana Palma , Rene Noel , Adrián Fernández , Cristian Cechinel , Jaime Godoy , Roberto Munoz","doi":"10.1016/j.dib.2025.111828","DOIUrl":"10.1016/j.dib.2025.111828","url":null,"abstract":"<div><div>Studying collaborative dynamics in agile development teams requires multimodal data that captures verbal and non-verbal communication. However, few experimental datasets provide this level of depth in real or simulated teamwork contexts. This article presents a multimodal dataset with experimental data collected during controlled sessions involving simulated agile development teams, each composed of four computer science students. A total of 19 groups (76 different participants) were organized, each participating in two collaborative activities: one without a coordination technique and another using the Planning Poker method. Three of these teams were designated as control groups. The resulting dataset includes audio recordings of verbal interactions and non-verbal behaviour data, such as body posture, facial expressions, visual attention, and gestures, captured using MediaPipe, YOLOv8, and DeepSort. It also contains time-aligned automatic transcriptions generated with WhisperX, attention logs, mimicry labels, and surveys on perceived equity in interactions. This resource aims to provide a comprehensive view of collaborative behaviour in agile contexts, supporting both qualitative analysis of interactions and the development of predictive models of group performance. Through this multimodal approach, the dataset supports exploring how shared visual attention and behavioural synchrony influence team effectiveness and decision-making. 
This work contributes a unique dataset valuable to researchers across multiple fields of study.</div></div>","PeriodicalId":10973,"journal":{"name":"Data in Brief","volume":"61 ","pages":"Article 111828"},"PeriodicalIF":1.0,"publicationDate":"2025-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144501818","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data in Brief, Pub Date: 2025-06-24, DOI: 10.1016/j.dib.2025.111825
Pushpa B. R., Manohar N., N. Shobha Rani
{"title":"StarNet: Indian star gooseberries dataset for quality and maturity assessment","authors":"Pushpa B․ R․ , Manohar N․ , N. Shobha Rani","doi":"10.1016/j.dib.2025.111825","DOIUrl":"10.1016/j.dib.2025.111825","url":null,"abstract":"<div><div>Star gooseberry provides immense health benefits and is widely recognized in the Indian medicinal system. It holds significant importance in the food production, pharmaceutical, and cosmetics industries owing to its therapeutic and pharmacological properties, and the fruit is widely used in treating various ailments. Cultivating these fruits therefore presents an opportunity to generate revenue, benefiting both farmers and the agricultural sector. Post-harvest quality assessment is typically performed by segregating fruits based on visual characteristics, which is tedious and prone to human error. Hence, there is a need to develop an automated computer vision model to assess fruit quality more accurately. This study focuses on dataset collection, including image samples of both single and multiple star gooseberry fruits, to automate fruit grading. This dataset has been specifically developed for research purposes, contributing to fruit detection, quality assessment, weight estimation, and classification of fruits at various ripeness stages. Further, it provides researchers with an opportunity to develop automated systems for detecting overlapping fruits and touching contours using machine learning, deep learning, and computer vision. Image samples of star gooseberry at different growth stages were collected from orchards in Mysuru, India. The dataset, named “AmlaNet”, comprises 792 image samples of star gooseberry, captured against a plain background from varying angles, sizes, brightness levels, and distances. 
The dataset is organized into four folders: single star gooseberry fruit, multiple fruits, overlapped fruits, and annotated samples of overlapped star gooseberry fruits, including fruit samples at different ripeness stages. This publicly accessible dataset is expected to benefit the research community, enabling advancement in computer vision and AI applications. It can be accessed at DOI: 10.17632/2255bdy9mm.1</div></div>","PeriodicalId":10973,"journal":{"name":"Data in Brief","volume":"61 ","pages":"Article 111825"},"PeriodicalIF":1.0,"publicationDate":"2025-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144534812","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data in Brief, Pub Date: 2025-06-23, DOI: 10.1016/j.dib.2025.111821
Nurul Syahidah Mio Asni, Norazlan Mohmad Misnan, Ahmed Mediani, Ivana Nur Allisya Rozlan, Nurul Amalia Zahari, Syarul Nataqain Baharum, Nurkhalida Kamal
{"title":"Mass spectrometry dataset of conventional and organic tempe before and after in vitro digestion","authors":"Nurul Syahidah Mio Asni , Norazlan Mohmad Misnan , Ahmed Mediani , Ivana Nur Allisya Rozlan , Nurul Amalia Zahari , Syarul Nataqain Baharum , Nurkhalida Kamal","doi":"10.1016/j.dib.2025.111821","DOIUrl":"10.1016/j.dib.2025.111821","url":null,"abstract":"<div><div>Tempe is a superior plant-based protein source that provides a diverse array of nutritional benefits owing to the presence of bioactive metabolites. Nevertheless, information is scarce on how the metabolomic profiles of organic and conventional tempe differ and on the fate of these metabolites after <em>in vitro</em> digestion. This report examines the metabolomic profile of soybean as raw material and of tempe before and after the <em>in vitro</em> digestion process. We obtained a comprehensive set of metabolomic data using ultra-high-performance liquid chromatography coupled with high-resolution mass spectrometry (UHPLC-HRMS). The metabolomics dataset is organized into Excel sheets and structured according to polarity, mass-to-charge ratio (m/z), retention time, feature name, biological replicates and controls. These data offer preliminary insights into the metabolite profile of tempe samples, encompassing the source material (soybean), tempe, and tempe digesta.</div></div>","PeriodicalId":10973,"journal":{"name":"Data in Brief","volume":"61 ","pages":"Article 111821"},"PeriodicalIF":1.0,"publicationDate":"2025-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144490277","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Complete genome sequences of Vibrio parahaemolyticus strains L2171 and L2181 associated with AHPND in Penaeus vannamei postlarvae by hybrid sequencing","authors":"Guillermo Reyes , Betsy Andrade , Irma Betancourt , Bonny Bayot","doi":"10.1016/j.dib.2025.111819","DOIUrl":"10.1016/j.dib.2025.111819","url":null,"abstract":"<div><div><em>Vibrio parahaemolyticus</em> strains L2171 and L2181 were isolated from a <em>Penaeus vannamei</em> shrimp hatchery. Both strains carry the pVA plasmid harboring the <em>PirAB</em> genes, which encode the binary PirAB toxins that cause acute hepatopancreatic necrosis disease (AHPND) in cultured shrimp. The strains also harbor multidrug resistance (MDR) and a repertoire of virulence factor genes. Our goal was to determine their complete genome sequences and perform a comprehensive analysis of their genetic characteristics. Therefore, the genomes of the two strains, which are highly virulent to shrimp, were sequenced on the Illumina and PacBio platforms. These data contribute to a better understanding of <em>V. parahaemolyticus</em> and its role as a pathogen in commercially important species such as farmed shrimp, providing valuable insights for disease management in aquaculture.</div></div>","PeriodicalId":10973,"journal":{"name":"Data in Brief","volume":"61 ","pages":"Article 111819"},"PeriodicalIF":1.0,"publicationDate":"2025-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144501816","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data in Brief, Pub Date: 2025-06-22, DOI: 10.1016/j.dib.2025.111818
Jefferson Torres-Quezada, Atila Avila-Argudo
{"title":"Dataset of operational energy of forty Andean buildings from 1980 to 2020","authors":"Jefferson Torres-Quezada , Atila Avila-Argudo","doi":"10.1016/j.dib.2025.111818","DOIUrl":"10.1016/j.dib.2025.111818","url":null,"abstract":"<div><div>This dataset presents operational energy data from forty residential buildings constructed between 1980 and 2020 in Cuenca, a city in the Andean region of Ecuador. It includes energy consumption data related to heating, cooling, lighting, electrical appliances, domestic hot water and cooking. Ten sample houses from each decade were selected, representing typical construction practices of their respective periods. The study follows three main stages: (1) Analysis of operational energy consumption, showcasing the evolution of energy use across four decades; (2) Simulation and validation, where energy simulations and calculations are performed for each sample, followed by a validation process using <em>in-situ</em> measurements compared with simulated results; and (3) Data curation, where climate data is compiled and updated for further analysis. This dataset includes files and figures that enhance comprehension and support further research on energy efficiency, sustainable building design, and energy policy development in regions with moderate climates. 
It also enables comparisons with datasets from other geographic regions, contributing to a broader understanding of energy demand patterns in residential buildings.</div></div>","PeriodicalId":10973,"journal":{"name":"Data in Brief","volume":"61 ","pages":"Article 111818"},"PeriodicalIF":1.0,"publicationDate":"2025-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144490088","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data in Brief, Pub Date: 2025-06-21, DOI: 10.1016/j.dib.2025.111823
Marcos Gabriel Mendes Lauande, Geraldo Braz Júnior, João Dallyson Sousa de Almeida, Vandecia Rejane Monteiro Fernandes, Anselmo Cardoso de Paiva, Rui Miguel Gil da Costa, Amanda Mara Teles, Leandro Lima da Silva, Haissa Oliveira Brito, Flávia Castello Branco Vidal
{"title":"PCPAm - A dataset of histopathological images of penile cancer for classification tasks","authors":"Marcos Gabriel Mendes Lauande , Geraldo Braz Júnior , João Dallyson Sousa de Almeida , Vandecia Rejane Monteiro Fernandes , Anselmo Cardoso de Paiva , Rui Miguel Gil da Costa , Amanda Mara Teles , Leandro Lima da Silva , Haissa Oliveira Brito , Flávia Castello Branco Vidal","doi":"10.1016/j.dib.2025.111823","DOIUrl":"10.1016/j.dib.2025.111823","url":null,"abstract":"<div><div>Penile cancer has an incidence strongly linked to sociocultural factors and is more common in underdeveloped countries like Brazil, where it represents approximately 2% of cancers affecting men. This dataset was created to address the scarcity of publicly available resources for classifying histopathological images in penile cancer research. The images were collected in 2021 from tissue samples obtained through biopsies of patients undergoing treatment for penile cancer. After staining with Hematoxylin and Eosin (H&E), the tissue samples were photographed using a Leica ICC50 HD camera attached to a bright-field microscope (Leica DM500). The dataset comprises 194 high-resolution images (2048 × 1536 pixels), categorized by magnification (40X and 100X) and pathological classification (Tumor or Non-Tumor). Metadata includes additional information such as histological grade and, for some images, Human Papilloma Virus (HPV) status. Although previous works have focused primarily on binary classification tasks, the histological grade and HPV labels provide opportunities for multi-label classification or other types of predictive modelling, enhancing the dataset’s versatility for more complex tasks in medical image analysis. 
The dataset holds significant reuse potential for machine learning tasks beyond binary classification, allowing researchers to explore additional layers of analysis, such as HPV detection and histological grading. It can also be used for model benchmarking and comparative studies in cancer research, contributing to developing new diagnostic tools. The dataset and metadata are available for further research and model development.</div></div>","PeriodicalId":10973,"journal":{"name":"Data in Brief","volume":"61 ","pages":"Article 111823"},"PeriodicalIF":1.0,"publicationDate":"2025-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144490679","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data in Brief, Pub Date: 2025-06-21, DOI: 10.1016/j.dib.2025.111816
Amandine Cunty, Jessica Dittmer, Déborah Merda, Bruno Legendre, Benoit Remenant, Yannick Blanchard, Sophie Cesbron, Marie-Agnès Jacques, Pascal Gentit, Anne-Laure Boutigny
{"title":"Complete genome sequence data of Xylella fastidiosa subspecies multiplex ST88 and ST89 indicate distinct introductions in France","authors":"Amandine Cunty , Jessica Dittmer , Déborah Merda , Bruno Legendre , Benoit Remenant , Yannick Blanchard , Sophie Cesbron , Marie-Agnès Jacques , Pascal Gentit , Anne-Laure Boutigny","doi":"10.1016/j.dib.2025.111816","DOIUrl":"10.1016/j.dib.2025.111816","url":null,"abstract":"<div><div><em>Xylella fastidiosa</em> is a Gram-negative bacterium native to the Americas and classified as a priority pest under EU regulations. This xylem-limited plant pathogenic bacterium has a wide host range and is transmitted by insect vectors. Since 2013, <em>X. fastidiosa</em> has been identified in several European countries including Italy, France, Spain and Portugal, with different subspecies and sequence types (ST) detected. Since 2015, most strains identified in France have belonged to the subspecies <em>multiplex,</em> specifically ST6 and ST7. Two new STs of <em>X. fastidiosa</em> subsp. <em>multiplex,</em> ST88 and ST89, were recently detected in the Provence-Alpes-Côte d’Azur (PACA) region, and one strain of each ST has been isolated from infected plants. To investigate the phylogenetic relationships between the four STs present in France, a complete circular genome and a single-contig genome were assembled for the ST89 and ST88 strains, respectively, by combining PacBio and Illumina sequencing data. A phylogenomic analysis was performed to investigate the phylogenetic position and potential origin of these new strains. This data article contributes to improving our knowledge of the diversity and origin of <em>X. fastidiosa</em> subsp. 
<em>multiplex</em> in France and Europe.</div></div>","PeriodicalId":10973,"journal":{"name":"Data in Brief","volume":"61 ","pages":"Article 111816"},"PeriodicalIF":1.0,"publicationDate":"2025-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144490680","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}