Data in Brief | Pub Date: 2025-09-07 | DOI: 10.1016/j.dib.2025.112046
Fairuz Iqbal Maulana , Yaya Heryadi , Gede Putra Kusuma , Widodo Budiharto
{"title":"Data augmentation English-Indonesia-Madurese parallel corpus dataset using neural machine translation","authors":"Fairuz Iqbal Maulana , Yaya Heryadi , Gede Putra Kusuma , Widodo Budiharto","doi":"10.1016/j.dib.2025.112046","DOIUrl":"10.1016/j.dib.2025.112046","url":null,"abstract":"<div><div>INMAD is a dataset containing a corpus of English-Indonesian-Madurese translated sentences. The corpus stores 23,086 sentences together with their translations in Indonesian and English. The Madurese translations cover one language level, namely the 'engghi-enten' level. The framework for creating the dataset consists of two stages. First, source parallel corpora are combined to build the sentence corpus and improve its quality. Second, the data are augmented by back-translation using MarianMT, and the augmented parallel data are combined with the original parallel corpus. INMAD was validated by a Madurese language specialist, who also served as the translator for the source of this dataset. Consequently, this dataset can serve as a primary resource for Natural Language Processing (NLP) research, particularly for Madurese at the 'engghi-enten' level.</div></div>","PeriodicalId":10973,"journal":{"name":"Data in Brief","volume":"62 ","pages":"Article 112046"},"PeriodicalIF":1.4,"publicationDate":"2025-09-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145095157","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data in Brief | Pub Date: 2025-09-07 | DOI: 10.1016/j.dib.2025.112037
Md Sajedur Rahman , Md Mahfuz Ahmed Nahin , Md Mahbubur Rahman , Mollika Rani , Md Ashraful Islam , Al Bashir , Ahmad Shafkat , Bijon Mallik , Yaqoob Majeed
{"title":"MangoClassify-12: A high-resolution image dataset of twelve indigenous Bangladeshi mango cultivars","authors":"Md Sajedur Rahman , Md Mahfuz Ahmed Nahin , Md Mahbubur Rahman , Mollika Rani , Md Ashraful Islam , Al Bashir , Ahmad Shafkat , Bijon Mallik , Yaqoob Majeed","doi":"10.1016/j.dib.2025.112037","DOIUrl":"10.1016/j.dib.2025.112037","url":null,"abstract":"<div><div>A high-resolution image dataset, MangoClassify-12, comprising 3900 JPEG images of twelve indigenous Bangladeshi mango cultivars, was assembled to enable automated classification. Images were captured between early May and July 10th, 2025, from three distinct production regions (Mirpur-2, Dhaka; Phulbari, Dinajpur; Rajshahi) under natural light using four smartphones. All images were reviewed by agricultural experts to exclude damaged or overripe specimens. The dataset covers twelve cultivars: Amrapali, Himsagar, Harivanga, Langra, Fazli, Gopalbhog, Ranibhog, Gobindobhog, Sundari, Banana Mango, Bari-4 and Khirsapat. Metadata are organized in a structured folder hierarchy. MangoClassify-12 is openly accessible via DOI and supports machine learning applications such as variety identification, quality assessment and mobile-based recognition. By providing raw images without predefined splits or augmentations, the dataset offers a flexible benchmark for computer vision research in agriculture.</div></div>","PeriodicalId":10973,"journal":{"name":"Data in Brief","volume":"62 ","pages":"Article 112037"},"PeriodicalIF":1.4,"publicationDate":"2025-09-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145095155","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data in Brief | Pub Date: 2025-09-06 | DOI: 10.1016/j.dib.2025.112034
Andrew Katumba , Sudi Murindanyi , Nixson Okila , Joyce Nakatumba-Nabende , Cosmas Mwikirize , Jonathan Serugunda , Samuel Bugeza , Anthony Oriekot , Juliet Bossa , Eva Nabawanuka
{"title":"A dataset of lung ultrasound images for automated AI-based lung disease classification","authors":"Andrew Katumba , Sudi Murindanyi , Nixson Okila , Joyce Nakatumba-Nabende , Cosmas Mwikirize , Jonathan Serugunda , Samuel Bugeza , Anthony Oriekot , Juliet Bossa , Eva Nabawanuka","doi":"10.1016/j.dib.2025.112034","DOIUrl":"10.1016/j.dib.2025.112034","url":null,"abstract":"<div><div>Lung ultrasound (LUS) is increasingly recognized as a valuable imaging modality for evaluating various pulmonary conditions. Despite its clinical utility, accurate interpretation of LUS remains challenging due to factors such as inter-operator variability, dependence on sonographer expertise, and inherently low signal-to-noise ratios. This article presents a curated benchmark dataset of labelled LUS images acquired in Uganda, intended to support the development of automated, AI-based diagnostic tools for lung disease classification. The dataset comprises 1062 labelled images collected from patients at Mulago National Referral Hospital and Kiruddu Referral Hospital by senior radiologists. The dataset is suitable for training and evaluating convolutional neural network-based models and is expected to facilitate research in developing robust deep learning systems for pulmonary disease diagnosis using LUS.</div></div>","PeriodicalId":10973,"journal":{"name":"Data in Brief","volume":"62 ","pages":"Article 112034"},"PeriodicalIF":1.4,"publicationDate":"2025-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145095160","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data in Brief | Pub Date: 2025-09-05 | DOI: 10.1016/j.dib.2025.112016
Noelia Lopez-Duran, David Romero-Organvidez, Fermín L. Cruz, David Benavides
{"title":"Software bug report dataset from Eclipse projects","authors":"Noelia Lopez-Duran, David Romero-Organvidez, Fermín L. Cruz, David Benavides","doi":"10.1016/j.dib.2025.112016","DOIUrl":"10.1016/j.dib.2025.112016","url":null,"abstract":"<div><div>In recent decades, the analysis of data from software projects, including source control systems, defect tracking systems, and code review repositories, has greatly improved our understanding of software development and its evolution. However, obtaining this information can be time-consuming, and the extracted data is not always well-maintained. This paper introduces an extensive dataset generated from Bugzilla repositories, focusing on key products from the Eclipse bug-tracking system. This dataset addresses the need for up-to-date data in existing repositories, preserving crucial historical information that may be lost due to the transition from Bugzilla to newer bug-tracking systems like Jira or GitHub Issues. Our dataset includes 301,378 bug reports along with all related information, organised into different folders that indicate the project in which the bug was filed. Additionally, we present a custom and lightweight Command Line Interface (CLI) tool designed to efficiently extract detailed information from Bugzilla repositories, automating data collection across various Bugzilla instances. The dataset and tool can be utilized for defect prediction, software maintenance, and evolutionary analysis. To the best of our knowledge, this is the largest, most complete, and up-to-date dataset of Eclipse bug reports available.</div></div>","PeriodicalId":10973,"journal":{"name":"Data in Brief","volume":"62 ","pages":"Article 112016"},"PeriodicalIF":1.4,"publicationDate":"2025-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145044273","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Planes of thought dataset: A new dataset for the measurement of human thought by artificial intelligence model based on Cloninger's theory","authors":"Atra Joudaki , Leyli Mohammad Khanli , Alireza Farnam , Yashar Sarbaz , Jafar Tanha","doi":"10.1016/j.dib.2025.112028","DOIUrl":"10.1016/j.dib.2025.112028","url":null,"abstract":"<div><div>Today, we have seen remarkable progress in artificial intelligence, especially in natural language processing, chatbots, and sentiment analysis. Using sentiment analysis techniques, chatbots can better understand what users say and produce more useful answers. The more the chatbot understands what the users say, the more interactions between the machine and human will be created. To this end, artificial intelligence systems must be able to go beyond sentiment analysis. For this, models must be able to describe and measure thought. To solve this challenge, we have created a dataset using Cloninger's theory. In this theory, Cloninger created a global model of human thought and its development, considering the evolution of animal learning abilities to measure thought. Since the thought of humans has not been measured before and accurate measurement is required to perform scientific work, our goal in providing this dataset is to enable artificial intelligence models to do this. Cloninger has divided human thought into five different planes. These planes include: sexual (2), material (3), emotional (4), intellectual (5), and spiritual (7). To collect this dataset, three experts labeled the first 10,000 frequently used dictionary words using Cloninger's theory. We then used these labeled words as ground truths to label sentences. In this dataset, we have labeled 20,000 sentences using this theory so that the dataset can be used to help artificial intelligence models better understand the user's statements.</div></div>","PeriodicalId":10973,"journal":{"name":"Data in Brief","volume":"62 ","pages":"Article 112028"},"PeriodicalIF":1.4,"publicationDate":"2025-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145095156","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LLM-based assessment of HTTPS cybersecurity awareness: Dataset from Moroccan web users and webmasters","authors":"Abdelhadi Zineddine , Abdeslam Rehaimi , Mohamed Zaoui , Yousra Belfaik , Yassine Sadqi , Said Safi","doi":"10.1016/j.dib.2025.112024","DOIUrl":"10.1016/j.dib.2025.112024","url":null,"abstract":"<div><div>Cybersecurity awareness plays a fundamental role in protecting digital communications, particularly in the deployment and use of the HTTPS protocol. While previous studies have explored website security practices, there is a lack of available datasets that empirically assess both awareness levels and implementation behaviors of web users and website administrators. This dataset addresses this gap by analyzing cybersecurity awareness and HTTPS-related behaviors of 440 Moroccan voluntary participants, including web users and webmasters. Data was collected via a structured Google Forms survey, disseminated through web development and cybersecurity communities on online platforms such as Facebook, WhatsApp and LinkedIn.</div><div>The responses collected from multiple-choice questions (MCQs) and free-text entries (categorized using the GPT-4o large language model (LLM)) were pre-processed and score-encoded according to a predefined mapping scheme. Participants’ awareness levels were classified as Low, Moderate, or High based on total scores. To identify behavioral patterns, the unsupervised KMeans clustering algorithm was applied separately to user and webmaster groups. Principal Component Analysis (PCA) and LLM-based interpretation provided insights into awareness profiles and cybersecurity risk behaviors.</div><div>The dataset includes raw survey responses, score-encoded data, clustering outputs, and LLM-generated awareness assessment reports. It serves both as supplementary material for a novel hybrid cybersecurity assessment methodology for HTTPS deployment presented in [1], and as a standalone resource for researchers and practitioners examining HTTPS usage, certificate management, and behavioral risk profiling. This dataset is a valuable asset for empirical research and practical improvements in cybersecurity awareness within role-based and regional web ecosystems.</div></div>","PeriodicalId":10973,"journal":{"name":"Data in Brief","volume":"62 ","pages":"Article 112024"},"PeriodicalIF":1.4,"publicationDate":"2025-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145044275","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"JujubeBruiseNet: A high-resolution image dataset for bruise detection in Ziziphus mauritiana","authors":"Md Arham Tabib, Sumyia Sabrin Liza, Md Mizanur Rahman","doi":"10.1016/j.dib.2025.112031","DOIUrl":"10.1016/j.dib.2025.112031","url":null,"abstract":"<div><div>The article presents JujubeBruiseNet, a high-resolution image dataset designed for bruise detection in <em>Ziziphus mauritiana</em> (jujube) fruits. <em>Ziziphus mauritiana</em> is a seasonal fruit often found in late summer to early fall. Bruise detection in this fruit is crucial for post-harvesting, fruit processing, and food packaging. Manual detection of bruises is time-consuming and often leads to inaccuracy. Therefore, it is essential to develop a classification model that immediately recognizes bruises in the fruits and thereby decreases human effort, expenses, and production time in the agriculture sector. The dataset contains a total of 1464 original photos categorized into two classes, labelled Healthy and Bruised. We collected the fruit from the local market and fields near Savar, Dhaka, Bangladesh, with the help of domain experts in the period from 10th March to 20th March 2025. To reduce outside variations and provide uniformity, the photos were taken under precisely controlled lighting. This article offers a major dataset for researchers to develop effective quality assessment models for post-harvesting fruit sorting and classification. Convolutional neural networks (CNNs) and other computer vision models can be trained exclusively using this dataset to increase the precision of agricultural product bruise recognition. The dataset can facilitate research in computer vision-based agricultural monitoring and fruit quality evaluation. It is openly accessible on Mendeley Data as 'JujubeBruiseNet: A Dataset for Bruise Detection in Ziziphus mauritiana'.</div></div>","PeriodicalId":10973,"journal":{"name":"Data in Brief","volume":"62 ","pages":"Article 112031"},"PeriodicalIF":1.4,"publicationDate":"2025-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145044274","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data in Brief | Pub Date: 2025-09-01 | DOI: 10.1016/j.dib.2025.112017
Julian Hartje, Abu Zar Shafiullah
{"title":"Validated dataset combining simulations and measurements for emission analysis of naturally ventilated dairy barns","authors":"Julian Hartje, Abu Zar Shafiullah","doi":"10.1016/j.dib.2025.112017","DOIUrl":"10.1016/j.dib.2025.112017","url":null,"abstract":"<div><div>Quantifying emissions from naturally ventilated livestock buildings is challenging due to the large side wall openings. In addition, measurement campaigns are expensive and time-consuming and are therefore limited to a few short measurement weeks during the year. However, emission factors or annual averages are extrapolated from these datasets. Simulations can complement such a dataset by extending it and thus broadening the basis for the extrapolation of emission factors or evaluation of the barn and management system. The dataset presented consists of solution data from computational fluid dynamics (CFD) simulations of naturally ventilated cattle barns and the corresponding simulation and geometry files. The simulations were validated using datasets from measurement campaigns in three naturally ventilated cattle barns in Germany. Together with weather data from the German Weather Service (DWD), weather situations that occurred outside the measurement weeks could be investigated. With the presented dataset, further investigations are possible. Together with the measured data, simulation techniques, data aggregation and the development of new numerical modelling approaches can be investigated in detail.</div></div>","PeriodicalId":10973,"journal":{"name":"Data in Brief","volume":"62 ","pages":"Article 112017"},"PeriodicalIF":1.4,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145019056","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data in Brief | Pub Date: 2025-09-01 | DOI: 10.1016/j.dib.2025.112020
Carlos Cambra, Félix Movilla, Félix de Miguel, Daniel Urda, Nuria Velasco, Álvaro Herrero
{"title":"A real-world IIoT dataset for predictive maintenance of metalworking fluids","authors":"Carlos Cambra, Félix Movilla, Félix de Miguel, Daniel Urda, Nuria Velasco, Álvaro Herrero","doi":"10.1016/j.dib.2025.112020","DOIUrl":"10.1016/j.dib.2025.112020","url":null,"abstract":"<div><div>This article presents a multivariate time series dataset detailing the physicochemical degradation of an industrial metalworking fluid (MWF). The data were collected continuously over several months from a test tank under typical operational conditions at an industrial facility in Spain. Four critical variables were monitored using industrial-grade sensors: pH, temperature, concentration, and conductivity. The dataset is provided in five CSV files. The primary file, measures.csv, contains the preprocessed time series at a uniform 5-minute frequency, with authentic missing data gaps intentionally preserved to reflect real-world sensor and connectivity issues. The four additional files serve as a comprehensive benchmark for data imputation algorithms. Each of these benchmark files corresponds to a single variable and includes the original data alongside imputed values generated by five distinct methods: K-Nearest Neighbours (KNN), a hybrid model (HybridKCL), an LSTM-based Variational Autoencoder (LSTM-VAE), and both pre-trained and fine-tuned versions of the MOMENT foundation model. This resource enables researchers and practitioners to develop, validate, and compare predictive maintenance models, anomaly detection systems, and advanced imputation techniques. Furthermore, it serves as a valuable educational tool for addressing common challenges in industrial IoT data, fostering advancements in sustainable and efficient manufacturing.</div></div>","PeriodicalId":10973,"journal":{"name":"Data in Brief","volume":"62 ","pages":"Article 112020"},"PeriodicalIF":1.4,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145019055","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data in Brief | Pub Date: 2025-09-01 | DOI: 10.1016/j.dib.2025.112019
Anu Lillak , Tim Thompson , Mari Tõrv , Ester Oras
{"title":"FTIR spectroscopy and VSC-based colour assessment dataset for comparative analysis of cremated bones","authors":"Anu Lillak , Tim Thompson , Mari Tõrv , Ester Oras","doi":"10.1016/j.dib.2025.112019","DOIUrl":"10.1016/j.dib.2025.112019","url":null,"abstract":"<div><div>The data presented in this article derives from archaeological cremated bones excavated in 2014–2015 at Aakre Kivivare tarand cemetery, S Estonia. The material covers bone fragments of different colours to be assessed visually, using Video Spectral Comparator (VSC) and analysed comparatively with Fourier Transform Infrared Spectroscopy (FTIR) to determine the structural and compositional changes in the thermally altered bone and implications of the latter in bone colouring.</div><div>The dataset comprises FTIR spectra measurements, colour spectra measured with VSC and visually assessed colour of human and animal bones chosen for the study. This dataset is expected to be a comparative source for determining archaeological cremated bone colour induced by heat-related changes in the bone microstructure, supporting the visual estimations of temperature-based cremation practices in archaeological and forensic bone material in the future.</div></div>","PeriodicalId":10973,"journal":{"name":"Data in Brief","volume":"62 ","pages":"Article 112019"},"PeriodicalIF":1.4,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145003778","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}