D. Krivoguz, S. Chernyi, Elena Zinchenko, Artem Silkin, A. Zinchenko
{"title":"Using Landsat-5 for Accurate Historical LULC Classification: A Comparison of Machine Learning Models","authors":"D. Krivoguz, S. Chernyi, Elena Zinchenko, Artem Silkin, A. Zinchenko","doi":"10.3390/data8090138","DOIUrl":"https://doi.org/10.3390/data8090138","url":null,"abstract":"This study investigates the application of various machine learning models for land use and land cover (LULC) classification in the Kerch Peninsula. The study utilizes archival field data, cadastral data, and published scientific literature for model training and testing, using Landsat-5 imagery from 1990 as input data. Four machine learning models (deep neural network, Random Forest, support vector machine (SVM), and AdaBoost) are employed, and their hyperparameters are tuned using random search and grid search. Model performance is evaluated through cross-validation and confusion matrices. The deep neural network achieves the highest accuracy (96.2%) and performs well in classifying water, urban lands, open soils, and high vegetation. However, it faces challenges in classifying grasslands, bare lands, and agricultural areas. The Random Forest model achieves an accuracy of 90.5% but struggles with differentiating high vegetation from agricultural lands. The SVM model achieves an accuracy of 86.1%, while the AdaBoost model performs worst, with an accuracy of 58.4%. The novel contributions of this study include the comparison and evaluation of multiple machine learning models for land use classification in the Kerch Peninsula. The deep neural network and Random Forest models outperform SVM and AdaBoost in terms of accuracy. However, the use of limited data sources such as cadastral data and scientific articles may introduce limitations and potential errors. Future research should consider incorporating field studies and additional data sources for improved accuracy. 
This study provides valuable insights for land use classification, facilitating the assessment and management of natural resources in the Kerch Peninsula. The findings contribute to informed decision-making processes and lay the groundwork for further research in the field.","PeriodicalId":55580,"journal":{"name":"Atomic Data and Nuclear Data Tables","volume":"59 1","pages":""},"PeriodicalIF":1.8,"publicationDate":"2023-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84280879","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"物理与天体物理","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. Horvat, G. Gledec, Tomislav Jagušt, Z. Kalafatić
{"title":"Knowledge Graph Dataset for Semantic Enrichment of Picture Description in NAPS Database","authors":"M. Horvat, G. Gledec, Tomislav Jagušt, Z. Kalafatić","doi":"10.3390/data8090136","DOIUrl":"https://doi.org/10.3390/data8090136","url":null,"abstract":"This data description introduces a comprehensive knowledge graph (KG) dataset with detailed information about the relevant high-level semantics of visual stimuli used to induce emotional states stored in the Nencki Affective Picture System (NAPS) repository. The dataset contains 6808 systematically manually assigned annotations for 1356 NAPS pictures in 5 categories, linked to WordNet synsets and Suggested Upper Merged Ontology (SUMO) concepts presented in a tabular format. Both knowledge databases provide an extensive and supervised taxonomy glossary suitable for describing picture semantics. The annotation glossary consists of 935 WordNet and 513 SUMO entities. A description of the dataset and the specific processes used to collect, process, review, and publish the dataset as open data are also provided. This dataset is unique in that it captures complex objects, scenes, actions, and the overall context of emotional stimuli with knowledge taxonomies at a high level of quality. It provides a valuable resource for a variety of projects investigating emotion, attention, and related phenomena. In addition, researchers can use this dataset to explore the relationship between emotions and high-level semantics or to develop data-retrieval tools to generate personalized stimuli sequences. 
The dataset is freely available in common formats (Excel and CSV).","PeriodicalId":55580,"journal":{"name":"Atomic Data and Nuclear Data Tables","volume":"4 1","pages":""},"PeriodicalIF":1.8,"publicationDate":"2023-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90351238","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"物理与天体物理","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing Small Tabular Clinical Trial Dataset through Hybrid Data Augmentation: Combining SMOTE and WCGAN-GP","authors":"Winston Wang, Tun-Wen Pai","doi":"10.3390/data8090135","DOIUrl":"https://doi.org/10.3390/data8090135","url":null,"abstract":"This study addressed the challenge of training generative adversarial networks (GANs) for data augmentation on small tabular clinical trial datasets, which are known to be difficult to train on because of their limited sample sizes. To overcome this obstacle, a hybrid approach is proposed: the synthetic minority oversampling technique (SMOTE) first augments the original data to a more substantial size, which improves the subsequent training of a Wasserstein conditional generative adversarial network with gradient penalty (WCGAN-GP), a model known for its state-of-the-art performance and enhanced stability. The ultimate objective of this research was to demonstrate that the synthetic tabular data generated by the final WCGAN-GP model maintains the structural integrity and statistical representation of the original small dataset under this hybrid approach. This focus is particularly relevant for clinical trials, where limited data availability due to privacy concerns and restricted accessibility to subject enrollment pose common challenges. Despite the limited data, the findings demonstrate that the hybrid approach successfully generates synthetic data that closely preserves the characteristics of the original small dataset. By harnessing the power of this hybrid approach to generate faithful synthetic data, the potential for enhancing data-driven research in drug clinical trials becomes evident. 
This includes enabling a robust analysis on small datasets, supplementing the lack of clinical trial data, facilitating its utility in machine learning tasks, even extending to using the model for anomaly detection to ensure better quality control during clinical trial data collection, all while prioritizing data privacy and implementing strict data protection measures.","PeriodicalId":55580,"journal":{"name":"Atomic Data and Nuclear Data Tables","volume":"5 1","pages":""},"PeriodicalIF":1.8,"publicationDate":"2023-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83725294","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"物理与天体物理","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
T. C. C. Nepomuceno, K. Nepomuceno, Fabiano Carlos da Silva, Silas Garrido Teixeira de Carvalho Santos
{"title":"Quantifying Webpage Performance: A Comparative Analysis of TCP/IP and QUIC Communication Protocols for Improved Efficiency","authors":"T. C. C. Nepomuceno, K. Nepomuceno, Fabiano Carlos da Silva, Silas Garrido Teixeira de Carvalho Santos","doi":"10.3390/data8080134","DOIUrl":"https://doi.org/10.3390/data8080134","url":null,"abstract":"Browsing is a prevalent activity on the World Wide Web, and users usually expect fast information retrieval and seamless transactions. This article presents a comprehensive performance evaluation of the most frequently accessed webpages in recent years using Data Envelopment Analysis (DEA) adapted to the context (inverse DEA), comparing their performance under two distinct communication protocols: TCP/IP and QUIC. To assess performance disparities, parametric and non-parametric hypothesis tests are employed to investigate the appropriateness of each website’s communication protocols. We provide data on the inputs, outputs, and efficiency scores for 82 out of the world’s top 100 most-accessed websites, describing how experiments and analyses were conducted. The evaluation yields quantitative metrics pertaining to the technical efficiency of the websites and efficient benchmarks for best practices. Nine websites are considered efficient from the point of view of at least one of the communication protocols. Under TCP/IP, about 80.5% of all units (66 webpages) need to reduce their page load time by more than 50% to be competitive, while under the QUIC protocol this share is 28.05% (23 webpages). 
In addition, results suggest that TCP/IP protocol has an unfavorable effect on the overall distribution of inefficiencies.","PeriodicalId":55580,"journal":{"name":"Atomic Data and Nuclear Data Tables","volume":"1 1","pages":""},"PeriodicalIF":1.8,"publicationDate":"2023-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80034243","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"物理与天体物理","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"VR Traffic Dataset on Broad Range of End-User Activities","authors":"Marina Polupanova","doi":"10.3390/data8080132","DOIUrl":"https://doi.org/10.3390/data8080132","url":null,"abstract":"With the emergence of new internet traffic types in modern transport networks, it has become critical for service providers to understand the structure of that traffic and predict peaks of that load for planning infrastructure expansion. Several studies have investigated traffic parameters for Virtual Reality (VR) applications. Still, most of them test only a partial range of user activities during a limited time interval. This work creates a dataset of captures from a broader spectrum of VR activities performed with a Meta Quest 2 headset, with the duration of each real residential user session recorded for at least half an hour. The newly collected data show that some gaming VR traffic activities have a high share of uplink traffic and require symmetric user links. We also found that the gaming phase of the overall gameplay is more sensitive to reductions in channel resources than the higher-bitrate game launch phase. Hence, we recommend it as a source of traffic distribution for channel sizing model creation. From the gaming phase, capture intervals of more than 100 s contain the most representative information for modeling activity.","PeriodicalId":55580,"journal":{"name":"Atomic Data and Nuclear Data Tables","volume":"39 1","pages":""},"PeriodicalIF":1.8,"publicationDate":"2023-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85947086","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"物理与天体物理","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
F. Bagheri, Diego Reforgiato Recupero, Espen Sirnes
{"title":"Leveraging Return Prediction Approaches for Improved Value-at-Risk Estimation","authors":"F. Bagheri, Diego Reforgiato Recupero, Espen Sirnes","doi":"10.3390/data8080133","DOIUrl":"https://doi.org/10.3390/data8080133","url":null,"abstract":"Value at risk (VaR) is a statistic used to anticipate the largest possible losses over a specific time frame and within some level of confidence, usually 95% or 99%. For risk management and regulators, it offers a solution for trustworthy quantitative risk management tools. VaR has become the most widely used and accepted indicator of downside risk. Today, commercial banks and financial institutions utilize it as a tool to estimate the size and probability of upcoming losses in portfolios and, as a result, to estimate and manage the degree of risk exposure. The goal is to obtain the average number of VaR “failures” or “breaches” (losses that exceed the VaR) as near to the target rate as possible. It is also desired that the losses be as evenly distributed as possible. VaR can be modeled in a variety of ways. The simplest method is to estimate volatility based on prior returns according to the assumption that volatility is constant. Otherwise, the volatility process can be modeled using the GARCH model. Machine learning techniques have been used in recent years to carry out stock market forecasts based on historical time series. A machine learning system is often trained on an in-sample dataset, where it can adjust and improve specific hyperparameters in accordance with the underlying metric. The trained model is tested on an out-of-sample dataset. We compared the baselines for the VaR estimation of a day (d) according to different metrics (i) to their respective variants that included stock return forecast information of d and stock return data of the days before d and (ii) to a GARCH model that included return prediction information of d and stock return data of the days before d. 
Various strategies such as ARIMA and a proposed ensemble of regressors have been employed to predict stock returns. We observed that the versions of the univariate techniques and GARCH integrated with return predictions outperformed the baselines in four different marketplaces.","PeriodicalId":55580,"journal":{"name":"Atomic Data and Nuclear Data Tables","volume":"47 1","pages":""},"PeriodicalIF":1.8,"publicationDate":"2023-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86412668","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"物理与天体物理","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A. Bogoyavlenskiy, M. Alexyuk, A. Sadanov, V. Berezin, L. Trenozhnikova, G. Baymakhanova
{"title":"Draft Genome Sequence Data of Streptomyces anulatus, Strain K-31","authors":"A. Bogoyavlenskiy, M. Alexyuk, A. Sadanov, V. Berezin, L. Trenozhnikova, G. Baymakhanova","doi":"10.3390/data8080131","DOIUrl":"https://doi.org/10.3390/data8080131","url":null,"abstract":"Streptomyces anulatus is a typical representative of the Streptomyces genus synthesizing a large number of biologically active compounds. In this study, the draft genome of Streptomyces anulatus, strain K-31, is presented, assembled from Illumina reads with the SPAdes software. The size of the assembled genome was 8.548838 Mb. Annotation of the S. anulatus genome assembly identified 7749 genes, including 7149 protein-coding genes and 92 RNA genes. This genome will be helpful to further understand Streptomyces genetics and evolution and can be useful for obtaining biologically active compounds.","PeriodicalId":55580,"journal":{"name":"Atomic Data and Nuclear Data Tables","volume":"77 1","pages":""},"PeriodicalIF":1.8,"publicationDate":"2023-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88915154","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"物理与天体物理","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A. Bottrighi, Marco Guazzone, G. Leonardi, S. Montani, Manuel Striani, P. Terenziani
{"title":"Towards Action-State Process Model Discovery","authors":"A. Bottrighi, Marco Guazzone, G. Leonardi, S. Montani, Manuel Striani, P. Terenziani","doi":"10.3390/data8080130","DOIUrl":"https://doi.org/10.3390/data8080130","url":null,"abstract":"Process model discovery covers the different methodologies used to mine a process model from traces of process executions, and it has an important role in artificial intelligence research. Current approaches in this area, with a few exceptions, focus on determining a model of the flow of actions only. However, in several contexts, (i) restricting the attention to actions is quite limiting, since the effects of such actions also have to be analyzed, and (ii) traces provide additional pieces of information in the form of states (i.e., values of parameters possibly affected by the actions); for instance, in several medical domains, the traces include both actions and measurements of patient parameters. In this paper, we propose AS-SIM (Action-State SIM), the first approach able to mine a process model that comprehends two distinct classes of nodes, to capture both actions and states.","PeriodicalId":55580,"journal":{"name":"Atomic Data and Nuclear Data Tables","volume":"14 1","pages":""},"PeriodicalIF":1.8,"publicationDate":"2023-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79106935","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"物理与天体物理","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Liliya A. Demidova, Peter N. Sovietov, E. Andrianova, Anna A. Demidova
{"title":"Anomaly Detection in Student Activity in Solving Unique Programming Exercises: Motivated Students against Suspicious Ones","authors":"Liliya A. Demidova, Peter N. Sovietov, E. Andrianova, Anna A. Demidova","doi":"10.3390/data8080129","DOIUrl":"https://doi.org/10.3390/data8080129","url":null,"abstract":"This article presents a dataset containing messages from the Digital Teaching Assistant (DTA) system, which records the results from the automatic verification of students’ solutions to unique programming exercises of 11 various types. These results are automatically generated by the system, which automates a massive Python programming course at MIREA—Russian Technological University (RTU MIREA). The DTA system is trained to distinguish between approaches to solve programming exercises, as well as to identify correct and incorrect solutions, using intelligent algorithms responsible for analyzing the source code in the DTA system using vector representations of programs based on Markov chains, calculating pairwise Jensen–Shannon distances for programs and using a hierarchical clustering algorithm to detect high-level approaches used by students in solving unique programming exercises. In the process of learning, each student must correctly solve 11 unique exercises in order to receive admission to the intermediate certification in the form of a test. In addition, a motivated student may try to find additional approaches to solve exercises they have already solved. At the same time, not all students are able or willing to solve the 11 unique exercises proposed to them; some will resort to outside help in solving all or part of the exercises. Since all information about the interactions of the students with the DTA system is recorded, it is possible to identify different types of students. 
First of all, the students can be classified into 2 classes: those who failed to solve the 11 exercises and those who received admission to the intermediate certification in the form of a test, having solved the 11 unique exercises correctly. However, it is possible to identify classes of typical, motivated and suspicious students among the latter group based on the proposed dataset. The proposed dataset can be used to develop regression models that predict bursts of student activity when interacting with the DTA system, to solve clustering problems that identify groups of students with a similar behavior model in the learning process, and to develop intelligent data classifiers that predict the students’ behavior model and draw appropriate conclusions, not only at the end of the learning process but also during the course of it, in order to motivate all students, even those who are classified as suspicious. It can also be used to visualize the results of the learning process using various tools.","PeriodicalId":55580,"journal":{"name":"Atomic Data and Nuclear Data Tables","volume":"49 1","pages":""},"PeriodicalIF":1.8,"publicationDate":"2023-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90927233","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"物理与天体物理","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mateo Cano-Solis, J. Ballesteros, John W. Branch-Bedoya
{"title":"VEPL Dataset: A Vegetation Encroachment in Power Line Corridors Dataset for Semantic Segmentation of Drone Aerial Orthomosaics","authors":"Mateo Cano-Solis, J. Ballesteros, John W. Branch-Bedoya","doi":"10.3390/data8080128","DOIUrl":"https://doi.org/10.3390/data8080128","url":null,"abstract":"Vegetation encroachment in power line corridors poses multiple problems for modern energy-dependent societies. Failures due to the contact between power lines and vegetation can result in power outages and millions of dollars in losses. To address this problem, UAVs have emerged as a promising solution due to their ability to quickly and affordably monitor long corridors through autonomous or remotely piloted flights. However, the extensive manual task of analyzing every image acquired by the UAVs in search of vegetation encroachment has led many authors to propose the use of Deep Learning to automate the detection process. Despite the advantages of using a combination of UAV imagery and Deep Learning, there is currently a lack of datasets that help to train Deep Learning models for this specific problem. This paper presents a dataset for the semantic segmentation of vegetation encroachment in power line corridors. RGB orthomosaics were obtained for a rural road area using a commercial UAV. The dataset is composed of pairs of tessellated RGB images taken from the orthomosaic and corresponding multi-color masks representing three different classes: vegetation, power lines, and the background. A detailed description of the image acquisition process is provided, as well as the labeling task and the data augmentation techniques, among other relevant details to produce the dataset. 
Researchers would benefit from using the proposed dataset by developing and improving strategies for vegetation encroachment monitoring using UAVs and Deep Learning.","PeriodicalId":55580,"journal":{"name":"Atomic Data and Nuclear Data Tables","volume":"20 1","pages":""},"PeriodicalIF":1.8,"publicationDate":"2023-08-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84437962","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"物理与天体物理","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}