{"title":"Student Academic Success Prediction Using Learning Management Multimedia Data With Convoluted Features and Ensemble Model","authors":"Abdullah Al-Ameri, Waleed Al-Shammari, Aniello Castiglione, Michele Nappi, Chiara Pero, Muhammad Umer","doi":"10.1145/3687268","DOIUrl":"https://doi.org/10.1145/3687268","url":null,"abstract":"Predicting students’ academic success is crucial for educational institutions to provide targeted support and interventions to those at risk of underperforming. With the increasing adoption of digital learning management systems (LMS), there has been a surge in multimedia data, opening new avenues for predictive analytics in education. Anticipating students’ academic performance can serve as an early alert system for those facing potential failure, enabling institutions to intervene proactively. This study proposes leveraging features extracted from a convolutional neural network (CNN) in conjunction with machine learning models to enhance predictive accuracy. This approach obviates the need for manual feature extraction and yields superior outcomes compared to using machine learning and deep learning models independently. Initially, nine machine learning models are applied to both the original and the convoluted features. The top-performing individual models, a support vector machine (SVM) and a random forest (RF), are then combined into an ensemble model for academic performance prediction. The efficacy of the proposed method is validated against existing models, demonstrating its superior performance. With an accuracy of 97.88% and precision, recall, and F1 scores of 98%, the proposed approach attains outstanding results in forecasting student academic success. This study contributes to the burgeoning field of predictive analytics in education by showcasing the effectiveness of leveraging multimedia data from learning management systems with convoluted features and ensemble modeling techniques.","PeriodicalId":517209,"journal":{"name":"Journal of Data and Information Quality","volume":"10 11","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141920507","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Active Learning for Data Quality Control: A Survey","authors":"Na Li, Yiyang Qi, Chaoran Li, Zhiming Zhao","doi":"10.1145/3663369","DOIUrl":"https://doi.org/10.1145/3663369","url":null,"abstract":"Data quality plays a vital role in scientific research and decision-making across industries. It is therefore crucial to incorporate a data quality control (DQC) process, which comprises various actions and operations to detect and correct data errors. The increasing adoption of machine learning (ML) techniques in different domains has raised concerns about data quality in the ML field. On the other hand, ML’s capability to uncover complex patterns makes it suitable for addressing challenges involved in the DQC process. However, supervised learning methods demand abundant labeled data, while unsupervised learning methods rely heavily on the underlying distribution of the data. Active learning (AL) provides a promising solution by proactively selecting data points for inspection, thus reducing the burden of data labeling for domain experts. This survey therefore focuses on applying AL to DQC. Starting with a review of common data quality issues and solutions in the ML field, we aim to enhance the understanding of current quality assessment methods. We then present two scenarios, pool-based and stream-based, to illustrate the adoption of AL into DQC systems on the anomaly detection task. Finally, we outline the remaining challenges and research opportunities in this field.","PeriodicalId":517209,"journal":{"name":"Journal of Data and Information Quality","volume":"6 23","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140988485","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data Validation Utilizing Expert Knowledge and Shape Constraints","authors":"F. Bachinger, Lisa Ehrlinger, G. Kronberger, Wolfram Wöß","doi":"10.1145/3661826","DOIUrl":"https://doi.org/10.1145/3661826","url":null,"abstract":"Data validation is a primary concern in any data-driven application, as undetected data errors may negatively affect machine learning models and lead to suboptimal decisions. Data quality issues are usually detected manually by experts, which becomes infeasible and uneconomical for large volumes of data. To enable automated data validation, we propose “shape constraint-based data validation”, a novel approach based on machine learning that incorporates expert knowledge in the form of shape constraints. Shape constraints can be used to describe expected (multivariate and nonlinear) patterns in valid data, and enable the detection of invalid data which deviates from these expected patterns. Our approach can be divided into two steps: (1) shape-constrained prediction models are trained on data, and (2) their training error is analyzed to identify invalid data. The training error can be used as an indicator for invalid data because shape-constrained models can fit valid data better than invalid data. We evaluate the approach on a benchmark suite consisting of synthetic datasets, which we have published for benchmarking similar data validation approaches. Additionally, we demonstrate the capabilities of the proposed approach with a real-world dataset consisting of measurements from a friction test bench in an industrial setting. Our approach detects subtle data errors that are difficult to identify even for domain experts.","PeriodicalId":517209,"journal":{"name":"Journal of Data and Information Quality","volume":"1163","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140988959","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Editorial: Special Issue on Human in the Loop Data Curation","authors":"Gianluca Demartini, Shazia Sadiq, Jie Yang","doi":"10.1145/3650209","DOIUrl":"https://doi.org/10.1145/3650209","url":null,"abstract":"This Special Issue of the Journal of Data and Information Quality (JDIQ) contains novel theoretical and methodological contributions on data curation involving humans in the loop. In this editorial, we summarize the scope of the issue and briefly describe its content.","PeriodicalId":517209,"journal":{"name":"Journal of Data and Information Quality","volume":" 6","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-03-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140210008","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Editor-in-Chief (June 2017–November 2023) Farewell Report","authors":"Tiziana Catarci","doi":"10.1145/3651229","DOIUrl":"https://doi.org/10.1145/3651229","url":null,"abstract":"","PeriodicalId":517209,"journal":{"name":"Journal of Data and Information Quality","volume":" 21","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-03-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140210958","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Connected Components for Scaling Partial-Order Blocking to Billion Entities","authors":"Tobias Backes, Stefan Dietze","doi":"10.1145/3646553","DOIUrl":"https://doi.org/10.1145/3646553","url":null,"abstract":"In entity resolution, blocking pre-partitions data for further processing by more expensive methods. Two entity mentions are in the same block if they share identical or related blocking-keys. Previous work has sometimes related blocking keys by grouping or alphabetically sorting them, but, as was shown for author disambiguation, the respective equivalences or total orders are not necessarily well-suited to model the logical matching-relation between blocking keys. To address this, we present a novel blocking approach that exploits the subset partial order over entity representations to build a matching-based bipartite graph, using connected components as blocks. To prevent over- and underconnectedness, we allow specification of overly general representations and generalization of overly specific ones. To build the bipartite graph, we contribute a new parallelized algorithm with a configurable time/space tradeoff for minimal-element search in the subset partial order. As a job-based approach, it combines dynamic scalability with easier integration, making it more convenient than previously described approaches. Experiments on large gold standards for publication records, author mentions, and affiliation strings suggest that our approach is competitive in performance and better able to address domain-specific problems. For duplicate detection and author disambiguation, our method matches the expected performance defined by the vector-similarity baseline used in prior work on the same dataset and by the common surname, first-initial baseline. For top-level institution resolution, we have reproduced the challenges described in prior work, strengthening the conclusion that for affiliation data, overlapping blocks under minimal elements are more suitable than connected components.","PeriodicalId":517209,"journal":{"name":"Journal of Data and Information Quality","volume":"10 6","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139958252","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cleenex: Support for User Involvement During an Iterative Data Cleaning Process","authors":"João L. M. Pereira, Manuel J. Fonseca, Antónia Lopes, H. Galhardas","doi":"10.1145/3648476","DOIUrl":"https://doi.org/10.1145/3648476","url":null,"abstract":"The existence of large amounts of data increases the probability of data quality problems occurring. A data cleaning process that corrects these problems is usually iterative, because it may need to be re-executed and refined to produce high-quality data. Moreover, due to the specificity of some data quality problems and the inability of data cleaning programs to cover all of them, a user often has to be involved during program executions by manually repairing data. However, no data cleaning framework appropriately supports this involvement, a form of human-in-the-loop, in such an iterative process for cleaning structured data. Furthermore, data preparation tools that do involve the user in data cleaning processes have not been evaluated with real users to assess the effort required. We therefore propose Cleenex, a data cleaning framework that supports user involvement during an iterative data cleaning process, and conducted two experimental evaluations: an assessment, with a simulated user, of the Cleenex components that support manual data repair, and a comparison, in terms of user involvement, of data preparation tools with real users. Results show that the Cleenex components reduce user effort when manually cleaning data during a data cleaning process; for example, the number of tuples visualized is reduced by 99%. Moreover, when performing data cleaning tasks with Cleenex, real users need less time and effort (e.g., half the clicks) and, based on questionnaires, prefer it to the other tools used for comparison, OpenRefine and Pentaho Data Integration.","PeriodicalId":517209,"journal":{"name":"Journal of Data and Information Quality","volume":"867","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139894342","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}