{"title":"Viral pneumonia images classification by Multiple Instance Learning: preliminary results","authors":"E. Zumpano, A. Fuduli, E. Vocaturo, Matteo Avolio","doi":"10.1145/3472163.3472170","DOIUrl":"https://doi.org/10.1145/3472163.3472170","url":null,"abstract":"At the end of 2019, the World Health Organization (WHO) referred that the Public Health Commission of Hubei Province, China, reported cases of severe and unknown pneumonia, characterized by fever, malaise, dry cough, dyspnoea and respiratory failure, which occurred in the urban area of Wuhan. A new coronavirus, SARS-CoV-2, was identified as responsible for the lung infection, now called COVID-19 (coronavirus disease 2019). Since then there has been an exponential growth of infections and at the beginning of March 2020 the WHO declared the epidemic a global emergency. An early diagnosis of those carrying the virus becomes crucial to contain the spread, morbidity and mortality of the pandemic. The definitive diagnosis is made through specific tests, among which imaging tests play an important role in the care path of the patient with suspected or confirmed COVID-19. Patients with serious COVID-19 typically experience viral pneumonia. In this paper we launch the idea to use the Multiple Instance Learning paradigm to classify pneumonia X-ray images, considering three different classes: radiographies of healthy people, radiographies of people with bacterial pneumonia and of people with viral pneumonia. The proposed algorithms, which are very fast in practice, appear promising especially if we take into account that no preprocessing technique has been used.","PeriodicalId":242683,"journal":{"name":"Proceedings of the 25th International Database Engineering & Applications Symposium","volume":"83 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124945308","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Rigorous Measurement Model for Validity of Big Data: MEGA Approach","authors":"Dave Bhardwaj, O. Ormandjieva","doi":"10.1145/3472163.3472171","DOIUrl":"https://doi.org/10.1145/3472163.3472171","url":null,"abstract":"Big Data is becoming a substantial part of the decision-making processes in both industry and academia, especially in areas where Big Data may have a profound impact on businesses and society. However, as more data is being processed, data quality is becoming a genuine issue that negatively affects credibility of the systems we build because of the lack of visibility and transparency of the underlying data. Therefore, Big Data quality measurement is becoming increasingly necessary in assessing whether data can serve its purpose in a particular context (such as Big Data analytics, for example). This research addresses Big Data quality measurement modelling and automation by proposing a novel quality measurement framework for Big Data (MEGA) that objectively assesses the underlying quality characteristics of Big Data (also known as the V's of Big Data) at each step of the Big Data Pipelines. Five of the Big Data V's (Volume, Variety, Velocity, Veracity and Validity) are currently automated by the MEGA framework. In this paper, a new theoretically valid quality measurement model is proposed for an essential quality characteristic of Big Data, called Validity. The proposed measurement information model for Validity of Big Data is a hierarchy of 4 derived measures / indicators and 5 based measures. Validity measurement is illustrated on a running example.","PeriodicalId":242683,"journal":{"name":"Proceedings of the 25th International Database Engineering & Applications Symposium","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114934789","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Measuring Quality of Workers by Goodness-of-Fit of Machine Learning Model in Crowdsourcing","authors":"Yumiko Suzuki","doi":"10.1145/3472163.3472279","DOIUrl":"https://doi.org/10.1145/3472163.3472279","url":null,"abstract":"In this paper, we propose a method for predicting the quality of crowdsourcing workers using the goodness-of-fit (GoF) of machine learning models. We assume a relationship between the quality of workers and the quality of machine-learning models using the outcomes of the workers as training data. This assumption means that if worker quality is high, a machine-learning classifier constructed using the worker’s outcomes can easily predict the outcomes of the worker. If this assumption is confirmed, we can measure the worker quality without using the correct answer sets, and then the requesters can reduce the time and effort. However, if the outcomes by workers are low quality, the input tweet does not correspond to the outcomes. Therefore, if we construct a tweet classifier using input tweets and the classified results by the worker, the prediction of the outcomes by the classifier and that by the workers should differ. We assume that the GoF scores, such as accuracy and F1 scores of the test set using this classifier, correlates to worker quality. Therefore, we can predict worker quality using the GoF scores. In our experiment, we did the tweet classification task using crowdsourcing. We confirmed that the GoF scores and the quality of workers correlate. These results show that we can predict the quality of workers using the GoF scores.","PeriodicalId":242683,"journal":{"name":"Proceedings of the 25th International Database Engineering & Applications Symposium","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121529149","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Categorical Management of Multi-Model Data","authors":"I. Holubová, Pavel Contos, M. Svoboda","doi":"10.1145/3472163.3472166","DOIUrl":"https://doi.org/10.1145/3472163.3472166","url":null,"abstract":"In this vision paper, we introduce an idea of a framework that would enable us to model, represent, and manage multi-model data in a unified and abstract way. Its core idea exploits constructs provided by category theory, which is sufficiently general but still simple enough to cover any of the logical data models used in contemporary databases. Focusing on promising features and taking into account mature and verified principles, we overview the key parts of the framework and outline open questions and research directions that need to be further investigated. The ultimate objective is to pursue the idea of a self-tuning system that would permit us to collapse the traditionally understood conceptual and logical layers into just a single model allowing for unified handling of schemas, data instances, as well as queries.","PeriodicalId":242683,"journal":{"name":"Proceedings of the 25th International Database Engineering & Applications Symposium","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130784257","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Colonization of the Internet","authors":"B. Desai","doi":"10.1145/3472163.3472179","DOIUrl":"https://doi.org/10.1145/3472163.3472179","url":null,"abstract":"The internet was introduced to connect computers and allow communication between these computers. It evolved to provide applications such as email, talk and file sharing with the associated system to search. The files were made available, freely, by users. However, the internet was out of the reach of most people since it required equipment and know-how as well as connection to a computer on the internet. One method of connection used an acoustic coupler and an analog phone. With the introduction of the personal computer and higher speed modems, accessing the internet became easier. The introduction of user-friendly graphical interfaces, as well as the convenience and portablility of laptops and smartphones made the internet much more widely accessible for a broad swath of users. A small number of newly established companies, supported by a large amount of venture capital and a lack of regulation have since established a stranglehold on the internet with billions of people using these applications. Their monopolistic practices and exploitation of the open nature of the internet has created a need in the ordinary person to replace the traditional way of communication with what they provide: in exchange for giving up personal information these persons have become dependent on the service provided. Due to the regulatory desert around privacy and ownership of personal electornic data, a handful of massive corporations have expropriated and exploited aggregated and disaggregated personal information. This amounts, we argue, to the colonization of the internet.","PeriodicalId":242683,"journal":{"name":"Proceedings of the 25th International Database Engineering & Applications Symposium","volume":"29 5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116349038","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Looking for Jobs? Matching Adults with Autism with Potential Employers for Job Opportunities","authors":"Joseph Thomas Bills, Yiu-Kai Ng","doi":"10.1145/3472163.3472270","DOIUrl":"https://doi.org/10.1145/3472163.3472270","url":null,"abstract":"Adults with autism face many difficulties when finding employment, such as struggling with interviews and needing accommodating environments for sensory issues. Autistic adults, however, also have unique skills to contribute to the workplace that companies have recently started to seek after, such as loyalty, close attention to detail, and trustworthiness. To work around these difficulties and help companies find the talent they are looking for we have developed a job-matching system. Our system is based around the stable matching of the Gale-Shapley algorithm to match autistic adults with employers after estimating how both adults with autism and employers would rank the other group. The system also uses filtering to approximate a stable matching even with a changing pool of users and employers, meaning the results are resistant to change as the result of competition. Such a system would be of benefit to both adults with autism and employers and would advance knowledge in recommender systems that match two parties.","PeriodicalId":242683,"journal":{"name":"Proceedings of the 25th International Database Engineering & Applications Symposium","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127969115","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Analysis-oriented Metadata for Data Lakes","authors":"Yan Zhao, F. Ravat, Julien Aligon, C. Soulé-Dupuy, Gabriel Ferrettini, I. Megdiche","doi":"10.1145/3472163.3472273","DOIUrl":"https://doi.org/10.1145/3472163.3472273","url":null,"abstract":"Data lakes are supposed to enable analysts to perform more efficient and efficacious data analysis by crossing multiple existing data sources, processes and analyses. However, it is impossible to achieve that when a data lake does not have a metadata governance system that progressively capitalizes on all the performed analysis experiments. The objective of this paper is to have an easily accessible, reusable data lake that capitalizes on all user experiences. To meet this need, we propose an analysis-oriented metadata model for data lakes. This model includes the descriptive information of datasets and their attributes, as well as all metadata related to the machine learning analyzes performed on these datasets. To illustrate our metadata solution, we implemented an application of data lake metadata management. This application allows users to find and use existing data, processes and analyses by searching relevant metadata stored in a NoSQL data store within the data lake. To demonstrate how to easily discover metadata with the application, we present two use cases, with real data, including datasets similarity detection and machine learning guidance.","PeriodicalId":242683,"journal":{"name":"Proceedings of the 25th International Database Engineering & Applications Symposium","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121057433","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Customized Eager-Lazy Data Cleansing for Satisfactory Big Data Veracity","authors":"S. Sahri, Rim Moussa","doi":"10.1145/3472163.3472195","DOIUrl":"https://doi.org/10.1145/3472163.3472195","url":null,"abstract":"Big data systems are becoming mainstream for big data management either for batch processing or real-time processing. In order to extract insights from data, quality issues are very important to address, particularly. A veracity assessment model is consequently needed. In this paper, we propose a model which ties quality of datasets and quality of query resultsets. We particularly examine quality issues raised by a given dataset, order attributes along their fitness for use and correlate veracity metrics to business queries. We validate our work using the open dataset NYC taxi’ trips.","PeriodicalId":242683,"journal":{"name":"Proceedings of the 25th International Database Engineering & Applications Symposium","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123778167","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"COVID-19 Concerns in US: Topic Detection in Twitter","authors":"C. Comito","doi":"10.1145/3472163.3472169","DOIUrl":"https://doi.org/10.1145/3472163.3472169","url":null,"abstract":"COVID-19 pandemic is affecting the lives of the citizens worldwide. Epidemiologists, policy makers and clinicians need to understand public concerns and sentiment to make informed decisions and adopt preventive and corrective measures to avoid critical situations. In the last few years, social media become a tool for spreading the news, discussing ideas and comments on world events. In this context, social media plays a key role since represents one of the main source to extract insight into public opinion and sentiment. In particular, Twitter has been already recognized as an important source of health-related information, given the amount of news, opinions and information that is shared by both citizens and official sources. However, it is a challenging issue identifying interesting and useful content from large and noisy text-streams. The study proposed in the paper aims to extract insight from Twitter by detecting the most discussed topics regarding COVID-19. The proposed approach combines peak detection and clustering techniques. Tweets features are first modeled as time series. After that, peaks are detected from the time series, and peaks of textual features are clustered based on the co-occurrence in the tweets. Results, performed over real-world datasets of tweets related to COVID-19 in US, show that the proposed approach is able to accurately detect several relevant topics of interest, spanning from health status and symptoms, to government policy, economic crisis, COVID-19-related updates, prevention, vaccines and treatments.","PeriodicalId":242683,"journal":{"name":"Proceedings of the 25th International Database Engineering & Applications Symposium","volume":"93 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131487415","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Explainable Data Analytics for Disease and Healthcare Informatics","authors":"C. Leung, Daryl L. X. Fung, Daniel Mai, Qi Wen, Jason Tran, Joglas Souza","doi":"10.1145/3472163.3472175","DOIUrl":"https://doi.org/10.1145/3472163.3472175","url":null,"abstract":"With advancements in technology, huge volumes of valuable data have been generated and collected at a rapid velocity from a wide variety of rich data sources. Examples of these valuable data include healthcare and disease data such as privacy-preserving statistics on patients who suffered from diseases like the coronavirus disease 2019 (COVID-19). Analyzing these data can be for social good. For instance, data analytics on the healthcare and disease data often leads to the discovery of useful information and knowledge about the disease. Explainable artificial intelligence (XAI) further enhances the interpretability of the discovered knowledge. Consequently, the explainable data analytics helps people to get a better understanding of the disease, which may inspire them to take part in preventing, detecting, controlling and combating the disease. In this paper, we present an explainable data analytics system for disease and healthcare informatics. Our system consists of two key components. The predictor component analyzes and mines historical disease and healthcare data for making predictions on future data. Although huge volumes of disease and healthcare data have been generated, volumes of available data may vary partially due to privacy concerns. So, the predictor makes predictions with different methods. It uses random forest With sufficient data and neural network-based few-shot learning (FSL) with limited data. The explainer component provides the general model reasoning and a meaningful explanation for specific predictions. As a database engineering application, we evaluate our system by applying it to real-life COVID-19 data. Evaluation results show the practicality of our system in explainable data analytics for disease and healthcare informatics.","PeriodicalId":242683,"journal":{"name":"Proceedings of the 25th International Database Engineering & Applications Symposium","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133687269","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}