{"title":"Editorial: Special Issue on Data Transparency—Data Quality, Annotation, and Provenance","authors":"M. Barhamgi, E. Bertino","doi":"10.1145/3494454","DOIUrl":"https://doi.org/10.1145/3494454","url":null,"abstract":"Advances in Artificial Intelligence (AI) and mobile and Internet technologies have been progressively reshaping our lives over the past few years. The applications of the Internet of Things and cyber-physical systems today touch almost all aspects of our daily lives, including healthcare (e.g., remote patient monitoring environments), leisure (e.g., smart entertainment spaces), and work (e.g., smart manufacturing, asset management). For many of us, social media have become the rule rather than the exception as the way to interact, socialize, and exchange information. AI-powered systems have become a reality and started to affect our lives in important ways. These systems and services collect huge amounts of data about us and exploit it for various purposes that could positively or negatively affect our lives. Even though most of these systems claim to abide by data protection regulations and ethics, data misuse incidents keep making the headlines. In this new digital world, data transparency for end users is becoming a fundamental aspect to consider when designing, implementing, and deploying a system, service, or software [1, 3, 4]. Transparency allows users to track and follow how their data are collected, transmitted, stored, processed, exploited, and serviced. It also allows them to verify how fairly they are treated by the algorithms, software, and systems that affect their lives. Data transparency is a complex concept that is interpreted and approached in different ways by different research communities and bodies. A comprehensive definition of data transparency is proposed by Bertino et al.
as “the ability of subjects to effectively gain access to all information related to data used in processes and decisions that affect the subjects” [2].","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"14 1","pages":"1 - 3"},"PeriodicalIF":0.0,"publicationDate":"2022-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85165385","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Challenge Paper: The Vision for Time Profiled Temporal Association Mining","authors":"V. Radhakrishna, G. Reddy, Puligadda Veereswara Kumar, V. Janaki","doi":"10.1145/3404198","DOIUrl":"https://doi.org/10.1145/3404198","url":null,"abstract":"Ecommerce has been the market disruptor in the modern world. Organizations have been focusing on mining enormous amounts of data to identify trends and extract crucial information from it. Data is collected in databases that are transactional in nature. Millions of transactions are collected in a temporal context. Also, organizations are transitioning towards NoSQL databases. The transactions are distributed into timeslots. Such transaction data is called timestamped temporal data. Along with the focus on mining the temporal data, extracting patterns, trends, and information, the implicit focus is also on building an efficient algorithm that is accurate while reducing the time taken, memory consumed, and computational effort to scan the database. Temporal associations discovered from timestamped temporal datasets [1, 2] are known as time profiled temporal patterns. From an application perspective, time profiled temporal","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"28 1","pages":"1 - 8"},"PeriodicalIF":0.0,"publicationDate":"2021-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78174680","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Editorial: Special Issue on Quality Assessment and Management in Big Data—Part I","authors":"Shadi A. Aljawarneh, J. Lara","doi":"10.1145/3449052","DOIUrl":"https://doi.org/10.1145/3449052","url":null,"abstract":"It is a pleasure for us to introduce this Special Issue on Quality Assessment and Management in Big Data, Part I—Journal of Data and Information Quality, ACM. We received 27 original submissions, from which 11 final papers were selected for publication after a rigorous peer-review process; the issue is divided into two parts. This editorial corresponds to Part I, in which we included papers related to machine learning and quality management in big data scenarios. In the era of big data [1], organizations are dealing with tremendous amounts of data, which are fast-moving and can originate from various sources, such as social networks [2], unstructured data from various websites [3], or raw feeds from sensors [4]. Big data solutions are used to optimize business processes and reduce decision-making times, so as to improve operational effectiveness. Big data practitioners are experiencing a huge number of data quality problems [5]. These can be time-consuming to solve or can even lead to incorrect data analytics. Managing quality in big data has become challenging, and thus far research has addressed only limited aspects of the problem.
Given the complex nature of big data, traditional data quality management approaches cannot simply be applied to big data quality management.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"48 1","pages":"1 - 3"},"PeriodicalIF":0.0,"publicationDate":"2021-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74263919","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Developing a Global Data Breach Database and the Challenges Encountered","authors":"Nelson Novaes Neto, S. Madnick, A. Paula, Natasha Malara Borges","doi":"10.1145/3439873","DOIUrl":"https://doi.org/10.1145/3439873","url":null,"abstract":"If the mantra “data is the new oil” of our digital economy is correct, then data leak incidents are the critical disasters of online society. The initial goal of our research was to present a comprehensive database of data breaches of personal information that took place in 2018 and 2019. This information was to be drawn from press reports, industry studies, and reports from regulatory agencies across the world. This article identified the top 430 largest data breach incidents among more than 10,000 data breach incidents. In the process, we encountered many complications, especially regarding the lack of standardization of reporting. This article should be especially interesting to the readers of JDIQ because it describes both the range of data quality and consistency issues found as well as what was learned from the database created. The database that was created, available at https://www.databreachdb.com, shows that the number of data records breached in those top 430 incidents increased from around 4B in 2018 to more than 22B in 2019. This increase occurred despite the strong efforts from regulatory agencies across the world to enforce strict rules on data protection and privacy, such as the General Data Protection Regulation (GDPR) that went into effect in Europe in May 2018. Such regulatory effort could explain why such a large number of data breach cases are reported in the European Union compared to the U.S. (more than 10,000 data breaches publicly reported in the U.S. since 2018, while the EU reported more than 160,000 data breaches since May 2018). However, we still face the problem of an excessive number of breach incidents around the world.
This research helps to understand the challenges of proper visibility of such incidents on a global scale. The results of this research can help government entities, regulatory bodies, security and data quality researchers, companies, and managers to improve the data quality of data breach reporting and increase the visibility of the data breach landscape around the world in the future.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"64 1","pages":"1 - 33"},"PeriodicalIF":0.0,"publicationDate":"2021-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80141319","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Knowledge Transfer for Entity Resolution with Siamese Neural Networks","authors":"M. Loster, Ioannis K. Koumarelas, Felix Naumann","doi":"10.1145/3410157","DOIUrl":"https://doi.org/10.1145/3410157","url":null,"abstract":"The integration of multiple data sources is a common problem in a large variety of applications. Traditionally, handcrafted similarity measures are used to discover, merge, and integrate multiple representations of the same entity—duplicates—into a large homogeneous collection of data. Often, these similarity measures do not cope well with the heterogeneity of the underlying dataset. In addition, domain experts are needed to manually design and configure such measures, which is both time-consuming and requires extensive domain expertise. We propose a deep Siamese neural network, capable of learning a similarity measure that is tailored to the characteristics of a particular dataset. Owing to the properties of deep learning methods, we are able to eliminate the manual feature engineering process and thus considerably reduce the effort required for model construction. In addition, we show that it is possible to transfer knowledge acquired during the deduplication of one dataset to another, and thus significantly reduce the amount of data required to train a similarity measure. We evaluated our method on multiple datasets and compared our approach to state-of-the-art deduplication methods. Our approach outperforms competitors by up to +26 percent F-measure, depending on task and dataset.
In addition, we show that knowledge transfer is not only feasible, but in our experiments led to an improvement in F-measure of up to +4.7 percent.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"75 1","pages":"1 - 25"},"PeriodicalIF":0.0,"publicationDate":"2021-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72682060","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Experience","authors":"Michela Fazzolari, F. Buccafurri, G. Lax, M. Petrocchi","doi":"10.1145/3439307","DOIUrl":"https://doi.org/10.1145/3439307","url":null,"abstract":"Over the past few years, online reviews have become very important, since they can influence the purchase decisions of consumers and the reputation of businesses. Therefore, the practice of writing fake reviews can have severe consequences for customers and service providers. Various approaches have been proposed for detecting opinion spam in online reviews, especially based on supervised classifiers. In this contribution, we start from a set of effective features used for classifying opinion spam and re-engineer them by considering the Cumulative Relative Frequency Distribution of each feature. Through an experimental evaluation carried out on real data from Yelp.com, we show that the distributional features improve the performance of the classifiers.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"38 1","pages":"1 - 16"},"PeriodicalIF":0.0,"publicationDate":"2021-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80055136","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep Entity Matching","authors":"Yuliang Li, Jinfeng Li, Yoshihiko Suhara, Jin Wang, Wataru Hirota, W. Tan","doi":"10.1145/3431816","DOIUrl":"https://doi.org/10.1145/3431816","url":null,"abstract":"Entity matching refers to the task of determining whether two different representations refer to the same real-world entity. It continues to be a prevalent problem for many organizations where data resides in different sources and duplicates need to be identified and managed. The term “entity matching” also loosely refers to the broader problem of determining whether two heterogeneous representations of different entities should be associated. This problem has an even wider scope of applications, from determining the subsidiaries of companies to matching jobs to job seekers, which has impactful consequences. In this article, we first report our recent system DITTO, which is an example of a modern entity matching system based on pretrained language models. Then we summarize recent solutions in applying deep learning and pre-trained language models for solving the entity matching task.
Finally, we discuss research directions beyond entity matching, including the promise of synergistically integrating blocking and entity matching steps together, the need to examine methods to alleviate steep training data requirements that are typical of deep learning or pre-trained language models, and the importance of generalizing entity matching solutions to handle the broader entity matching problem, which leads to an even more pressing need to explain matching outcomes.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"3 1","pages":"1 - 17"},"PeriodicalIF":0.0,"publicationDate":"2021-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89557564","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BLAST2","authors":"D. Beneventano, S. Bergamaschi, Luca Gagliardelli, Giovanni Simonini","doi":"10.1145/3394957","DOIUrl":"https://doi.org/10.1145/3394957","url":null,"abstract":"We present BLAST2, a novel technique to efficiently extract loose schema information, i.e., metadata that can serve as a surrogate of the schema alignment task within the Entity Resolution (ER) process, to identify records that refer to the same real-world entity when integrating multiple, heterogeneous, and voluminous data sources. The loose schema information, which can be extracted from both structured and unstructured data sources, is exploited to reduce the overall complexity of ER, whose naïve solution would imply O(n²) comparisons, where n is the number of entity representations involved in the process. BLAST2 is completely unsupervised yet able to achieve almost the same precision and recall as supervised state-of-the-art schema alignment techniques when employed for Entity Resolution tasks, as shown in our experimental evaluation performed on two real-world datasets (composed of 7 and 10 data sources, respectively).","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"5 3 1","pages":"1 - 22"},"PeriodicalIF":0.0,"publicationDate":"2020-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83475980","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RuleHub","authors":"N. Ahmadi, Thi-Thuy-Duyen Truong, Le-Hong-Mai Dao, Stefano Ortona, Paolo Papotti","doi":"10.1145/3409384","DOIUrl":"https://doi.org/10.1145/3409384","url":null,"abstract":"Entity-centric knowledge graphs (KGs) are now popular for collecting facts about entities. KGs have rich schemas with a large number of different types and predicates to describe the entities and their relationships. On these rich schemas, logical rules are used to represent dependencies between the data elements. While rules are useful in query answering, data curation, and other tasks, they usually do not come with the KGs. Such rules have to be manually defined or discovered with the help of rule mining methods. We believe this rule-collection task should be done collectively to better capitalize on our understanding of the data and to avoid redundant work conducted on the same KGs. For this reason, we introduce RuleHub, our extensible corpus of rules for public KGs. RuleHub provides functionalities for the archival and retrieval of rules to all users, with an extensible architecture that does not constrain the KG or the type of rules supported. We are populating the corpus with thousands of rules from the most popular KGs and report on our experiments on automatically characterizing the quality of a rule with statistical measures.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"1 1","pages":"1 - 22"},"PeriodicalIF":0.0,"publicationDate":"2020-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89710693","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Incremental Discovery of Imprecise Functional Dependencies","authors":"Loredana Caruccio, Stefano Cirillo","doi":"10.1145/3397462","DOIUrl":"https://doi.org/10.1145/3397462","url":null,"abstract":"Functional dependencies (fds) are one of the metadata used to assess data quality and to perform data cleaning operations. However, to pursue robustness with respect to data errors, it has been necessary to devise imprecise versions of functional dependencies, yielding relaxed functional dependencies (rfds). Among them, there exists the class of rfds relaxing on the extent, i.e., those admitting the possibility that an fd holds on a subset of data. In the literature, several algorithms to automatically discover rfds from big data collections have been defined. They achieve good performances with respect to the inherent problem complexity. However, most of them are capable of discovering rfds only by batch processing the entire dataset. This is not suitable in the era of big data, where the size of a database instance can grow with high velocity, and the insertion of new data can invalidate previously holding rfds. Thus, it is necessary to devise incremental discovery algorithms capable of updating the set of holding rfds upon data insertions, without processing the entire dataset. To this end, in this article we propose an incremental discovery algorithm for rfds relaxing on the extent. It manages the validation of candidate rfds and the generation of possibly new rfd candidates upon the insertion of new tuples, while limiting the size of the overall search space.
Experimental results show that the proposed algorithm achieves extremely good performances on real-world datasets.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"55 1","pages":"1 - 25"},"PeriodicalIF":0.0,"publicationDate":"2020-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84757638","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}