{"title":"Data Transparency and Fairness Analysis of the NYPD Stop-and-Frisk Program","authors":"Y. Badr, Rahul Sharma","doi":"10.1145/3460533","DOIUrl":"https://doi.org/10.1145/3460533","url":null,"abstract":"Given the increased concern of racial disparities in the stop-and-frisk programs, the New York Police Department (NYPD) requires publicly displaying detailed data for all the stops conducted by police authorities, including the suspected offense and race of the suspects. By adopting a public data transparency policy, it becomes possible to investigate racial biases in stop-and-frisk data and demonstrate the benefit of data transparency to approve or disapprove social beliefs and police practices. Thus, data transparency becomes a crucial need in the era of Artificial Intelligence (AI), where police and justice increasingly use different AI techniques not only to understand police practices but also to predict recidivism, crimes, and terrorism. In this study, we develop a predictive analytics method, including bias metrics and bias mitigation techniques to analyze the NYPD Stop-and-Frisk datasets and discover whether underlying bias patterns are responsible for stops and arrests. In addition, we perform a fairness analysis on two protected attributes, namely, the race and the gender, and investigate their impacts on arrest decisions. We also apply bias mitigation techniques. 
The experimental results show that the NYPD Stop-and-Frisk dataset is not biased toward colored and Hispanic individuals and thus law enforcement authorities can apply the bias predictive analytics method to inculcate more fair decisions before making any arrests.","PeriodicalId":299504,"journal":{"name":"ACM Journal of Data and Information Quality (JDIQ)","volume":"262 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124282483","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Experience: Analyzing Missing Web Page Visits and Unintentional Web Page Visits from the Client-side Web Logs","authors":"Che-Yun Hsu, Ting-Rui Chen, Hung-Hsuan Chen","doi":"10.1145/3490392","DOIUrl":"https://doi.org/10.1145/3490392","url":null,"abstract":"Web logs have been widely used to represent the web page visits of online users. However, we found that web logs in Chrome’s browsing history only record 57% of users’ visited websites, i.e., nearly half of a user’s website visits are not recorded. Additionally, 5.1% of the visits recorded in the web log occur because of unconscious user actions, i.e., these page visits are not initiated from users. We created a Google Chrome plugin and recruited users to install the plugin to collect and analyze the conscious URL visits, unconscious URL visits, and “missing” URL visits (i.e., the visits unrecorded in the traditional web log). We reported the statistics of these behaviors. We showed that sorting popular website categories based on traditional web logs differs from the rankings obtained when including missing visits or excluding unintentional visits. We predicted users’ future behaviors based on three types of training data – all the visits in modern web logs, the intentional visits in web logs, and the intentional visits plus missing visits in web logs. The experimental results indicate that missing visits in web logs may contain additional information, and unintentional visits in web logs may contain more noise than information for user modeling. 
Consequently, we need to be careful of the observations and conclusions derived from web log analyses because the web log data could be an incomplete and noisy dataset of a user’s visited web pages.","PeriodicalId":299504,"journal":{"name":"ACM Journal of Data and Information Quality (JDIQ)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128338159","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On the Anonymization of Workflow Provenance without Compromising the Transparency of Lineage","authors":"Khalid Belhajjame","doi":"10.1145/3460207","DOIUrl":"https://doi.org/10.1145/3460207","url":null,"abstract":"Workflows have been adopted in several scientific fields as a tool for the specification and execution of scientific experiments. In addition to automating the execution of experiments, workflow systems often include capabilities to record provenance information, which contains, among other things, data records used and generated by the workflow as a whole but also by its component modules. It is widely recognized that provenance information can be useful for the interpretation, verification, and re-use of workflow results, justifying its sharing and publication among scientists. However, workflow execution in some branches of science can manipulate sensitive datasets that contain information about individuals. To address this problem, we investigate, in this article, the problem of anonymizing the provenance of workflows. In doing so, we consider a popular class of workflows in which component modules use and generate collections of data records as a result of their invocation, as opposed to a single data record. The solution we propose offers guarantees of confidentiality without compromising lineage information, which provides transparency as to the relationships between the data records used and generated by the workflow modules. 
We provide algorithmic solutions that show how the provenance of a single module and an entire workflow can be anonymized and present the results of experiments that we conducted for their evaluation.","PeriodicalId":299504,"journal":{"name":"ACM Journal of Data and Information Quality (JDIQ)","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117251572","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Knowledge-Driven Data Ecosystems Toward Data Transparency","authors":"Sandra Geisler, Maria-Esther Vidal, C. Cappiello, Bernadette Farias Lóscio, A. Gal, M. Jarke, M. Lenzerini, P. Missier, B. Otto, E. Paja, B. Pernici, J. Rehof","doi":"10.1145/3467022","DOIUrl":"https://doi.org/10.1145/3467022","url":null,"abstract":"A data ecosystem (DE) offers a keystone-player or alliance-driven infrastructure that enables the interaction of different stakeholders and the resolution of interoperability issues among shared data. However, despite years of research in data governance and management, trustability is still affected by the absence of transparent and traceable data-driven pipelines. In this work, we focus on requirements and challenges that DEs face when ensuring data transparency. Requirements are derived from the data and organizational management, as well as from broader legal and ethical considerations. We propose a novel knowledge-driven DE architecture, providing the pillars for satisfying the analyzed requirements. We illustrate the potential of our proposal in a real-world scenario. Last, we discuss and rate the potential of the proposed architecture in the fulfillment of these requirements.","PeriodicalId":299504,"journal":{"name":"ACM Journal of Data and Information Quality (JDIQ)","volume":"121 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132404966","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Integration of Blockchain with Connected and Autonomous Vehicles: Vision and Challenge","authors":"T. Dargahi, Hossein Ahmadvand, M. Alraja, Chia-Mu Yu","doi":"10.1145/3460003","DOIUrl":"https://doi.org/10.1145/3460003","url":null,"abstract":"Connected and Autonomous Vehicles (CAVs) are introduced to improve individuals’ quality of life by offering a wide range of services. They collect a huge amount of data and exchange them with each other and the infrastructure. The collected data usually includes sensitive information about the users and the surrounding environment. Therefore, data security and privacy are among the main challenges in this industry. Blockchain, an emerging distributed ledger, has been considered by the research community as a potential solution for enhancing data security, integrity, and transparency in Intelligent Transportation Systems (ITS). However, despite the emphasis of governments on the transparency of personal data protection practices, CAV stakeholders have not been successful in communicating appropriate information with the end users regarding the procedure of collecting, storing, and processing their personal data, as well as the data ownership. This article provides a vision of the opportunities and challenges of adopting blockchain in ITS from the “data transparency” and “privacy” perspective. The main aim is to answer the following questions: (1) Considering the amount of personal data collected by the CAVs, such as location, how would the integration of blockchain technology affect transparency, fairness, and lawfulness of personal data processing concerning the data subjects (as this is one of the main principles in the existing data protection regulations)? 
(2) How can the trade-off between transparency and privacy be addressed in blockchain-based ITS use cases?","PeriodicalId":299504,"journal":{"name":"ACM Journal of Data and Information Quality (JDIQ)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124846859","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Estimating Degradation of Machine Learning Data Assets","authors":"Lara Mauri, E. Damiani","doi":"10.1145/3446331","DOIUrl":"https://doi.org/10.1145/3446331","url":null,"abstract":"Large-scale adoption of Artificial Intelligence and Machine Learning (AI-ML) models fed by heterogeneous, possibly untrustworthy data sources has spurred interest in estimating degradation of such models due to spurious, adversarial, or low-quality data assets. We propose a quantitative estimate of the severity of classifiers’ training set degradation: an index expressing the deformation of the convex hulls of the classes computed on a held-out dataset generated via an unsupervised technique. We show that our index is computationally light, can be calculated incrementally and complements well existing ML data assets’ quality measures. As an experimentation, we present the computation of our index on a benchmark convolutional image classifier.","PeriodicalId":299504,"journal":{"name":"ACM Journal of Data and Information Quality (JDIQ)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132173991","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Transparent Aspect-Level Sentiment Analysis Based on Dependency Syntax Analysis and Its Application on COVID-19","authors":"Bin Wang, Pengfei Guo, Xing Wang, Yongzhong He, Wei Wang","doi":"10.1145/3460002","DOIUrl":"https://doi.org/10.1145/3460002","url":null,"abstract":"Aspect-level sentiment analysis identifies fine-grained emotion for target words. There are three major issues in current models of aspect-level sentiment analysis. First, few models consider the natural language semantic characteristics of the texts. Second, many models consider the location characteristics of the target words, but ignore the relationships among the target words and among the overall sentences. Third, many models lack transparency in data collection, data processing, and results generating in sentiment analysis. In order to resolve these issues, we propose an aspect-level sentiment analysis model that combines a bidirectional Long Short-Term Memory (LSTM) network and a Graph Convolutional Network (GCN) based on Dependency syntax analysis (Bi-LSTM-DGCN). Our model integrates the dependency syntax analysis of the texts, and explicitly considers the natural language semantic characteristics of the texts. It further fuses the target words and overall sentences. Extensive experiments are conducted on four benchmark datasets, i.e., Restaurant14, Laptop, Restaurant16, and Twitter. The experimental results demonstrate that our model outperforms other models like Target-Dependent LSTM (TD-LSTM), Attention-based LSTM with Aspect Embedding (ATAE-LSTM), LSTM+SynATT+TarRep and Convolution over a Dependency Tree (CDT). Our model is further applied to aspect-level sentiment analysis on “government” and “lockdown” of 1,658,250 tweets about “#COVID-19” that we collected from March 1, 2020 to July 1, 2020. The experimental results show that Twitter users’ positive and negative sentiments fluctuated over time. 
Through the transparency analysis in data collection, data processing, and results generating, we discuss the reasons for the evolution of users’ emotions over time based on the tweets and on our models.","PeriodicalId":299504,"journal":{"name":"ACM Journal of Data and Information Quality (JDIQ)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130212851","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automated Annotations for AI Data and Model Transparency","authors":"Saravanan Thirumuruganathan, Mayuresh Kunjir, M. Ouzzani, S. Chawla","doi":"10.1145/3460000","DOIUrl":"https://doi.org/10.1145/3460000","url":null,"abstract":"The data and Artificial Intelligence revolution has had a massive impact on enterprises, governments, and society alike. It is fueled by two key factors. First, data have become increasingly abundant and are often available openly. Enterprises have more data than they can process. Governments are spearheading open data initiatives by setting up data portals such as data.gov and releasing large amounts of data to the public. Second, AI engineering development is becoming increasingly democratized. Open source frameworks have enabled even an individual developer to engineer sophisticated AI systems. But with such ease of use comes the potential for irresponsible use of data. Ensuring that AI systems adhere to a set of ethical principles is one of the major problems of our age. We believe that data and model transparency has a key role to play in mitigating the deleterious effects of AI systems. In this article, we describe a framework to synthesize ideas from various domains such as data transparency, data quality, data governance among others to tackle this problem. Specifically, we advocate an approach based on automated annotations (of both data and the AI model), which has a number of appealing properties. The annotations could be used by enterprises to get visibility of potential issues, prepare data transparency reports, create and ensure policy compliance, and evaluate the readiness of data for diverse downstream AI applications. We propose a model architecture and enumerate its key components that could achieve these requirements. 
Finally, we describe a number of interesting challenges and opportunities.","PeriodicalId":299504,"journal":{"name":"ACM Journal of Data and Information Quality (JDIQ)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121812801","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PoWareMatch: A Quality-aware Deep Learning Approach to Improve Human Schema Matching","authors":"Roee Shraga, A. Gal","doi":"10.1145/3483423","DOIUrl":"https://doi.org/10.1145/3483423","url":null,"abstract":"Schema matching is a core task of any data integration process. Being investigated in the fields of databases, AI, Semantic Web, and data mining for many years, the main challenge remains the ability to generate quality matches among data concepts (e.g., database attributes). In this work, we examine a novel angle on the behavior of humans as matchers, studying match creation as a process. We analyze the dynamics of common evaluation measures (precision, recall, and f-measure), with respect to this angle and highlight the need for unbiased matching to support this analysis. Unbiased matching, a newly defined concept that describes the common assumption that human decisions represent reliable assessments of schemata correspondences, is, however, not an inherent property of human matchers. In what follows, we design PoWareMatch that makes use of a deep learning mechanism to calibrate and filter human matching decisions adhering to the quality of a match, which are then combined with algorithmic matching to generate better match results. We provide an empirical evidence, established based on an experiment with more than 200 human matchers over common benchmarks, that PoWareMatch predicts well the benefit of extending the match with an additional correspondence and generates high-quality matches. 
In addition, PoWareMatch outperforms state-of-the-art matching algorithms.","PeriodicalId":299504,"journal":{"name":"ACM Journal of Data and Information Quality (JDIQ)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116026693","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Scoring System for Information Security Governance Framework Using Deep Learning Algorithms: A Case Study on the Banking Sector","authors":"Abeer A. Al Batayneh, Malik Qasaimeh, Raad S. Al-Qassas","doi":"10.1145/3418172","DOIUrl":"https://doi.org/10.1145/3418172","url":null,"abstract":"Cybercrime reports showed an increase in the number of attacks targeting financial institutions. Indeed, banks were the target of 30% of the total number of cyber-attacks. One of the recommended methods for driving the security challenges is to implement an Information Security Governance Framework (ISGF), a comprehensive practice that starts from the top management and ends with the smallest function in a bank. Although such initiatives are effective, they typically take years to achieve and require loads of resources, especially for larger banks or if there are multiple ISGFs available for the bank to choose. These implementation challenges showed the necessity of having a method for evaluating the adequacy of an ISGF for a bank. The research performed during the preparation of this article did not reveal any available structured evaluation method for an ISGF before its implementation. This chapter introduces a novel method for scoring an ISGF to assess its adequacy for a bank without implementing it. The suggested approach is based on ISGF decomposition and transformation into a survey that will be answered by security experts. 
The survey results were loaded into a Deep Learning Algorithm that produced a scoring model that could predict the adequacy of an ISGF for a bank with an accuracy of 75%.","PeriodicalId":299504,"journal":{"name":"ACM Journal of Data and Information Quality (JDIQ)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116602147","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}