{"title":"Ensuring High-Quality Private Data for Responsible Data Science","authors":"D. Srivastava, M. Scannapieco, T. Redman","doi":"10.1145/3287168","DOIUrl":"https://doi.org/10.1145/3287168","url":null,"abstract":"High-quality data is critical for effective data science. As the use of data science has grown, so too have concerns that individuals’ rights to privacy will be violated. This has led to the development of data protection regulations around the globe and the use of sophisticated anonymization techniques to protect privacy. Such measures make it more challenging for the data scientist to understand the data, exacerbating issues of data quality. Responsible data science aims to develop useful insights from the data while fully embracing these considerations. We pose the high-level problem in this article, “How can a data scientist develop the needed trust that private data has high quality?” We then identify a series of challenges for various data-centric communities and outline research questions for data quality and privacy researchers, which would need to be addressed to effectively answer the problem posed in this article.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"1 1","pages":"1 - 9"},"PeriodicalIF":0.0,"publicationDate":"2019-01-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89728219","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Financial Regulatory and Risk Management Challenges Stemming from Firm-Specific Digital Misinformation","authors":"K. Casey, K. Casey","doi":"10.1145/3274655","DOIUrl":"https://doi.org/10.1145/3274655","url":null,"abstract":"Event studies are the primary methodology used to test market efficiency. Researchers identify an “event” and test for stock price reaction around that event. For example, Shelor et al. [14] find that insurance company stock prices reacted positively after the 1989 Loma Prieta earthquake. The positive reaction was due to the increased demand for earthquake insurance following this event. Numerous other event studies find stock price reactions to the release of new information. These include Asquith and Mullins (dividend initiation [1]), Fields and Janjigian (Chernobyl nuclear accident [6]), Fields et al. (new regulation [7]) and countless others. Financial asset prices are so responsive to new information that one particular scam preys on this fact. The classic “pump-and-dump” scam involves the creation and spread of false firm-specific information after taking an appropriate position in the firm’s stock. For example, Scam Artist (SA) buys 1,000 shares of Company A stock. After the purchase, SA creates a “hot tip” about new information that will cause Company A’s stock to skyrocket in value. The false information pushes naïve investors to buy Company A stock and push the price higher. SA then sells his Company A stock for a profit. While this scam is illegal, it is also difficult to detect. A recent detected pump-and-dump example includes one reported by McClatchy [10] in which a man created hundreds of Internet identities to post fraudulent stock tips about 20 small-cap firms. He was convicted of fraudulently earning $870,000 in the scheme. According to Leuz et al. [9] the practice remains prevalent. Their study suggests that almost 6% of all active investors participate in “at least one ‘pumpand dump’” scheme with an average loss of 30% of invested funds. Another example is hackers creating a fake Associated Press tweet about a White House attack that injured","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"14 1","pages":"1 - 4"},"PeriodicalIF":0.0,"publicationDate":"2019-01-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76780382","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Crowdsourced Targeted Feedback Collection for Multicriteria Data Source Selection","authors":"Julio César Cortés Ríos, N. Paton, A. Fernandes, Edward Abel, J. Keane","doi":"10.1145/3284934","DOIUrl":"https://doi.org/10.1145/3284934","url":null,"abstract":"A multicriteria data source selection (MCSS) scenario identifies, from a set of candidate data sources, the subset that best meets users’ needs. These needs are expressed using several criteria, which are used to evaluate the candidate data sources. An MCSS problem can be solved using multidimensional optimization techniques that trade off the different objectives. Sometimes one may have uncertain knowledge regarding how well the candidate data sources meet the criteria. In order to overcome this uncertainty, one may rely on end-users or crowds to annotate the data items produced by the sources in relation to the selection criteria. In this article, a proposed Targeted Feedback Collection (TFC) approach is introduced that aims to identify those data items on which feedback should be collected, thereby providing evidence on how the sources satisfy the required criteria. The proposed TFC targets feedback by considering the confidence intervals around the estimated criteria values, with a view to increasing the confidence in the estimates that are most relevant to the multidimensional optimization. Variants of the proposed TFC approach have been developed for use where feedback is expected to be reliable (e.g., where it is provided by trusted experts) and where feedback is expected to be unreliable (e.g., from crowd workers). Both variants have been evaluated, and positive results are reported against other approaches to feedback collection, including active learning, in experiments that involve real-world datasets and crowdsourcing.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"46 1","pages":"1 - 27"},"PeriodicalIF":0.0,"publicationDate":"2019-01-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90774608","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Combining Similarity Features and Deep Representation Learning for Stance Detection in the Context of Checking Fake News","authors":"Luís Borges, Bruno Martins, P. Calado","doi":"10.1145/3287763","DOIUrl":"https://doi.org/10.1145/3287763","url":null,"abstract":"Fake news is nowadays an issue of pressing concern, given its recent rise as a potential threat to high-quality journalism and well-informed public discourse. The Fake News Challenge (FNC-1) was organized in early 2017 to encourage the development of machine-learning-based classification systems for stance detection (i.e., for identifying whether a particular news article agrees, disagrees, discusses, or is unrelated to a particular news headline), thus helping in the detection and analysis of possible instances of fake news. This article presents a novel approach to tackle this stance detection problem, based on the combination of string similarity features with a deep neural network architecture that leverages ideas previously advanced in the context of learning-efficient text representations, document classification, and natural language inference. Specifically, we use bi-directional Recurrent Neural Networks (RNNs), together with max-pooling over the temporal/sequential dimension and neural attention, for representing (i) the headline, (ii) the first two sentences of the news article, and (iii) the entire news article. These representations are then combined/compared, complemented with similarity features inspired on other FNC-1 approaches, and passed to a final layer that predicts the stance of the article toward the headline. We also explore the use of external sources of information, specifically large datasets of sentence pairs originally proposed for training and evaluating natural language inference methods to pre-train specific components of the neural network architecture (e.g., the RNNs used for encoding sentences). The obtained results attest to the effectiveness of the proposed ideas and show that our model, particularly when considering pre-training and the combination of neural representations together with similarity features, slightly outperforms the previous state of the art.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"11 1","pages":"1 - 26"},"PeriodicalIF":0.0,"publicationDate":"2018-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87067814","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"To Clean or Not to Clean","authors":"Dwaipayan Roy, Mandar Mitra, Debasis Ganguly","doi":"10.1145/3242180","DOIUrl":"https://doi.org/10.1145/3242180","url":null,"abstract":"Web document collections such as WT10G, GOV2, and ClueWeb are widely used for text retrieval experiments. Documents in these collections contain a fair amount of non-content-related markup in the form of tags, hyperlinks, and so on. Published articles that use these corpora generally do not provide specific details about how this markup information is handled during indexing. However, this question turns out to be important: Through experiments, we find that including or excluding metadata in the index can produce significantly different results with standard IR models. More importantly, the effect varies across models and collections. For example, metadata filtering is found to be generally beneficial when using BM25, or language modeling with Dirichlet smoothing, but can significantly reduce retrieval effectiveness if language modeling is used with Jelinek-Mercer smoothing. We also observe that, in general, the performance differences become more noticeable as the amount of metadata in the test collections increase. Given this variability, we believe that the details of document preprocessing are significant from the point of view of reproducibility. In a second set of experiments, we also study the effect of preprocessing on query expansion using RM3. In this case, once again, we find that it is generally better to remove markup before using documents for query expansion.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"18 6","pages":"1 - 25"},"PeriodicalIF":0.0,"publicationDate":"2018-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91468524","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reproducible Web Corpora","authors":"Johannes Kiesel, Florian Kneist, Milad Alshomary, Benno Stein, Matthias Hagen, Martin Potthast","doi":"10.1145/3239574","DOIUrl":"https://doi.org/10.1145/3239574","url":null,"abstract":"The evolution of web pages from static HTML pages toward dynamic pieces of software has rendered archiving them increasingly difficult. Nevertheless, an accurate, reproducible web archive is a necessity to ensure the reproducibility of web-based research. Archiving web pages reproducibly, however, is currently not part of best practices for web corpus construction. As a result, and despite the ongoing efforts of other stakeholders to archive the web, tools for the construction of reproducible web corpora are insufficient or ill-fitted. This article presents a new tool tailored to this purpose. It relies on emulating user interactions with a web page while recording all network traffic. The customizable user interactions can be replayed on demand, while requests sent by the archived page are served with the recorded responses. The tool facilitates reproducible user studies, user simulations, and evaluations of algorithms that rely on extracting data from web pages. To evaluate our tool, we conduct the first systematic assessment of reproduction quality for rendered web pages. Using our tool, we create a corpus of 10,000 web pages carefully sampled from the Common Crawl and manually annotated with regard to reproduction quality via crowdsourcing. Based on this data, we test three approaches to automatic reproduction-quality assessment. An off-the-shelf neural network, trained on visual differences between the web page during archiving and reproduction, matches the manual assessments best. This automatic assessment of reproduction quality allows for immediate bugfixing during archiving and continuous development of our tool as the web continues to evolve.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"24 1","pages":"1 - 25"},"PeriodicalIF":0.0,"publicationDate":"2018-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84614179","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Introduction to the Special Issue on Reproducibility in Information Retrieval","authors":"N. Ferro, N. Fuhr, A. Rauber","doi":"10.1145/3268410","DOIUrl":"https://doi.org/10.1145/3268410","url":null,"abstract":"Information Retrieval (IR) is a discipline that has been strongly rooted in experimentation since its inception. Experimental evaluation has always been a strong driver for IR research and innovation, and these activities have been shaped by large-scale evaluation campaigns such as Text REtrieval Conference (TREC) in the U.S., Conference and Labs of the Evaluation Forum (CLEF) in Europe, NII Testbeds and Community for Information access Research (NTCIR) in Japan and Asia, and Forum for Information Retrieval Evaluation (FIRE) in India. IR systems are getting increasingly complex. They need to cross language and media barriers; they span from unstructured, via semi-structured, to highly structured data; and they are faced with diverse, complex, and frequently underspecified (ambiguously specified) information needs, search tasks, and societal challenges. As a consequence, evaluation and experimentation, which has remained a fundamental element, has in turn become increasingly sophisticated and challenging.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"76 1","pages":"1 - 4"},"PeriodicalIF":0.0,"publicationDate":"2018-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83408038","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Anserini","authors":"Peilin Yang, Hui Fang, Jimmy J. Lin","doi":"10.1145/3239571","DOIUrl":"https://doi.org/10.1145/3239571","url":null,"abstract":"This work tackles the perennial problem of reproducible baselines in information retrieval research, focusing on bag-of-words ranking models. Although academic information retrieval researchers have a long history of building and sharing systems, they are primarily designed to facilitate the publication of research papers. As such, these systems are often incomplete, inflexible, poorly documented, difficult to use, and slow, particularly in the context of modern web-scale collections. Furthermore, the growing complexity of modern software ecosystems and the resource constraints most academic research groups operate under make maintaining open-source systems a constant struggle. However, except for a small number of companies (mostly commercial web search engines) that deploy custom infrastructure, Lucene has become the de facto platform in industry for building search applications. Lucene has an active developer base, a large audience of users, and diverse capabilities to work with heterogeneous collections at scale. However, it lacks systematic support for ad hoc experimentation using standard test collections. We describe Anserini, an information retrieval toolkit built on Lucene that fills this gap. Our goal is to simplify ad hoc experimentation and allow researchers to easily reproduce results with modern bag-of-words ranking models on diverse test collections. With Anserini, we demonstrate that Lucene provides a suitable framework for supporting information retrieval research. Experiments show that our system efficiently indexes large web collections, provides modern ranking models that are on par with research implementations in terms of effectiveness, and supports low-latency query evaluation to facilitate rapid experimentation","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"1 1","pages":"1 - 20"},"PeriodicalIF":0.0,"publicationDate":"2018-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78839552","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evaluation-as-a-Service for the Computational Sciences","authors":"F. Hopfgartner, A. Hanbury, H. Müller, Ivan Eggel, K. Balog, Torben Brodt, G. Cormack, Jimmy J. Lin, Jayashree Kalpathy-Cramer, N. Kando, Makoto P. Kato, Anastasia Krithara, Tim Gollub, Martin Potthast, E. Viegas, Simon Mercer","doi":"10.1145/3239570","DOIUrl":"https://doi.org/10.1145/3239570","url":null,"abstract":"Evaluation in empirical computer science is essential to show progress and assess technologies developed. Several research domains such as information retrieval have long relied on systematic evaluation to measure progress: here, the Cranfield paradigm of creating shared test collections, defining search tasks, and collecting ground truth for these tasks has persisted up until now. In recent years, however, several new challenges have emerged that do not fit this paradigm very well: extremely large data sets, confidential data sets as found in the medical domain, and rapidly changing data sets as often encountered in industry. Crowdsourcing has also changed the way in which industry approaches problem-solving with companies now organizing challenges and handing out monetary awards to incentivize people to work on their challenges, particularly in the field of machine learning. This article is based on discussions at a workshop on Evaluation-as-a-Service (EaaS). EaaS is the paradigm of not providing data sets to participants and have them work on the data locally, but keeping the data central and allowing access via Application Programming Interfaces (API), Virtual Machines (VM), or other possibilities to ship executables. The objectives of this article are to summarize and compare the current approaches and consolidate the experiences of these approaches to outline the next steps of EaaS, particularly toward sustainable research infrastructures. The article summarizes several existing approaches to EaaS and analyzes their usage scenarios and also the advantages and disadvantages. The many factors influencing EaaS are summarized, and the environment in terms of motivations for the various stakeholders, from funding agencies to challenge organizers, researchers and participants, to industry interested in supplying real-world problems for which they require solutions. EaaS solves many problems of the current research environment, where data sets are often not accessible to many researchers. Executables of published tools are equally often not available making the reproducibility of results impossible. EaaS, however, creates reusable/citable data sets as well as available executables. Many challenges remain, but such a framework for research can also foster more collaboration between researchers, potentially increasing the speed of obtaining research results.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"37 1","pages":"1 - 32"},"PeriodicalIF":0.0,"publicationDate":"2018-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81102885","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Introduction to the Special Issue on Reproducibility in Information Retrieval","authors":"N. Ferro, N. Fuhr, A. Rauber","doi":"10.1145/3268408","DOIUrl":"https://doi.org/10.1145/3268408","url":null,"abstract":"Information Retrieval (IR) is a discipline that has been strongly rooted in experimentation since its inception. Experimental evaluation has always been a strong driver for IR research and innovation, and these activities have been shaped by large-scale evaluation campaigns such as Text REtrieval Conference (TREC) in the US, Conference and Labs of the Evaluation Forum (CLEF) in Europe, NII Testbeds and Community for Information access Research (NTCIR) in Japan and Asia, and Forum for Information Retrieval Evaluation (FIRE) in India. IR systems are becoming increasingly complex. They need to cross language and media barriers; they span from unstructured, via semi-structured, to highly structured data; and they are faced with diverse, complex, and frequently underspecified (ambiguously specified) information needs, search tasks, and societal challenges. As a consequence, evaluation and experimentation, which has remained a fundamental element, has in turn become increasingly sophisticated and challenging.","PeriodicalId":15582,"journal":{"name":"Journal of Data and Information Quality (JDIQ)","volume":"72 1","pages":"1 - 4"},"PeriodicalIF":0.0,"publicationDate":"2018-10-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77178502","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}