{"title":"WSDM 2016 Workshop on the Ethics of Online Experimentation","authors":"Fernando Diaz, Solon Barocas","doi":"10.1145/2835776.2855117","DOIUrl":"https://doi.org/10.1145/2835776.2855117","url":null,"abstract":"Online experimentation is now a core and near-constant part of the operation of a production online service, such as a web search engine or social media service. These are large-scale experiments that involve research subjects often numbering in the hundreds of thousands and wide-ranging, computer-automated variations in experimental treatment. In some cases, the results of online experiments may be of use internally to optimize system performance (for example, a test may be conducted to help make web page layout decisions). In other cases, the results may be of academic interest (for example, an experiment may be conducted to test a hypothesis about human behavior). Because of their rapid deployment and broad impact, online experimentation systems provide an extremely valuable tool for scientists and engineers. Despite this statistical power, in some situations, an online experiment can raise difficult ethical questions. One only needs to revisit the conversations resulting from the Facebook emotional contagion experiment to understand that some experiments may, at the very least, warrant careful review before being conducted. Since this episode, scholarship published mainly in the qualitative research and information law communities indicates that this may not be an isolated incident. Ethical and legal problems probably arise in other online experiments, published or not. As experimentation platforms and users become easily accessible, scientists and practitioners may increasingly put the well-being and trust of end users at risk. In light of these concerns, organizations often review online experiments before they are actually conducted. In production settings, the review process might vary with respect to formality or standards across companies and even groups within companies. When intended or used for academic publication, experiments or data may have undergone inconsistent review processes, some implementing academic-style institutional review boards and others none at all. Although there is a suggestion that service providers are concerned about the wellbeing of end users, the community does not","PeriodicalId":20567,"journal":{"name":"Proceedings of the Ninth ACM International Conference on Web Search and Data Mining","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83421331","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
L. Ostroumova, P. Prokhorenkov, E. Samosvat, P. Serdyukov
{"title":"Publication Date Prediction through Reverse Engineering of the Web","authors":"L. Ostroumova, P. Prokhorenkov, E. Samosvat, P. Serdyukov","doi":"10.1145/2835776.2835796","DOIUrl":"https://doi.org/10.1145/2835776.2835796","url":null,"abstract":"In this paper, we focus on one of the most challenging tasks in temporal information retrieval: detection of a web page publication date. The natural approach to this problem is to find the publication date in the HTML body of a page. However, there are two fundamental problems with this approach. First, not all web pages contain the publication dates in their texts. Second, it is hard to distinguish the publication date among all the dates found in the page's text. The approach we suggest in this paper supplements methods of date extraction from the page's text with novel link-based methods of dating. Some of our link-based methods are based on a probabilistic model of the Web graph structure evolution, which relies on the publication dates of web pages as on its parameters. We use this model to estimate the publication dates of web pages: based only on the link structure currently observed, we perform a ``reverse engineering'' to reveal the whole process of the Web's evolution.","PeriodicalId":20567,"journal":{"name":"Proceedings of the Ninth ACM International Conference on Web Search and Data Mining","volume":"49 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91028562","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Querying and Tracking Influencers in Social Streams","authors":"Karthik Subbian, C. Aggarwal, J. Srivastava","doi":"10.1145/2835776.2835788","DOIUrl":"https://doi.org/10.1145/2835776.2835788","url":null,"abstract":"Influence analysis is an important problem in social network analysis due to its impact on viral marketing and targeted advertisements. Most of the existing influence analysis methods determine the influencers in a static network with an influence propagation model based on pre-defined edge propagation probabilities. However, none of these models can be queried to find influencers in both context and time-sensitive fashion from a streaming social data. In this paper, we propose an approach to maintain real-time influence scores of users in a social stream using a topic and time-sensitive approach, while the network and topic is constantly evolving over time. We show that our approach is efficient in terms of online maintenance and effective in terms various types of real-time context- and time-sensitive queries. We evaluate our results on both social and collaborative network data sets.","PeriodicalId":20567,"journal":{"name":"Proceedings of the Ninth ACM International Conference on Web Search and Data Mining","volume":"1101 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76744016","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Temporal Formation and Evolution of Online Communities","authors":"Hossein Fani","doi":"10.1145/2835776.2855089","DOIUrl":"https://doi.org/10.1145/2835776.2855089","url":null,"abstract":"Researchers have already studied the identification of online communities and the possible impact or influence relationships from several perspectives. For instance, communities of users that are formed based on shared relationships and topological similarities, or communities that consist of users that share similar content. However, little work has been done on detection of communities that simultaneously share topical and temporal similarities. Furthermore, these studies have not explored the causation relationship between the communities. Causation provides systematic explanation as to why communities are formed and helps to predict future communities. This proposal will address two main research questions: i) how can communities that share topical and temporal similarities be identified, and ii) how can causation relation between different online communities be detected and modelled. We model users' behaviour towards topics of interest through multivariate time series to identify like-minded communities. Further, we employ Granger's concept of causality to infer causation between detected communities from corresponding users' time series. Granger causality is the prominent approach in time series modelling and rests on a firm statistical foundation. We assess the proposed community detection methods through comparison with the state of the art and verify the causal model through its prediction accuracy.","PeriodicalId":20567,"journal":{"name":"Proceedings of the Ninth ACM International Conference on Web Search and Data Mining","volume":"107 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81501166","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ahmed M. Aly, Hazem Elmeleegy, Yan Qi, Walid G. Aref
{"title":"Kangaroo: Workload-Aware Processing of Range Data and Range Queries in Hadoop","authors":"Ahmed M. Aly, Hazem Elmeleegy, Yan Qi, Walid G. Aref","doi":"10.1145/2835776.2835841","DOIUrl":"https://doi.org/10.1145/2835776.2835841","url":null,"abstract":"Despite the importance and widespread use of range data, e.g., time intervals, spatial ranges, etc., little attention has been devoted to study the processing and querying of range data in the context of big data. The main challenge relies in the nature of the traditional index structures e.g., B-Tree and R-Tree, being centralized by nature, and hence are almost crippled when deployed in a distributed environment. To address this challenge, this paper presents Kangaroo, a system built on top of Hadoop to optimize the execution of range queries over range data. The main idea behind Kangaroo is to split the data into non-overlapping partitions in a way that minimizes the query execution time. Kangaroo is query workload-aware, i.e., results in partitioning layouts that minimize the query processing time of given query patterns. In this paper, we study the design challenges Kangaroo addresses in order to be deployed on top of a distributed file system, i.e., HDFS. We also study four different partitioning schemes that Kangaroo can support. With extensive experiments using real range data of more than one billion records and real query workload of more than 30,000 queries, we show that the partitioning schemes of Kangaroo can significantly reduce the I/O of range queries on range data.","PeriodicalId":20567,"journal":{"name":"Proceedings of the Ninth ACM International Conference on Web Search and Data Mining","volume":"2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86230403","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Understanding Offline Political Systems by Mining Online Political Data","authors":"D. Lazer, Oren Tsur, Tina Eliassi-Rad","doi":"10.1145/2835776.2855112","DOIUrl":"https://doi.org/10.1145/2835776.2855112","url":null,"abstract":"\"Man is by nature a political animal\", as asserted by Aristotle. This political nature manifests itself in the data we produce and the traces we leave online. In this tutorial, we address a number of fundamental issues regarding mining of political data: What types of data could be considered political? What can we learn from such data? Can we use the data for prediction of political changes, etc? How can these prediction tasks be done efficiently? Can we use online socio-political data in order to get a better understanding of our political systems and of recent political changes? What are the pitfalls and inherent shortcomings of using online data for political analysis? In recent years, with the abundance of data, these questions, among others, have gained importance, especially in light of the global political turmoil and the upcoming 2016 US presidential election. We introduce relevant political science theory, describe the challenges within the framework of computational social science and present state of the art approaches bridging social network analysis, graph mining, and natural language processing.","PeriodicalId":20567,"journal":{"name":"Proceedings of the Ninth ACM International Conference on Web Search and Data Mining","volume":"5 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79076232","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Collaborative Denoising Auto-Encoders for Top-N Recommender Systems","authors":"Yao Wu, Christopher DuBois, A. Zheng, M. Ester","doi":"10.1145/2835776.2835837","DOIUrl":"https://doi.org/10.1145/2835776.2835837","url":null,"abstract":"Most real-world recommender services measure their performance based on the top-N results shown to the end users. Thus, advances in top-N recommendation have far-ranging consequences in practical applications. In this paper, we present a novel method, called Collaborative Denoising Auto-Encoder (CDAE), for top-N recommendation that utilizes the idea of Denoising Auto-Encoders. We demonstrate that the proposed model is a generalization of several well-known collaborative filtering models but with more flexible components. Thorough experiments are conducted to understand the performance of CDAE under various component settings. Furthermore, experimental results on several public datasets demonstrate that CDAE consistently outperforms state-of-the-art top-N recommendation methods on a variety of common evaluation metrics.","PeriodicalId":20567,"journal":{"name":"Proceedings of the Ninth ACM International Conference on Web Search and Data Mining","volume":"36 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73517419","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Past and Future of Systems for Current Events","authors":"Mor Naaman","doi":"10.1145/2835776.2835850","DOIUrl":"https://doi.org/10.1145/2835776.2835850","url":null,"abstract":"People share in social media an overwhelming amount of content from real-world events. These events range from major global events like an uprising or an earthquake, to local events and emergencies such as a fire or a parade; from media events like the Oscar's, to events that enjoy little media coverage such as a conference or a music concert. This shared media represents an important part of our society, culture and history. At the same time, this social media content is still fragmented across services, hard to find, and difficult to consume and understand.","PeriodicalId":20567,"journal":{"name":"Proceedings of the Ninth ACM International Conference on Web Search and Data Mining","volume":"118 4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74597107","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Keynote Speaker Bio","authors":"Yiling Chen","doi":"10.1145/2835776.2835845","DOIUrl":"https://doi.org/10.1145/2835776.2835845","url":null,"abstract":"Chen has served as the Tutorial Chair of the ACM Conference on Electronic Commerce (EC), 2008, and on the Program Committee for the International World Wide Web Conference (WWW), 2008, and the International Workshop on Internet and Network Economics (WINE), 2008. She has co-organized the 2nd and the 3rd Workshops on Prediction Markets, 2007-2008. She has also been a reviewer for Management Science, Information Systems Research, Decision Support Systems, Information Systems and e-Business Management, and various conferences. Chen’s awards include Outstanding Paper Award, ACM Conference on Electronic Commerce (EC), 2008; Honorable Mention, Decision Science Institute Doctoral Dissertation Competition, 2006; and eBRC Doctoral Support Award, eBusiness Research Center at Penn State University, 2004.","PeriodicalId":20567,"journal":{"name":"Proceedings of the Ninth ACM International Conference on Web Search and Data Mining","volume":"25 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74501704","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Qi Li, Fenglong Ma, Jing Gao, Lu Su, Christopher J. Quinn
{"title":"Crowdsourcing High Quality Labels with a Tight Budget","authors":"Qi Li, Fenglong Ma, Jing Gao, Lu Su, Christopher J. Quinn","doi":"10.1145/2835776.2835797","DOIUrl":"https://doi.org/10.1145/2835776.2835797","url":null,"abstract":"In the past decade, commercial crowdsourcing platforms have revolutionized the ways of classifying and annotating data, especially for large datasets. Obtaining labels for a single instance can be inexpensive, but for large datasets, it is important to allocate budgets wisely. With limited budgets, requesters must trade-off between the quantity of labeled instances and the quality of the final results. Existing budget allocation methods can achieve good quantity but cannot guarantee high quality of individual instances under a tight budget. However, in some scenarios, requesters may be willing to label fewer instances but of higher quality. Moreover, they may have different requirements on quality for different tasks. To address these challenges, we propose a flexible budget allocation framework called Requallo. Requallo allows requesters to set their specific requirements on the labeling quality and maximizes the number of labeled instances that achieve the quality requirement under a tight budget. The budget allocation problem is modeled as a Markov decision process and a sequential labeling policy is produced. The proposed policy greedily searches for the instance to query next as the one that can provide the maximum reward for the goal. The Requallo framework is further extended to consider worker reliability so that the budget can be better allocated. Experiments on two real-world crowdsourcing tasks as well as a simulated task demonstrate that when the budget is tight, the proposed Requallo framework outperforms existing state-of-the-art budget allocation methods from both quantity and quality aspects.","PeriodicalId":20567,"journal":{"name":"Proceedings of the Ninth ACM International Conference on Web Search and Data Mining","volume":"2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84581041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}