Liangyue Li, Hanghang Tong, Nan Cao, Kate Ehrlich, Y. Lin, Norbou Buchler
{"title":"TEAMOPT: Interactive Team Optimization in Big Networks","authors":"Liangyue Li, Hanghang Tong, Nan Cao, Kate Ehrlich, Y. Lin, Norbou Buchler","doi":"10.1145/2983323.2983340","DOIUrl":"https://doi.org/10.1145/2983323.2983340","url":null,"abstract":"The science of team science is a rapidly emerging research field that studies strategies to understand and enhance the process and outcomes of collaborative, team-based research. An interesting research question we address in this work is how to maintain and optimize the team performance should certain changes happen to the team. In particular, we take the network approach to understanding the teams and consider optimizing the teams with several operations (e.g., replacement, expansion, shrinkage). We develop TEAMOPT, a system to assist users in optimizing the team performance interactively to support the changes to a team. TEAMOPT takes as input a large network of individuals (e.g., co-author network of researchers) and is able to assist users in assembling a team with specific requirements and optimizing the team in response to the changes made to the team. It is effective in finding the best candidates, and interactive with users' feedback in the loop. The system is developed using HTML5, JavaScript, D3.js (front-end) and Python CGI (back-end). A prototype system is already deployed. We will invite the audience to experiment with our TEAMOPT in terms of its effectiveness, efficiency and applicability to various scenarios.","PeriodicalId":250808,"journal":{"name":"Proceedings of the 25th ACM International on Conference on Information and Knowledge Management","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124136200","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xiaomo Liu, Quanzhi Li, Armineh Nourbakhsh, Rui Fang, Merine Thomas, Kajsa Anderson, Russell Kociuba, Mark Vedder, Steven Pomerville, Ramdev Wudali, Robert Martin, John Duprey, Arun Vachher, William Keenan, Sameena Shah
{"title":"Reuters Tracer: A Large Scale System of Detecting & Verifying Real-Time News Events from Twitter","authors":"Xiaomo Liu, Quanzhi Li, Armineh Nourbakhsh, Rui Fang, Merine Thomas, Kajsa Anderson, Russell Kociuba, Mark Vedder, Steven Pomerville, Ramdev Wudali, Robert Martin, John Duprey, Arun Vachher, William Keenan, Sameena Shah","doi":"10.1145/2983323.2983363","DOIUrl":"https://doi.org/10.1145/2983323.2983363","url":null,"abstract":"News professionals are facing the challenge of discovering news from more diverse and unreliable information in the age of social media. More and more news events break on social media first and are picked up by news media subsequently. The recent Brussels attack is such an example. At Reuters, a global news agency, we have observed the necessity of providing a more effective tool that can help our journalists to quickly discover news on social media, verify them and then inform the public. In this paper, we describe Reuters Tracer, a system for sifting through all noise to detect news events on Twitter and assessing their veracity. We disclose the architecture of our system and discuss the various design strategies that facilitate the implementation of machine learning models for noise filtering and event detection. These techniques have been implemented at large scale and successfully discovered breaking news faster than traditional journalism","PeriodicalId":250808,"journal":{"name":"Proceedings of the 25th ACM International on Conference on Information and Knowledge Management","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126329478","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tianyou Guo, Jun Xu, Xiaohui Yan, Jianpeng Hou, Ping Li, Zhaohui Li, J. Guo, Xueqi Cheng
{"title":"Ease the Process of Machine Learning with Dataflow","authors":"Tianyou Guo, Jun Xu, Xiaohui Yan, Jianpeng Hou, Ping Li, Zhaohui Li, J. Guo, Xueqi Cheng","doi":"10.1145/2983323.2983327","DOIUrl":"https://doi.org/10.1145/2983323.2983327","url":null,"abstract":"Machine learning algorithms have become the key components in many big data applications. However, the full potential of machine learning is still far from been realized because using machine learning algorithms is hard, especially on distributed platforms such as Hadoop and Spark. The key barriers come from not only the implementation of the algorithms themselves, but also the processing for applying them to real applications which often involve multiple steps and different algorithms. In this demo we present a general-purpose dataflow-based system for easing the process of applying machine learning algorithms to real world tasks. In the system, a learning task is formulated as a directed acyclic graph (DAG) in which each node represents an operation (e.g., a machine learning algorithm), and each edge represents the flow of the data from one node to its descendants. Graphical user interface is implemented for making users to create, configure, submit, and monitor a task in a drag-and-drop manner. Advantages of the system include 1) lowering the barriers of defining and executing machine learning tasks; 2) sharing and re-using the implementations of the algorithms, the task dataflow DAGs, and the (intermediate) experimental results; 3) seamlessly integrating the stand-alone algorithms as well as the distributed algorithms in one task. The system has been deployed as a machine learning service and can be access from the Internet.","PeriodicalId":250808,"journal":{"name":"Proceedings of the 25th ACM International on Conference on Information and Knowledge Management","volume":"81 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126220628","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Memory-based Recommendations of Entities for Web Search Users","authors":"Ignacio Fernández-Tobías, Roi Blanco","doi":"10.1145/2983323.2983823","DOIUrl":"https://doi.org/10.1145/2983323.2983823","url":null,"abstract":"Modern search engines have evolved from mere document retrieval systems to platforms that assist the users in discovering new information. In this context, entity recommendation systems exploit query log data to proactively provide the users with suggestions of entities (people, movies, places, etc.) from knowledge bases that are relevant for their current information need. Previous works consider the problem of ranking facts and entities related to the user's current query, or focus on specific recommendation domains requiring supervised selection and extraction of features from knowledge bases. In this paper we propose a set of domain-agnostic methods based on nearest neighbors collaborative filtering that exploit query log data to generate entity suggestions, taking into account the user's full search session. Our experimental results on a large dataset from a commercial search engine show that the proposed methods are able to compute relevant entity recommendations outperforming a number of baselines. Finally, we perform an analysis on a cross-domain scenario using different entity types, and conclude that even if knowing the right target domain is important for providing effective recommendations, some inter-domain user interactions are helpful for the task at hand.","PeriodicalId":250808,"journal":{"name":"Proceedings of the 25th ACM International on Conference on Information and Knowledge Management","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126410872","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yashar Moshfeghi, Kristiyan Velinov, P. Triantafillou
{"title":"Improving Search Results with Prior Similar Queries","authors":"Yashar Moshfeghi, Kristiyan Velinov, P. Triantafillou","doi":"10.1145/2983323.2983890","DOIUrl":"https://doi.org/10.1145/2983323.2983890","url":null,"abstract":"This paper describes a novel approach to re-ranking search engine result pages (SERP): Its fundamental principle is to re-rank results to a given query, based on exploiting evidence gathered from past similar search queries. Our approach is inspired by collaborative filtering, with the main challenge being to find the set of similar queries, while also taking efficiency into account. In particular, our approach aims to address this challenge by proposing a combination of a similarity graph and a locality sensitive hashing scheme. We construct a set of features from our similarity graph and build a prediction model using the Hoeffding decision tree algorithm. We have evaluated the effectiveness of our model in terms of P@1, MAP@10, and nDCG@10, using the Yandex Data Challenge data set. We have compared the performance of our model against two baselines, namely, the Yandex initial ranking and the decision tree model learnt on the same set of features when extracted based on query repetition (i.e. excluding the evidence of similar queries in our approach). Our results reveal that the proposed approach consistently and (statistically) significantly outperforms both baselines.","PeriodicalId":250808,"journal":{"name":"Proceedings of the 25th ACM International on Conference on Information and Knowledge Management","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125450063","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Tracking Virality and Susceptibility in Social Media","authors":"Tuan-Anh Hoang, Ee-Peng Lim","doi":"10.1145/2983323.2983800","DOIUrl":"https://doi.org/10.1145/2983323.2983800","url":null,"abstract":"In social media, the magnitude of information propagation hinges on the virality and susceptibility of users spreading and receiving the information respectively, as well as the virality of information items. These users' and items' behavioral factors evolve dynamically at the same time interacting with one another. Previous works however measure the factors statically and independently in a restricted case: each user has only a single adoption on each item, and/or users' exposure to items are observable. In this work, we investigate the inter-relationship among the factors and users' multiple adoptions on items to propose both new static and temporal models for measuring the factors without requiring user - item exposure. These models are designed to cope with even more realistic propagation scenarios where an item may be propagated many times from the same user(s) to the same other user(s). We further propose an incremental model for measuring the factors in large data streams. We evaluated the proposed models and existing models through extensive experiments on a large Twitter dataset covering information propagation in one month. The experiments show that our proposed models can effectively mine the behavioral factors and outperform the existing ones in a propagation prediction task. The incremental model is shown more than 10 times faster than the temporal model, while still obtains very similar results.","PeriodicalId":250808,"journal":{"name":"Proceedings of the 25th ACM International on Conference on Information and Knowledge Management","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125484742","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive Evolutionary Filtering in Real-Time Twitter Stream","authors":"Feifan Fan, Yansong Feng, Lili Yao, Dongyan Zhao","doi":"10.1145/2983323.2983760","DOIUrl":"https://doi.org/10.1145/2983323.2983760","url":null,"abstract":"With the explosive growth of microblogging service, Twitter has become a leading platform consisting of real-time world wide information. Users tend to explore breaking news or general topics in Twitter according to their interests. However, the explosive amount of incoming tweets leads users to information overload. Therefore, filtering interesting tweets based on users' interest profiles from real-time stream can be helpful for users to easily access the relevant and key information hidden among the tweets. On the other hand, real-time twitter stream contains enormous amount of noisy and redundant tweets. Hence, the filtering process should consider previously pushed interesting tweets to provide users with diverse tweets. What's more, different from traditional document summarization methods which focus on static dataset, the twitter stream is dynamic, fast-arriving and large-scale, which means we have to decide whether to filter the coming tweet for users from the real-time stream as early as possible. In this paper, we propose a novel adaptive evolutionary filtering framework to push interesting tweets for users from real-time twitter stream. First, we propose an adaptive evolutionary filtering algorithm to filter interesting tweets from the twitter stream with respect to user interest profiles. And then we utilize the maximal marginal relevance model in fixed time window to estimate the relevance and diversity of potential tweets. Besides, to overcome the enormous number of redundant tweets and characterize the diversity of potential tweets, we propose a hierarchical tweet representation learning model (HTM) to learn the tweet representations dynamically over time. Experiments on large scale real-time twitter stream datasets demonstrate the efficiency and effectiveness of our framework.","PeriodicalId":250808,"journal":{"name":"Proceedings of the 25th ACM International on Conference on Information and Knowledge Management","volume":"96 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125623287","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Stefan Heindorf, Martin Potthast, Benno Stein, Gregor Engels
{"title":"Vandalism Detection in Wikidata","authors":"Stefan Heindorf, Martin Potthast, Benno Stein, Gregor Engels","doi":"10.1145/2983323.2983740","DOIUrl":"https://doi.org/10.1145/2983323.2983740","url":null,"abstract":"Wikidata is the new, large-scale knowledge base of the Wikimedia Foundation. Its knowledge is increasingly used within Wikipedia itself and various other kinds of information systems, imposing high demands on its integrity. Wikidata can be edited by anyone and, unfortunately, it frequently gets vandalized, exposing all information systems using it to the risk of spreading vandalized and falsified information. In this paper, we present a new machine learning-based approach to detect vandalism in Wikidata. We propose a set of 47 features that exploit both content and context information, and we report on 4 classifiers of increasing effectiveness tailored to this learning task. Our approach is evaluated on the recently published Wikidata Vandalism Corpus WDVC-2015 and it achieves an area under curve value of the receiver operating characteristic, ROC-AUC, of 0.991. It significantly outperforms the state of the art represented by the rule-based Wikidata Abuse Filter (0.865 ROC-AUC) and a prototypical vandalism detector recently introduced by Wikimedia within the Objective Revision Evaluation Service (0.859 ROC-AUC).","PeriodicalId":250808,"journal":{"name":"Proceedings of the 25th ACM International on Conference on Information and Knowledge Management","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122050469","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Making Sense of Entities and Quantities in Web Tables","authors":"Yusra Ibrahim, Mirek Riedewald, G. Weikum","doi":"10.1145/2983323.2983772","DOIUrl":"https://doi.org/10.1145/2983323.2983772","url":null,"abstract":"HTML tables and spreadsheets on the Internet or in enterprise intranets often contain valuable information, but are created ad-hoc. As a result, they usually lack systematic names for column headers and clear vocabulary for cell values. This limits the re-use of such tables and creates a huge heterogeneity problem when comparing or aggregating multiple tables. This paper aims to overcome this problem by automatically canonicalizing header names and cell values onto concepts, classes, entities and uniquely represented quantities registered in a knowledge base. To this end, we devise a probabilistic graphical model that captures coherence dependencies between cells in tables and candidate items in the space of concepts, entities and quantities. We give specific consideration to quantities which are mapped into a \"measure, value, unit\" triple over a taxonomy of physical (e.g. power consumption), monetary (e.g. revenue), temporal (e.g. date) and dimensionless (e.g. counts) measures. Our experiments with Web tables from diverse domains demonstrate the viability of our method and its benefits over baselines.","PeriodicalId":250808,"journal":{"name":"Proceedings of the 25th ACM International on Conference on Information and Knowledge Management","volume":"114 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128193155","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Fatigue Strength Predictor for Steels Using Ensemble Data Mining: Steel Fatigue Strength Predictor","authors":"Ankit Agrawal, A. Choudhary","doi":"10.1145/2983323.2983343","DOIUrl":"https://doi.org/10.1145/2983323.2983343","url":null,"abstract":"Fatigue strength is one of the most important mechanical properties of steel. High cost and time for fatigue testing, and potentially disastrous consequences of fatigue failures motivates the development of predictive models for this property. We have developed advanced data-driven ensemble predictive models for this purpose with an extremely high cross-validated accuracy of >98%, and have deployed these models in a user-friendly online web-tool, which can make very fast predictions of fatigue strength for a given steel represented by its composition and processing information. Such a tool with fast and accurate models is expected to be a very useful resource for the materials science researchers and practitioners to assist in their search for new and improved quality steels. The web-tool is available at http://info.eecs.northwestern.edu/SteelFatigueStrengthPredictor","PeriodicalId":250808,"journal":{"name":"Proceedings of the 25th ACM International on Conference on Information and Knowledge Management","volume":"237 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121630137","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}