Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining: Latest Papers

Contextual Spatial Outlier Detection with Metric Learning
Guanjie Zheng, S. Brantley, T. Lauvaux, Z. Li
{"title":"Contextual Spatial Outlier Detection with Metric Learning","authors":"Guanjie Zheng, S. Brantley, T. Lauvaux, Z. Li","doi":"10.1145/3097983.3098143","DOIUrl":"https://doi.org/10.1145/3097983.3098143","url":null,"abstract":"Hydraulic fracturing (or \"fracking\") is a revolutionary well stimulation technique for shale gas extraction, but has spawned controversy in environmental contamination. If methane from gas wells leaks extensively, this greenhouse gas can impact drinking water wells and enhance global warming. Our work is motivated by this heated debate on environmental issue and focuses on general data analytical techniques to detect anomalous spatial data samples (e.g., water samples related to potential leakages). Specifically, we propose a spatial outlier detection method based on contextual neighbors. Different from existing work, our approach utilizes both spatial attributes and non-spatial contextual attributes to define neighbors. We further use robust metric learning to combine different contextual attributes in order to find meaningful neighbors. Our technique can be applied to any spatial dataset. Extensive experimental results on five real-world datasets demonstrate the effectiveness of our approach. We also show some interesting case studies, including one case linking to leakage of a gas well.","PeriodicalId":314049,"journal":{"name":"Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130476240","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 38
Automated Categorization of Onion Sites for Analyzing the Darkweb Ecosystem
Shalini Ghosh, Ariyam Das, Phillip A. Porras, V. Yegneswaran, Ashish Gehani
{"title":"Automated Categorization of Onion Sites for Analyzing the Darkweb Ecosystem","authors":"Shalini Ghosh, Ariyam Das, Phillip A. Porras, V. Yegneswaran, Ashish Gehani","doi":"10.1145/3097983.3098193","DOIUrl":"https://doi.org/10.1145/3097983.3098193","url":null,"abstract":"Onion sites on the darkweb operate using the Tor Hidden Service (HS) protocol to shield their locations on the Internet, which (among other features) enables these sites to host malicious and illegal content while being resistant to legal action and seizure. Identifying and monitoring such illicit sites in the darkweb is of high relevance to the Computer Security and Law Enforcement communities. We have developed an automated infrastructure that crawls and indexes content from onion sites into a large-scale data repository, called LIGHTS, with over 100M pages. In this paper we describe Automated Tool for Onion Labeling (ATOL), a novel scalable analysis service developed to conduct a thematic assessment of the content of onion sites in the LIGHTS repository. ATOL has three core components -- (a) a novel keyword discovery mechanism (ATOLKeyword) which extends analyst-provided keywords for different categories by suggesting new descriptive and discriminative keywords that are relevant for the categories; (b) a classification framework (ATOLClassify) that uses the discovered keywords to map onion site content to a set of categories when sufficient labeled data is available; (c) a clustering framework (ATOLCluster) that can leverage information from multiple external heterogeneous knowledge sources, ranging from domain expertise to Bitcoin transaction data, to categorize onion content in the absence of sufficient supervised data. The paper presents empirical results of ATOL on onion datasets derived from the LIGHTS repository, and additionally benchmarks ATOL's algorithms on the publicly available 20 Newsgroups dataset to demonstrate the reproducibility of its results. 
On the LIGHTS dataset, ATOLClassify gives a 12% performance gain over an analyst-provided baseline, while ATOLCluster gives a 7% improvement over state-of-the-art semi-supervised clustering algorithms. We also discuss how ATOL has been deployed and externally evaluated, as part of the LIGHTS system.","PeriodicalId":314049,"journal":{"name":"Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"83 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115735030","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 38
Point-of-Interest Demand Modeling with Human Mobility Patterns
Yanchi Liu, Chuanren Liu, Xinjiang Lu, Mingfei Teng, Hengshu Zhu, Hui Xiong
{"title":"Point-of-Interest Demand Modeling with Human Mobility Patterns","authors":"Yanchi Liu, Chuanren Liu, Xinjiang Lu, Mingfei Teng, Hengshu Zhu, Hui Xiong","doi":"10.1145/3097983.3098168","DOIUrl":"https://doi.org/10.1145/3097983.3098168","url":null,"abstract":"Point-of-Interest (POI) demand modeling in urban regions is critical for many applications such as business site selection and real estate investment. While some efforts have been made for the demand analysis of some specific POI categories, such as restaurants, it lacks systematic means to support POI demand modeling. To this end, in this paper, we develop a systematic POI demand modeling framework, named Region POI Demand Identification (RPDI), to model POI demands by exploiting the daily needs of people identified from their large-scale mobility data. Specifically, we first partition the urban space into spatially differentiated neighborhood regions formed by many small local communities. Then, the daily activity patterns of people traveling in the city will be extracted from human mobility data. Since the trip activities, even aggregated, are sparse and insufficient to directly identify the POI demands, especially for underdeveloped regions, we develop a latent factor model that integrates human mobility data, POI profiles, and demographic data to robustly model the POI demand of urban regions in a holistic way. In this model, POI preferences and supplies are used together with demographic features to estimate the POI demands simultaneously for all the urban regions interconnected in the city. Moreover, we also design efficient algorithms to optimize the latent model for large-scale data. 
Finally, experimental results on real-world data in New York City (NYC) show that our method is effective for identifying POI demands for different regions.","PeriodicalId":314049,"journal":{"name":"Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114001505","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 57
Fast Newton Hard Thresholding Pursuit for Sparsity Constrained Nonconvex Optimization
Jinghui Chen, Quanquan Gu
{"title":"Fast Newton Hard Thresholding Pursuit for Sparsity Constrained Nonconvex Optimization","authors":"Jinghui Chen, Quanquan Gu","doi":"10.1145/3097983.3098165","DOIUrl":"https://doi.org/10.1145/3097983.3098165","url":null,"abstract":"We propose a fast Newton hard thresholding pursuit algorithm for sparsity constrained nonconvex optimization. Our proposed algorithm reduces the per-iteration time complexity to linear in the data dimension d compared with cubic time complexity in Newton's method, while preserving faster computational and statistical convergence rates. In particular, we prove that the proposed algorithm converges to the unknown sparse model parameter at a composite rate, namely quadratic at first and linear when it gets close to the true parameter, up to the minimax optimal statistical precision of the underlying model. Thorough experiments on both synthetic and real datasets demonstrate that our algorithm outperforms the state-of-the-art optimization algorithms for sparsity constrained optimization.","PeriodicalId":314049,"journal":{"name":"Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123004168","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 18
Dipole
{"title":"Dipole","authors":"","doi":"10.1145/3097983.3098088","DOIUrl":"https://doi.org/10.1145/3097983.3098088","url":null,"abstract":"Predicting the future health information of patients from the historical Electronic Health Records (EHR) is a core research task in the development of personalized healthcare. Patient EHR data consist of sequences of visits over time, where each visit contains multiple medical codes, including diagnosis, medication, and procedure codes. The most important challenges for this task are to model the temporality and high dimensionality of sequential EHR data and to interpret the prediction results. Existing work solves this problem by employing recurrent neural networks (RNNs) to model EHR data and utilizing simple attention mechanism to interpret the results. However, RNN-based approaches suffer from the problem that the performance of RNNs drops when the length of sequences is large, and the relationships between subsequent visits are ignored by current RNN-based approaches. To address these issues, we propose Dipole, an end-to-end, simple and robust model for predicting patients' future health information. Dipole employs bidirectional recurrent neural networks to remember all the information of both the past visits and the future visits, and it introduces three attention mechanisms to measure the relationships of different visits for the prediction. With the attention mechanisms, Dipole can interpret the prediction results effectively. Dipole also allows us to interpret the learned medical code representations which are confirmed positively by medical experts. 
Experimental results on two real world EHR datasets show that the proposed Dipole can significantly improve the prediction accuracy compared with the state-of-the-art diagnosis prediction approaches and provide clinically meaningful interpretation.","PeriodicalId":314049,"journal":{"name":"Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125958159","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
End-to-end Learning for Short Text Expansion
Jian Tang, Yue Wang, Kai Zheng, Q. Mei
{"title":"End-to-end Learning for Short Text Expansion","authors":"Jian Tang, Yue Wang, Kai Zheng, Q. Mei","doi":"10.1145/3097983.3098166","DOIUrl":"https://doi.org/10.1145/3097983.3098166","url":null,"abstract":"Effectively making sense of short texts is a critical task for many real world applications such as search engines, social media services, and recommender systems. The task is particularly challenging as a short text contains very sparse information, often too sparse for a machine learning algorithm to pick up useful signals. A common practice for analyzing short text is to first expand it with external information, which is usually harvested from a large collection of longer texts. In literature, short text expansion has been done with all kinds of heuristics. We propose an end-to-end solution that automatically learns how to expand short text to optimize a given learning task. A novel deep memory network is proposed to automatically find relevant information from a collection of longer documents and reformulate the short text through a gating mechanism. Using short text classification as a demonstrating task, we show that the deep memory network significantly outperforms classical text expansion methods with comprehensive experiments on real world data sets.","PeriodicalId":314049,"journal":{"name":"Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124716500","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 16
"The Leicester City Fairytale?": Utilizing New Soccer Analytics Tools to Compare Performance in the 15/16 & 16/17 EPL Seasons
Héctor Ruiz, P. Power, Xinyu Wei, P. Lucey
{"title":"\"The Leicester City Fairytale?\": Utilizing New Soccer Analytics Tools to Compare Performance in the 15/16 & 16/17 EPL Seasons","authors":"Héctor Ruiz, P. Power, Xinyu Wei, P. Lucey","doi":"10.1145/3097983.3098121","DOIUrl":"https://doi.org/10.1145/3097983.3098121","url":null,"abstract":"The last two years have been somewhat of a rollercoaster for English Premier League (EPL) team Leicester City. In the 2015/16 season, against all odds and logic, they won the league to much fanfare. Fast-forward nine months later, and they are battling relegation. What could describe this fluctuating form? As soccer is a very complex and strategic game, common statistics (e.g., passes, shots, possession) do not really tell the full story on how a team succeeds and fails. However, using machine learning tools and a plethora of data, it is now possible to obtain some insights into how a team performs. To showcase the utility of these new tools (i.e., expected goal value, expected save value, strategy-plots and passing quality measures), we first analyze the EPL 2015/16 season with a specific emphasis on the champions Leicester City, and then compare it to the current one. Finally, we show how these features can be used to predict future performance.","PeriodicalId":314049,"journal":{"name":"Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"183 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128274344","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 21
Collecting and Analyzing Millions of mHealth Data Streams
Thomas R Quisel, L. Foschini, Alessio Signorini, David C. Kale
{"title":"Collecting and Analyzing Millions of mHealth Data Streams","authors":"Thomas R Quisel, L. Foschini, Alessio Signorini, David C. Kale","doi":"10.1145/3097983.3098201","DOIUrl":"https://doi.org/10.1145/3097983.3098201","url":null,"abstract":"Players across the health ecosystem are initiating studies of thousands, even millions, of participants to gather diverse types of data, including biomedical, behavioral, and lifestyle in order to advance medical research. These efforts to collect multi-modal data sets on large cohorts coincide with the rise of broad activity and behavior tracking across industries, particularly in healthcare and the growing field of mobile health (mHealth). Government and pharmaceutical sponsored, as well as patient-driven group studies in this arena leverage the ability of mobile technology to continuously track behaviors and environmental factors with minimal participant burden. However, the adoption of mHealth has been constrained by the lack of robust solutions for large-scale data collection in free-living conditions and concerns around data quality. In this work, we describe the infrastructure Evidation Health has developed to collect mHealth data from millions of users through hundreds of different mobile devices and apps. Additionally, we provide evidence of the utility of the data for inferring individual traits pertaining to health, wellness, and behavior. To this end, we introduce and evaluate deep neural network models that achieve high prediction performance without requiring any feature engineering when trained directly on the densely sampled multivariate mHealth time series data. 
We believe that the present work substantiates both the feasibility and the utility of creating a very large mHealth research cohort, as envisioned by the many large cohort studies currently underway across therapeutic areas and conditions.","PeriodicalId":314049,"journal":{"name":"Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128459330","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 12
Matching Restaurant Menus to Crowdsourced Food Data: A Scalable Machine Learning Approach
Hesamoddin Salehian, Patrick D. Howell, Chul Lee
{"title":"Matching Restaurant Menus to Crowdsourced Food Data: A Scalable Machine Learning Approach","authors":"Hesamoddin Salehian, Patrick D. Howell, Chul Lee","doi":"10.1145/3097983.3098125","DOIUrl":"https://doi.org/10.1145/3097983.3098125","url":null,"abstract":"We study the problem of how to match a formally structured restaurant menu item to a large database of less structured food items that has been collected via crowd-sourcing. At first glance, this problem scenario looks like a typical text matching problem that might possibly be solved with existing text similarity learning approaches. However, due to the unique nature of our scenario and the need for scalability, our problem imposes certain restrictions on possible machine learning approaches that we can employ. We propose a novel, practical, and scalable machine learning solution architecture, consisting of two major steps. First we use a query generation approach, based on a Markov Decision Process algorithm, to reduce the time complexity of searching for matching candidates. That is then followed by a re-ranking step, using deep learning techniques, to meet our required matching quality goals. 
It is important to note that our proposed solution architecture has already been deployed in a real application system serving tens of millions of users, and shows great potential for practical cases of user-entered text to structured text matching, especially when scalability is crucial.","PeriodicalId":314049,"journal":{"name":"Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131269433","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10
Backpage and Bitcoin: Uncovering Human Traffickers
Rebecca S. Portnoff, D. Huang, Periwinkle Doerfler, Sadia Afroz, Damon McCoy
{"title":"Backpage and Bitcoin: Uncovering Human Traffickers","authors":"Rebecca S. Portnoff, D. Huang, Periwinkle Doerfler, Sadia Afroz, Damon McCoy","doi":"10.1145/3097983.3098082","DOIUrl":"https://doi.org/10.1145/3097983.3098082","url":null,"abstract":"Sites for online classified ads selling sex are widely used by human traffickers to support their pernicious business. The sheer quantity of ads makes manual exploration and analysis unscalable. In addition, discerning whether an ad is advertising a trafficked victim or an independent sex worker is a very difficult task. Very little concrete ground truth (i.e., ads definitively known to be posted by a trafficker) exists in this space. In this work, we develop tools and techniques that can be used separately and in conjunction to group sex ads by their true owner (and not the claimed author in the ad). Specifically, we develop a machine learning classifier that uses stylometry to distinguish between ads posted by the same vs. different authors with 90% TPR and 1% FPR. We also design a linking technique that takes advantage of leakages from the Bitcoin mempool, blockchain and sex ad site, to link a subset of sex ads to Bitcoin public wallets and transactions. 
Finally, we demonstrate via a 4-week proof of concept using Backpage as the sex ad site, how an analyst can use these automated approaches to potentially find human traffickers.","PeriodicalId":314049,"journal":{"name":"Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125238993","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 57