Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining最新文献

筛选
英文 中文
An Efficient Bandit Algorithm for Realtime Multivariate Optimization 实时多元优化的高效Bandit算法
Daniel N. Hill, Houssam Nassif, Yi Liu, Anand Iyer, S. Vishwanathan
{"title":"An Efficient Bandit Algorithm for Realtime Multivariate Optimization","authors":"Daniel N. Hill, Houssam Nassif, Yi Liu, Anand Iyer, S. Vishwanathan","doi":"10.1145/3097983.3098184","DOIUrl":"https://doi.org/10.1145/3097983.3098184","url":null,"abstract":"Optimization is commonly employed to determine the content of web pages, such as to maximize conversions on landing pages or click-through rates on search engine result pages. Often the layout of these pages can be decoupled into several separate decisions. For example, the composition of a landing page may involve deciding which image to show, which wording to use, what color background to display, etc. Such optimization is a combinatorial problem over an exponentially large decision space. Randomized experiments do not scale well to this setting, and therefore, in practice, one is typically limited to optimizing a single aspect of a web page at a time. This represents a missed opportunity in both the speed of experimentation and the exploitation of possible interactions between layout decisions Here we focus on multivariate optimization of interactive web pages. We formulate an approach where the possible interactions between different components of the page are modeled explicitly. We apply bandit methodology to explore the layout space efficiently and use hill-climbing to select optimal content in realtime. Our algorithm also extends to contextualization and personalization of layout selection. Simulation results show the suitability of our approach to large decision spaces with strong interactions between content. We further apply our algorithm to optimize a message that promotes adoption of an Amazon service. After only a single week of online optimization, we saw a 21% conversion increase compared to the median layout. Our technique is currently being deployed to optimize content across several locations at Amazon.com.","PeriodicalId":314049,"journal":{"name":"Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131119929","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 95
Prognosis and Diagnosis of Parkinson's Disease Using Multi-Task Learning 多任务学习对帕金森病预后和诊断的影响
S. Emrani, Anya McGuirk, Wei Xiao
{"title":"Prognosis and Diagnosis of Parkinson's Disease Using Multi-Task Learning","authors":"S. Emrani, Anya McGuirk, Wei Xiao","doi":"10.1145/3097983.3098065","DOIUrl":"https://doi.org/10.1145/3097983.3098065","url":null,"abstract":"Parkinson's disease (PD) is a debilitating neurodegenerative disease excessively affecting millions of patients. Early diagnosis of PD is critical as manifestation of symptoms occur many years after the onset of neurodegenration, when more than 60% of dopaminergic neurons are lost. Since there is no definite diagnosis of PD, the early management of disease is a significant challenge in the field of PD therapeutics. Therefore, identifying valid biomarkers that can characterize the progression of PD has lately received growing attentions in PD research community. In this paper, we employ a multi-task learning regression framework for prediction of Parkinson's disease progression, where each task is the prediction of PD rating scales at one future time point. We then use the model to identify the important biomarkers predictive of disease progression. We adopt a graph regularization approach to capture the relationship between different tasks and penalize large variations of the model at consecutive future time points. We have carried out comprehensive experiments using different categories of measurements at baseline from Parkinson's Progression Markers Initiative (PPMI) database to predict the severity of PD, measured by unified PD rating scale. We use the learned model to identify the biomarkers with significant contribution in prediction of PD progression. Our results confirm some of the important biomarkers identified in existing medical studies, validate some of the biomarkers that have been observed as a potential marker of PD and discover new biomarkers that have not yet been investigated.","PeriodicalId":314049,"journal":{"name":"Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122964874","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 43
Discovering Pollution Sources and Propagation Patterns in Urban Area 发现城市地区的污染源和传播模式
Xiucheng Li, Yun Cheng, G. Cong, Lisi Chen
{"title":"Discovering Pollution Sources and Propagation Patterns in Urban Area","authors":"Xiucheng Li, Yun Cheng, G. Cong, Lisi Chen","doi":"10.1145/3097983.3098090","DOIUrl":"https://doi.org/10.1145/3097983.3098090","url":null,"abstract":"Air quality is one of the most important environmental concerns in the world, and it has deteriorated substantially over the past years in many countries. For example, Chinese Academy of Social Sciences reports that the problem of haze and fog in China is hitting a record level, and China is currently suffering from the worst air pollution. Among the various causal factors of air quality, particulate matter with a diameter of 2.5 micrometers or less (i.e., PM2.5) is a very important factor; governments and people are increasingly concerned with the concentration of PM2.5. In many cities, stations for monitoring PM2.5 concentration have been built by governments or companies to monitor urban air quality. Apart from monitoring, there is a rising demand for finding pollution sources of PM2.5 and discovering the transmission of PM2.5 based on the data from PM$_{2.5}$ monitoring stations. However, to the best of our knowledge, none of previous work proposes a solution to the problem of detecting pollution sources and mining pollution propagation patterns from such monitoring data. In this work, we propose the first solution for the problem, which comprises two steps. The first step is to extract the uptrend intervals and calculate the causal strengths among spatially distributed sensors; The second step is to construct causality graphs and perform frequent subgraphs mining on these causality graphs to find pollution sources and propagation patterns. We use real-life monitoring data collected by a company in our experiments. Our experimental results demonstrate significant findings regarding pollution sources and pollutant propagations in Beijing, which will be useful for governments to make policy and govern pollution sources.","PeriodicalId":314049,"journal":{"name":"Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130338013","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 26
Finding Precursors to Anomalous Drop in Airspeed During a Flight's Takeoff 在飞机起飞时发现空速异常下降的前兆
V. Janakiraman, B. Matthews, N. Oza
{"title":"Finding Precursors to Anomalous Drop in Airspeed During a Flight's Takeoff","authors":"V. Janakiraman, B. Matthews, N. Oza","doi":"10.1145/3097983.3098097","DOIUrl":"https://doi.org/10.1145/3097983.3098097","url":null,"abstract":"Aerodynamic stall based loss of control in flight is a major cause of fatal flight accidents. In a typical takeoff, a flight's airspeed continues to increase as it gains altitude. However, in some cases, the airspeed may drop immediately after takeoff and when left uncorrected, the flight gets close to a stall condition which is extremely risky. The takeoff is a high workload period for the flight crew involving frequent monitoring, control and communication with the ground control tower. Although there exists secondary safety systems and specialized recovery maneuvers, current technology is reactive; often based on simple threshold detection and does not provide the crew with sufficient lead time. Further, with increasing complexity of automation, the crew may not be aware of the true states of the automation to take corrective actions in time. At NASA, we aim to develop decision support tools by mining historic flight data to proactively identify and manage high risk situations encountered in flight. In this paper, we present our work on finding precursors to the anomalous drop-in-airspeed (ADA) event using the ADOPT (Automatic Discovery of Precursors in Time series) algorithm. ADOPT works by converting the precursor discovery problem into a search for sub-optimal decision making in the time series data, which is modeled using reinforcement learning. We give insights about the flight data, feature selection, ADOPT modeling and results on precursor discovery. Some improvements to ADOPT algorithm are implemented that reduces its computational complexity and enables forecasting of the adverse event. Using ADOPT analysis, we have identified some interesting precursor patterns that were validated to be operationally significant by subject matter experts. The performance of ADOPT is evaluated by using the precursor scores as features to predict the drop in airspeed events.","PeriodicalId":314049,"journal":{"name":"Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132314560","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 14
Compass: Spatio Temporal Sentiment Analysis of US Election What Twitter Says! 指南针:美国大选的时空情感分析——推特在说什么!
Debjyoti Paul, Feifei Li, M. K. Teja, Xin Yu, R. Frost
{"title":"Compass: Spatio Temporal Sentiment Analysis of US Election What Twitter Says!","authors":"Debjyoti Paul, Feifei Li, M. K. Teja, Xin Yu, R. Frost","doi":"10.1145/3097983.3098053","DOIUrl":"https://doi.org/10.1145/3097983.3098053","url":null,"abstract":"With the widespread growth of various social network tools and platforms, analyzing and understanding societal response and crowd reaction to important and emerging social issues and events through social media data is increasingly an important problem. However, there are numerous challenges towards realizing this goal effectively and efficiently, due to the unstructured and noisy nature of social media data. The large volume of the underlying data also presents a fundamental challenge. Furthermore, in many application scenarios, it is often interesting, and in some cases critical, to discover patterns and trends based on geographical and/or temporal partitions, and keep track of how they will change overtime. This brings up the interesting problem of spatio-temporal sentiment analysis from large-scale social media data. This paper investigates this problem through a data science project called \"US Election 2016, What Twitter Says\". The objective is to discover sentiment on Twitter towards either the democratic or the republican party at US county and state levels over any arbitrary temporal intervals, using a large collection of geotagged tweets from a period of 6 months leading up to the US Presidential Election in 2016. Our results demonstrate that by integrating and developing a combination of machine learning and data management techniques, it ispossible to do this at scale with effective outcomes. The results of our project have the potential to be adapted towards solving and influencing other interesting social issues such as building neighborhood happiness and health indicators.","PeriodicalId":314049,"journal":{"name":"Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120933689","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 55
Anomaly Detection in Streams with Extreme Value Theory 基于极值理论的流异常检测
Alban Siffer, Pierre-Alain Fouque, A. Termier, C. Largouët
{"title":"Anomaly Detection in Streams with Extreme Value Theory","authors":"Alban Siffer, Pierre-Alain Fouque, A. Termier, C. Largouët","doi":"10.1145/3097983.3098144","DOIUrl":"https://doi.org/10.1145/3097983.3098144","url":null,"abstract":"Anomaly detection in time series has attracted considerable attention due to its importance in many real-world applications including intrusion detection, energy management and finance. Most approaches for detecting outliers rely on either manually set thresholds or assumptions on the distribution of data according to Chandola, Banerjee and Kumar. Here, we propose a new approach to detect outliers in streaming univariate time series based on Extreme Value Theory that does not require to hand-set thresholds and makes no assumption on the distribution: the main parameter is only the risk, controlling the number of false positives. Our approach can be used for outlier detection, but more generally for automatically setting thresholds, making it useful in wide number of situations. We also experiment our algorithms on various real-world datasets which confirm its soundness and efficiency.","PeriodicalId":314049,"journal":{"name":"Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125053354","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 270
Optimized Risk Scores 优化风险评分
Berk Ustun, C. Rudin
{"title":"Optimized Risk Scores","authors":"Berk Ustun, C. Rudin","doi":"10.1145/3097983.3098161","DOIUrl":"https://doi.org/10.1145/3097983.3098161","url":null,"abstract":"Risk scores are simple classification models that let users quickly assess risk by adding, subtracting, and multiplying a few small numbers. Such models are widely used in healthcare and criminal justice, but are often built ad hoc. In this paper, we present a principled approach to learn risk scores that are fully optimized for feature selection, integer coefficients, and operational constraints. We formulate the risk score problem as a mixed integer nonlinear program, and present a new cutting plane algorithm to efficiently recover its optimal solution. Our approach can fit optimized risk scores in a way that scales linearly with the sample size of a dataset, provides a proof of optimality, and obeys complex constraints without parameter tuning. We illustrate these benefits through an extensive set of numerical experiments, and an application where we build a customized risk score for ICU seizure prediction.","PeriodicalId":314049,"journal":{"name":"Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127243613","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 49
Discovering Enterprise Concepts Using Spreadsheet Tables 使用电子表格发现企业概念
Keqian Li, Yeye He, Kris Ganjam
{"title":"Discovering Enterprise Concepts Using Spreadsheet Tables","authors":"Keqian Li, Yeye He, Kris Ganjam","doi":"10.1145/3097983.3098102","DOIUrl":"https://doi.org/10.1145/3097983.3098102","url":null,"abstract":"Existing work on knowledge discovery focuses on using natural language techniques to extract entities and relationships from textual documents. However, today relational tables are abundant in quantities, and are often well-structured with coherent data values. So far these rich relational tables have been largely overlooked for the purpose of knowledge discovery. In this work, we study the problem of building concept hierarchies using a large corpus of enterprise spreadsheet tables. Our method first groups distinct values from tables into a large hierarchical tre based on co-occurrence statistics. We then \"summarize\" the large tree by selecting important tree nodes that are likely good concepts based on how well they \"describe\" the original corpus. The result is a small concept hierarchy that is easy for humans to understand and curate. Our end-to-end algorithms are designed to run on Map-Reduce and to scale to large corpus. Experiments using real enterprise spreadsheet corpus show that proposed approach can generate concepts with high quality.","PeriodicalId":314049,"journal":{"name":"Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134212797","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 14
Machine Learning for Encrypted Malware Traffic Classification: Accounting for Noisy Labels and Non-Stationarity 加密恶意软件流量分类的机器学习:考虑噪声标签和非平稳性
Blake Anderson, D. McGrew
{"title":"Machine Learning for Encrypted Malware Traffic Classification: Accounting for Noisy Labels and Non-Stationarity","authors":"Blake Anderson, D. McGrew","doi":"10.1145/3097983.3098163","DOIUrl":"https://doi.org/10.1145/3097983.3098163","url":null,"abstract":"The application of machine learning for the detection of malicious network traffic has been well researched over the past several decades; it is particularly appealing when the traffic is encrypted because traditional pattern-matching approaches cannot be used. Unfortunately, the promise of machine learning has been slow to materialize in the network security domain. In this paper, we highlight two primary reasons why this is the case: inaccurate ground truth and a highly non-stationary data distribution. To demonstrate and understand the effect that these pitfalls have on popular machine learning algorithms, we design and carry out experiments that show how six common algorithms perform when confronted with real network data. With our experimental results, we identify the situations in which certain classes of algorithms underperform on the task of encrypted malware traffic classification. We offer concrete recommendations for practitioners given the real-world constraints outlined. From an algorithmic perspective, we find that the random forest ensemble method outperformed competing methods. More importantly, feature engineering was decisive; we found that iterating on the initial feature set, and including features suggested by domain experts, had a much greater impact on the performance of the classification system. For example, linear regression using the more expressive feature set easily outperformed the random forest method using a standard network traffic representation on all criteria considered. Our analysis is based on millions of TLS encrypted sessions collected over 12 months from a commercial malware sandbox and two geographically distinct, large enterprise networks.","PeriodicalId":314049,"journal":{"name":"Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134063210","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 184
TensorFlow Estimators: Managing Simplicity vs. Flexibility in High-Level Machine Learning Frameworks TensorFlow估计器:在高级机器学习框架中管理简单性与灵活性
Heng-Tze Cheng, Zakaria Haque, Lichan Hong, M. Ispir, Clemens Mewald, Illia Polosukhin, Georgios Roumpos, D. Sculley, Jamie Smith, David Soergel, Yuan Tang, Philipp Tucker, M. Wicke, Cassandra Xia, Jianwei Xie
{"title":"TensorFlow Estimators: Managing Simplicity vs. Flexibility in High-Level Machine Learning Frameworks","authors":"Heng-Tze Cheng, Zakaria Haque, Lichan Hong, M. Ispir, Clemens Mewald, Illia Polosukhin, Georgios Roumpos, D. Sculley, Jamie Smith, David Soergel, Yuan Tang, Philipp Tucker, M. Wicke, Cassandra Xia, Jianwei Xie","doi":"10.1145/3097983.3098171","DOIUrl":"https://doi.org/10.1145/3097983.3098171","url":null,"abstract":"We present a framework for specifying, training, evaluating, and deploying machine learning models. Our focus is on simplifying cutting edge machine learning for practitioners in order to bring such technologies into production. Recognizing the fast evolution of the field of deep learning, we make no attempt to capture the design space of all possible model architectures in a domain-specific language (DSL) or similar configuration language. We allow users to write code to define their models, but provide abstractions that guide developers to write models in ways conducive to productionization. We also provide a unifying Estimator interface, making it possible to write downstream infrastructure (e.g. distributed training, hyperparameter tuning) independent of the model implementation. We balance the competing demands for flexibility and simplicity by offering APIs at different levels of abstraction, making common model architectures available out of the box, while providing a library of utilities designed to speed up experimentation with model architectures. To make out of the box models flexible and usable across a wide range of problems, these canned Estimators are parameterized not only over traditional hyperparameters, but also using feature columns, a declarative specification describing how to interpret input data We discuss our experience in using this framework in research and production environments, and show the impact on code health, maintainability, and development speed.","PeriodicalId":314049,"journal":{"name":"Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128728391","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 31
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
相关产品
×
本文献相关产品
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信