Zhe Jiang, Wenchong He, M. Kirby, S. Asiri, Dan Yan
{"title":"Weakly Supervised Spatial Deep Learning based on Imperfect Vector Labels with Registration Errors","authors":"Zhe Jiang, Wenchong He, M. Kirby, S. Asiri, Dan Yan","doi":"10.1145/3447548.3467301","DOIUrl":"https://doi.org/10.1145/3447548.3467301","url":null,"abstract":"This paper studies weakly supervised learning on spatial raster data based on imperfect vector training labels. Given raster feature imagery and imperfect (weak) vector labels with location registration errors, our goal is to learn a deep learning model for pixel classification and refine vector labels simultaneously. The problem is important in many geoscience applications such as streamline delineation and road mapping from earth imagery, where annotating imperfect coarse vector labels is far more efficient than drawing precise labels. But the problem is challenging due to the misalignment of vector labels with raster feature pixels and the need to infer true vector label location while learning neural network parameters. Existing works on weakly supervised learning often focus on noise and errors in label semantics, assuming label locations to be either correct or irrelevant (e.g., independent and identically distributed). A few works exist on label registration errors, but these methods often focus on label misalignment on object segment boundaries at the pixel level without guaranteeing vector continuity. To fill the gap, this paper proposes a spatial learning framework based on Expectation-Maximization that iteratively updates deep neural network parameters while inferring true vector label locations. Specifically, inference of true vector locations is based on both the current pixel class predictions and the geometric properties of vectors. 
Evaluations on real-world high-resolution remote sensing datasets in National Hydrography Dataset (NHD) refinement show that the proposed framework outperforms baseline methods in classification accuracy and refined vector quality.","PeriodicalId":421090,"journal":{"name":"Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining","volume":"84 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121750373","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Guanjie Zheng, P. Jenkins, Yanyan Xu, Dongyao Chen
{"title":"Overview of the 1st Workshop on City Brain Research","authors":"Guanjie Zheng, P. Jenkins, Yanyan Xu, Dongyao Chen","doi":"10.1145/3447548.3469481","DOIUrl":"https://doi.org/10.1145/3447548.3469481","url":null,"abstract":"The 1st Workshop on City Brain Research examines the current challenges and recent breakthroughs related to intelligent urban transportation. The workshop will be organized in a novel form --- offering debates on three main components involved in the transportation policy development cycle: data collection, policy learning, and the effects on human behavior. The organizers intend to invite speakers and attendees from different backgrounds, ranging from computer science, transportation, to urban planning. The final outcomes include live discussions of the three consistent topics, a comprehensive annual report summarizing current practices and future directions, and a detailed tutorial on the workshop day.","PeriodicalId":421090,"journal":{"name":"Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining","volume":"503 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127593313","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Unpaired Generative Molecule-to-Molecule Translation for Lead Optimization","authors":"Guy Barshatski, Kira Radinsky","doi":"10.1145/3447548.3467120","DOIUrl":"https://doi.org/10.1145/3447548.3467120","url":null,"abstract":"Molecular lead optimization is an important task of drug discovery focusing on generating novel molecules similar to a drug candidate but with enhanced properties. Prior works focused on supervised models requiring datasets of pairs of a molecule and an enhanced molecule. These approaches require large amounts of data and are limited by the bias of the specific examples of enhanced molecules. In this work, we present an unsupervised generative approach with a molecule-embedding component that maps a discrete representation of a molecule to a continuous space. The components are then coupled with a unique training architecture leveraging molecule fingerprints and applying double cycle constraints to enable both chemical resemblance to the original molecular lead while generating novel molecules with enhanced properties. We evaluate our method on multiple common molecular optimization tasks, including dopamine receptor (DRD2) and drug likeness (QED), and show our method outperforms previous state-of-the-art baselines. Moreover, we conduct thorough ablation experiments to show the effect and necessity of important components in our model. Furthermore, we demonstrate our method's ability to generate FDA-approved drugs it has never encountered before, such as Perazine and Clozapine, which are used to treat psychotic disorders, like Schizophrenia. 
The system is currently being deployed for use in the Targeted Drug Delivery and Personalized Medicine laboratories generating treatments using nanoparticle-based technology.","PeriodicalId":421090,"journal":{"name":"Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining","volume":"116 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128048407","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MLHat: Deployable Machine Learning for Security Defense","authors":"Gang Wang, A. Ciptadi, Aliakbar Ahmadzadeh","doi":"10.1145/3447548.3469463","DOIUrl":"https://doi.org/10.1145/3447548.3469463","url":null,"abstract":"The MLHat workshop aims to bring together academic researchers and industry practitioners to discuss the open challenges, potential solutions, and best practices for deploying machine learning at scale for security defense. The workshop will discuss related topics from both the defender (white-hat) and the attacker (black-hat) perspectives. We call the workshop MLHat, to serve as a place for people who are interested in using machine learning to solve practical security problems. The workshop will focus on defining new machine learning paradigms under various security application contexts and identifying exciting new future research directions. At the same time, the workshop will also have a strong industry presence to provide insights into the challenges in deploying and maintaining machine learning models and the much-needed discussion on the capabilities that the state of the art has failed to provide.","PeriodicalId":421090,"journal":{"name":"Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115773708","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Computing a Near-Maximum Weighted Independent Set on Massive Graphs","authors":"Jiewei Gu, Weiguo Zheng, Yuzheng Cai, Peng Peng","doi":"10.1145/3447548.3467232","DOIUrl":"https://doi.org/10.1145/3447548.3467232","url":null,"abstract":"The vertices in many graphs are weighted unequally in real scenarios, but the previous studies on the maximum independent set (MIS) ignore the weights of vertices. Therefore, the weight of an MIS may not necessarily be the largest. In this paper, we study the problem of maximum weighted independent set (MWIS) that is defined as the set of independent vertices with the largest weight. Since it is intractable to deliver the exact solution for large graphs, we design a reducing and tie-breaking framework to compute a near-maximum weighted independent set. The reduction rules are critical to reduce the search space for both exact and greedy algorithms as they determine the vertices that are definitely in (or definitely not in) the MWIS while preserving the correctness of solutions. We devise a set of novel reductions including low-degree reductions and high-degree reductions for general weighted graphs. Extensive experimental studies over real graphs confirm that our proposed method significantly outperforms the state of the art in terms of both effectiveness and efficiency.","PeriodicalId":421090,"journal":{"name":"Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining","volume":"191 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115854943","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Benjamin Han, D. Burdick, David Bruce Lewis, Yijuan Lu, Hamid Motahari, Sandeep Tata
{"title":"DI-2021: The Second Document Intelligence Workshop","authors":"Benjamin Han, D. Burdick, David Bruce Lewis, Yijuan Lu, Hamid Motahari, Sandeep Tata","doi":"10.1145/3447548.3469454","DOIUrl":"https://doi.org/10.1145/3447548.3469454","url":null,"abstract":"Business documents are central to the operation of all organizations, and they come in all shapes and sizes: project reports, planning documents, technical specifications, financial statements, meeting minutes, legal agreements, contracts, resumes, purchase orders, invoices, and many more. The ability to read, understand and interpret these documents, referred to here as Document Intelligence (DI), is challenging due to not only many domains of knowledge involved, but also their complex formats and structures, internal and external cross references deployed, and even less-than-ideal quality of scans and OCR oftentimes performed on them. This workshop aims to explore and advance the current state of research and practice in answering these challenges.","PeriodicalId":421090,"journal":{"name":"Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131360541","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Márcia Barros, Francisco M. Couto, Matilde Pato, Pedro Ruas
{"title":"Creating Recommender Systems Datasets in Scientific Fields","authors":"Márcia Barros, Francisco M. Couto, Matilde Pato, Pedro Ruas","doi":"10.1145/3447548.3470805","DOIUrl":"https://doi.org/10.1145/3447548.3470805","url":null,"abstract":"Recommender systems (RS) have been successfully explored in a vast number of domains, e.g. movies and TV shows, music, or e-commerce. In these domains we have a large number of datasets freely available for testing and evaluating new recommender algorithms. For example, Movielens and Netflix datasets for movies, Spotify for music, and Amazon for e-commerce, which translates into a large number of algorithms applied to these fields. In scientific fields, such as Health and Chemistry, standard and open access datasets with the information about the preferences of the users are scarce. First, it is important to understand the application domain, i.e. \"what the recommended item is\". Second, who are the end users: researchers, pharmacists, clinicians or policy makers. Third, the availability of data. Thus, if we wish to develop an algorithm for recommending scientific items, we do not have access to datasets with information about the past preferences of a group of users. Given this limitation, we developed a methodology, called LIBRETTI - LIterature Based RecommEndaTion of scienTific Items, whose goal is the creation of datasets related to scientific fields. These datasets are created based on the major resource of knowledge that Science has: scientific literature. We consider the users as the authors of the publications, the items as the scientific entities (for example chemical compounds or diseases), and the ratings as the number of publications an author wrote about an entity. 
In this tutorial we will cover state-of-the-art recommender systems in scientific fields, explain what Named Entity Recognition/Linking (NER/NEL) in research literature is, and demonstrate how to create a dataset for recommending drugs and diseases through research literature related to COVID-19. Our goal is to spread the use of the LIBRETTI methodology in order to help in the development of recommender algorithms in scientific fields. More info about the tutorial at https://lasigebiotm.github.io/RecSys.Scifi/.","PeriodicalId":421090,"journal":{"name":"Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130452990","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yue He, Yancheng Dong, Peng Cui, Yuhang Jiao, Xiaowei Wang, Ji Liu, Philip S. Yu
{"title":"Purify and Generate: Learning Faithful Item-to-Item Graph from Noisy User-Item Interaction Behaviors","authors":"Yue He, Yancheng Dong, Peng Cui, Yuhang Jiao, Xiaowei Wang, Ji Liu, Philip S. Yu","doi":"10.1145/3447548.3467205","DOIUrl":"https://doi.org/10.1145/3447548.3467205","url":null,"abstract":"Matching is almost the first and most fundamental step in recommender systems, that is, to quickly select hundreds or thousands of related entities from the whole commodity pool. Among all the matching methods, item-to-item (I2I) graph based matching is a handy and highly effective approach and is widely used in most applications, owing to the essential relationships of entities described in a powerful I2I graph. Yet, the I2I graph is not a ready-made product in a data source. To obtain it from users' behaviors, a common practice in the industry is to construct the graph based on the similarity of item embeddings or co-occurrence frequency directly. However, these methods tend to lose the complicated correlations (high-order or nonlinear) inside decision-making actions and cannot achieve a globally optimal solution. Moreover, the correlations between items are usually contained in users' short-term actions, which are full of noise (e.g. spurious associations, missing connections). It is vitally important to filter out noise while generating the graph. In this paper, we propose a novel framework called Purified Graph Generation (PGG) dedicated to learning a faithful I2I graph from sparse and noisy behavior data. We capture the 'confidence value' between user and item to filter out exceptional actions during decision making, and leverage it to re-sample purified sets that are fed into an unsupervised I2I graph structure learning framework called GPBG. 
Extensive experimental results from both simulation and real data demonstrate that our method could significantly benefit the performance of I2I graph compared to the typical baselines.","PeriodicalId":421090,"journal":{"name":"Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134131775","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep Generative Models for Spatial Networks","authors":"Xiaojie Guo, Yuanqi Du, Liang Zhao","doi":"10.1145/3447548.3467394","DOIUrl":"https://doi.org/10.1145/3447548.3467394","url":null,"abstract":"Spatial networks represent crucial data structures where the nodes and edges are embedded in a geometric space. Nowadays, spatial network data is becoming increasingly popular and important, ranging from micro-scale (e.g., protein structures), to middle-scale (e.g., biological neural networks), to macro-scale (e.g., mobility networks). Although modeling and understanding the generative process of spatial networks are very important, they remain largely under-explored due to the significant challenges in automatically modeling and distinguishing the independence and correlation among various spatial and network factors. To address these challenges, we first propose a novel objective for joint spatial-network disentanglement from the perspective of information bottleneck as well as a novel optimization algorithm to optimize the intractable objective. Based on this, a spatial-network variational autoencoder (SND-VAE) with a new spatial-network message passing neural network (S-MPNN) is proposed to discover the independent and dependent spatial and network latent factors. 
Qualitative and quantitative experiments on both synthetic and real-world datasets demonstrate the superiority of the proposed model over the state of the art by up to 66.9% for graph generation and 37.3% for interpretability.","PeriodicalId":421090,"journal":{"name":"Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133915628","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ODD: Outlier Detection and Description","authors":"Siddharth Bhatia, Bryan Hooi, L. Akoglu, Sourav Chatterjee, Xiaodong Jiang, Manish Gupta","doi":"10.1145/3447548.3469483","DOIUrl":"https://doi.org/10.1145/3447548.3469483","url":null,"abstract":"We propose to organize the 6th ODD workshop at KDD 2021, following the successful series of the past five ODD Workshops that have been organized at KDD 2013, KDD 2014, KDD 2015, KDD 2016, and KDD 2018.","PeriodicalId":421090,"journal":{"name":"Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining","volume":"378 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131767378","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}