Tim Weninger, Rodrigo Palácios, Valter Crescenzi, Thomas Gottron, P. Merialdo
{"title":"Web Content Extraction: a MetaAnalysis of its Past and Thoughts on its Future","authors":"Tim Weninger, Rodrigo Palácios, Valter Crescenzi, Thomas Gottron, P. Merialdo","doi":"10.1145/2897350.2897353","DOIUrl":"https://doi.org/10.1145/2897350.2897353","url":null,"abstract":"In this paper, we present a meta-analysis of several Web content extraction algorithms, and make recommendations for the future of content extraction on the Web. First, we find that nearly all Web content extractors do not consider a very large, and growing, portion of modern Web pages. Second, it is well understood that wrapper induction extractors tend to break as the Web changes; heuristic/feature engineering extractors were thought to be immune to a Web site's evolution, but we find that this is not the case: heuristic content extractor performance also tends to degrade over time due to the evolution of Web site forms and practices. We conclude with recommendations for future work that address these and other findings.","PeriodicalId":90050,"journal":{"name":"SIGKDD explorations : newsletter of the Special Interest Group (SIG) on Knowledge Discovery & Data Mining","volume":"75 1","pages":"17-23"},"PeriodicalIF":0.0,"publicationDate":"2015-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74203747","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"New Research Directions in Knowledge Discovery and Allied Spheres","authors":"A. Nica, Fabian M. Suchanek, A. Varde","doi":"10.1145/2783702.2783708","DOIUrl":"https://doi.org/10.1145/2783702.2783708","url":null,"abstract":"The realm of knowledge discovery extends across several allied spheres today. It encompasses database management areas such as data warehousing and schema versioning; information retrieval areas such as Web semantics and topic detection; and core data mining areas, e.g., knowledge based systems, uncertainty management, and time-series mining. This becomes particularly evident in the topics that Ph.D. students choose for their dissertation. As the grass roots of research, Ph.D. dissertations point out new avenues of research, and provide fresh viewpoints on combinations of known fields. In this article we overview some recently proposed developments in the domain of knowledge discovery and its related spheres. Our article is based on the topics presented at the doctoral workshop of the ACM Conference on Information and Knowledge Management, CIKM 2011.","PeriodicalId":90050,"journal":{"name":"SIGKDD explorations : newsletter of the Special Interest Group (SIG) on Knowledge Discovery & Data Mining","volume":"1 1","pages":"46-49"},"PeriodicalIF":0.0,"publicationDate":"2015-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84152186","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
D. F. Bernardes, M. Diaby, Raphaël Fournier-S’niehotta, F. Fogelman-Soulié, E. Viennet
{"title":"A Social Formalism and Survey for Recommender Systems","authors":"D. F. Bernardes, M. Diaby, Raphaël Fournier-S’niehotta, F. Fogelman-Soulié, E. Viennet","doi":"10.1145/2783702.2783705","DOIUrl":"https://doi.org/10.1145/2783702.2783705","url":null,"abstract":"This paper presents a general formalism for Recommender Systems based on Social Network Analysis. After introducing the classical categories of recommender systems, we present our Social Filtering formalism and show that it extends association rules, classical Collaborative Filtering and Social Recommendation, while providing additional possibilities. This allows us to survey the literature and illustrate the versatility of our approach on various publicly available datasets, comparing our results with the literature.","PeriodicalId":90050,"journal":{"name":"SIGKDD explorations : newsletter of the Special Interest Group (SIG) on Knowledge Discovery & Data Mining","volume":"4 1","pages":"20-37"},"PeriodicalIF":0.0,"publicationDate":"2015-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78301037","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Data Problem in Data Mining","authors":"Albrecht Zimmermann","doi":"10.1145/2783702.2783706","DOIUrl":"https://doi.org/10.1145/2783702.2783706","url":null,"abstract":"Computer science is essentially an applied or engineering science, creating tools. In Data Mining, those tools are supposed to help humans understand large amounts of data. In this position paper, I argue that for all the progress that has been made in Data Mining, in particular Pattern Mining, we are lacking insight into three key aspects: 1) How pattern mining algorithms perform quantitatively, 2) How to choose parameter settings, and 3) How to relate found patterns to the processes that generated the data. I illustrate the issue by surveying existing work in light of these concerns and pointing to the (relatively few) papers that have attempted to fill in the gaps. I argue further that progress regarding those questions is held back by a lack of data with varying, controlled properties, and that this lack is unlikely to be remedied by the ever increasing collection of real-life data. Instead, I am convinced that we will need to make a science of digital data generation, and use it to develop guidance to data practitioners.","PeriodicalId":90050,"journal":{"name":"SIGKDD explorations : newsletter of the Special Interest Group (SIG) on Knowledge Discovery & Data Mining","volume":"21 1","pages":"38-45"},"PeriodicalIF":0.0,"publicationDate":"2015-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81762052","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Patent Mining: A Survey","authors":"Longhui Zhang, Lei Li, Tao Li","doi":"10.1145/2783702.2783704","DOIUrl":"https://doi.org/10.1145/2783702.2783704","url":null,"abstract":"Patent documents are important intellectual resources for protecting the interests of individuals, organizations and companies. Different from general web documents, patent documents have a well-defined format including front page, description, claims, and figures. However, they are lengthy and rich in technical terms, which requires enormous human effort for analysis. Hence, a new research area, called patent mining, has emerged in recent years, aiming to assist patent analysts in investigating, processing, and analyzing patent documents. Despite the recent advances in patent mining, it is still far from being well explored in research communities. To help patent analysts and interested readers obtain a big picture of patent mining, we thus provide a systematic summary of existing research efforts along this direction. In this survey, we first present an overview of the technical trend in patent mining. We then investigate multiple research questions related to patent documents, including patent retrieval, patent classification, and patent visualization, and provide summaries and highlights for each question by delving into the corresponding research efforts.","PeriodicalId":90050,"journal":{"name":"SIGKDD explorations : newsletter of the Special Interest Group (SIG) on Knowledge Discovery & Data Mining","volume":"77 1","pages":"1-19"},"PeriodicalIF":0.0,"publicationDate":"2015-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76162868","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yaliang Li, Jing Gao, Chuishi Meng, Qi Li, Lu Su, Bo Zhao, Wei Fan, Jiawei Han
{"title":"A Survey on Truth Discovery","authors":"Yaliang Li, Jing Gao, Chuishi Meng, Qi Li, Lu Su, Bo Zhao, Wei Fan, Jiawei Han","doi":"10.1145/2897350.2897352","DOIUrl":"https://doi.org/10.1145/2897350.2897352","url":null,"abstract":"Thanks to information explosion, data for the objects of interest can be collected from increasingly more sources. However, for the same object, there usually exist conflicts among the collected multi-source information. To tackle this challenge, truth discovery, which integrates multi-source noisy information by estimating the reliability of each source, has emerged as a hot topic. Several truth discovery methods have been proposed for various scenarios, and they have been successfully applied in diverse application domains. In this survey, we focus on providing a comprehensive overview of truth discovery methods, and summarizing them from different aspects. We also discuss some future directions of truth discovery research. We hope that this survey will promote a better understanding of the current progress on truth discovery, and offer some guidelines on how to apply these approaches in application domains.","PeriodicalId":90050,"journal":{"name":"SIGKDD explorations : newsletter of the Special Interest Group (SIG) on Knowledge Discovery & Data Mining","volume":"1 1","pages":"1-16"},"PeriodicalIF":0.0,"publicationDate":"2015-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88331015","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Références bibliographiques","authors":"H. Pornon","doi":"10.3917/dunod.porno.2015.02.0295","DOIUrl":"https://doi.org/10.3917/dunod.porno.2015.02.0295","url":null,"abstract":"","PeriodicalId":90050,"journal":{"name":"SIGKDD explorations : newsletter of the Special Interest Group (SIG) on Knowledge Discovery & Data Mining","volume":"27 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2015-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74677265","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Oshini Goonetilleke, T. Sellis, Xiuzhen Zhang, Saket K. Sathe
{"title":"Twitter analytics: a big data management perspective","authors":"Oshini Goonetilleke, T. Sellis, Xiuzhen Zhang, Saket K. Sathe","doi":"10.1145/2674026.2674029","DOIUrl":"https://doi.org/10.1145/2674026.2674029","url":null,"abstract":"With the inception of the Twitter microblogging platform in 2006, a myriad of research efforts have emerged studying different aspects of the Twittersphere. Each study exploits its own tools and mechanisms to capture, store, query and analyze Twitter data. Inevitably, platforms have been developed to replace this ad-hoc exploration with a more structured and methodological form of analysis. Another body of literature focuses on developing languages for querying Tweets. This paper addresses issues around the big data nature of Twitter and emphasizes the need for new data management and query language frameworks that address limitations of existing systems. We review existing approaches that were developed to facilitate Twitter analytics, followed by a discussion of research issues and technical challenges in developing integrated solutions.","PeriodicalId":90050,"journal":{"name":"SIGKDD explorations : newsletter of the Special Interest Group (SIG) on Knowledge Discovery & Data Mining","volume":"48 1","pages":"11-20"},"PeriodicalIF":0.0,"publicationDate":"2014-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86621875","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rohit Babbar, Cornelia Metzig, Ioannis Partalas, Éric Gaussier, Massih-Reza Amini
{"title":"On power law distributions in large-scale taxonomies","authors":"Rohit Babbar, Cornelia Metzig, Ioannis Partalas, Éric Gaussier, Massih-Reza Amini","doi":"10.1145/2674026.2674033","DOIUrl":"https://doi.org/10.1145/2674026.2674033","url":null,"abstract":"Fat-tailed distributions occur in many large-scale physical and social complex systems, and different generating mechanisms have been proposed for them. In this paper, we study models of generating power law distributions in the evolution of large-scale taxonomies such as Open Directory Project, which consist of websites assigned to one of tens of thousands of categories. The categories in such taxonomies are arranged in tree or DAG structured configurations having parent-child relations among them. We first quantitatively analyse the formation process of such taxonomies, which leads to power law distributions as the stationary distributions. In the context of designing classifiers for large-scale taxonomies, which automatically assign unseen documents to leaf-level categories, we highlight how the fat-tailed nature of these distributions can be leveraged to analytically study the space complexity of such classifiers. Empirical evaluation of the space complexity on publicly available datasets demonstrates the applicability of our approach.","PeriodicalId":90050,"journal":{"name":"SIGKDD explorations : newsletter of the Special Interest Group (SIG) on Knowledge Discovery & Data Mining","volume":"30 1","pages":"47-56"},"PeriodicalIF":0.0,"publicationDate":"2014-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81099743","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Change detection in streaming data in the era of big data: models and issues","authors":"Dang-Hoan Tran, Mohamed Medhat Gaber, K. Sattler","doi":"10.1145/2674026.2674031","DOIUrl":"https://doi.org/10.1145/2674026.2674031","url":null,"abstract":"Big Data is identified by its three Vs, namely velocity, volume, and variety. The area of data stream processing has long dealt with the first two Vs, velocity and volume. Over a decade of intensive research, the community has provided many important research discoveries in the area. The third V of Big Data has been the result of social media and the large unstructured data it generates. Streaming techniques have also been proposed recently addressing this emerging need. However, a hidden factor can represent an important fourth V, that is variability or change. Our world is changing rapidly, and accounting for variability is a crucial success factor. This paper provides a survey of change detection techniques as applied to streaming data. The review is timely with the rise of Big Data technologies, and the need to have this important aspect highlighted and its techniques categorized and detailed.","PeriodicalId":90050,"journal":{"name":"SIGKDD explorations : newsletter of the Special Interest Group (SIG) on Knowledge Discovery & Data Mining","volume":"124 1","pages":"30-38"},"PeriodicalIF":0.0,"publicationDate":"2014-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78176968","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}