C. Bonacic, Danilo Bustos, V. Gil-Costa, Mauricio Marín, Victor Sepulveda
{"title":"Multithreaded Processing in Dynamic Inverted Indexes for Web Search Engines","authors":"C. Bonacic, Danilo Bustos, V. Gil-Costa, Mauricio Marín, Victor Sepulveda","doi":"10.1145/2809948.2809952","DOIUrl":"https://doi.org/10.1145/2809948.2809952","url":null,"abstract":"Processing queries in Web search engines demands the efficient use of hardware resources to cope with the scale and dynamics of user traffic. This paper focuses on the multithreaded processing of queries that requires (1) accessing a large inverted index data structure to obtain a set of documents, (2) ranking them by executing the WAND operator in order to obtain the top K most pertinent documents for the query, and (3) resolving the insertion of new documents on the inverted index concurrently with the execution of queries. We propose an efficient strategy to assign threads to queries and index update operations which is suitable to support updates on the index concurrently with query processing. The core of our proposal is a simple classification technique devised to quickly assign threads to query operations.","PeriodicalId":142249,"journal":{"name":"Proceedings of the 2015 Workshop on Large-Scale and Distributed System for Information Retrieval","volume":"31 9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131581084","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Large-Scale Real-Time Data Management for Engagement and Monetization","authors":"Simon Jonassen","doi":"10.1145/2809948.2809953","DOIUrl":"https://doi.org/10.1145/2809948.2809953","url":null,"abstract":"Cxense helps companies understand their audience and build great online experiences. Cxense Insight and DMP let customers annotate, filter, segment and target their users based on the consumed content and performed actions in real-time. With more than 5000 active websites, Insight alone tracks more than a billion unique users with more than 15 billion page views per month. To leverage the huge amounts of data in real-time, we have built a large distributed system relying on techniques familiar from databases, information retrieval and data mining. In this talk, we outline our solutions and give some insight into the technology we use and the challenges we face. This introduction should be interesting to undergraduate and PhD students as well as experienced researchers and engineers.","PeriodicalId":142249,"journal":{"name":"Proceedings of the 2015 Workshop on Large-Scale and Distributed System for Information Retrieval","volume":"2204 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130141575","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Morning Session","authors":"B. B. Cambazoglu","doi":"10.1145/3257840","DOIUrl":"https://doi.org/10.1145/3257840","url":null,"abstract":"","PeriodicalId":142249,"journal":{"name":"Proceedings of the 2015 Workshop on Large-Scale and Distributed System for Information Retrieval","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123359610","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Large-scale Efficient and Effective Video Similarity Search","authors":"M. S. Uysal, C. Beecks, Daniel Sabinasz, T. Seidl","doi":"10.1145/2809948.2809950","DOIUrl":"https://doi.org/10.1145/2809948.2809950","url":null,"abstract":"Recently, the rich diversity of the video capture devices and the high usage of the Internet have generated a great amount of video data, which attracts the attention of researchers with respect to the development of novel effective and efficient video retrieval approaches. In this paper, we investigate the effectiveness and efficiency of the lower-bounding filter distance functions of the well-known similarity measure Earth Mover's Distance (EMD) on signature databases, including the recently introduced Independent Minimization for Signatures (IM-Sig). We conduct the experiments on a public dataset comprising various categories with visually similar videos, and another large-scale real world video dataset consisting of 350,000 near-duplicate videos. To the best of our knowledge, this is the first work investigating the effectiveness and efficiency of the lower-bounding filter distance functions on databases consisting of signatures, i.e., adaptive-binned representations. The experimental evaluation indicates both high effectiveness and efficiency of the IM-Sig, outperforming the state-of-the-art techniques.","PeriodicalId":142249,"journal":{"name":"Proceedings of the 2015 Workshop on Large-Scale and Distributed System for Information Retrieval","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128180530","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Count or Not to Count: Counting Challenges for Big Spatial Data Analytics","authors":"E. Tanin, Hairuo Xie","doi":"10.1145/2809948.2809954","DOIUrl":"https://doi.org/10.1145/2809948.2809954","url":null,"abstract":"Counts of objects are important for big data analytics. However, spatial objects do not work well with counts. We present the latest developments on the distinct counting problem. In particular, we explain Euler histograms, which are a category of spatial data structures that address the distinct counting challenges. Euler histograms support traditional counting queries as well as other query types.","PeriodicalId":142249,"journal":{"name":"Proceedings of the 2015 Workshop on Large-Scale and Distributed System for Information Retrieval","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133112495","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Afternoon Session","authors":"B. B. Cambazoglu","doi":"10.1145/3257841","DOIUrl":"https://doi.org/10.1145/3257841","url":null,"abstract":"","PeriodicalId":142249,"journal":{"name":"Proceedings of the 2015 Workshop on Large-Scale and Distributed System for Information Retrieval","volume":"163 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130986196","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving Dynamic Index Pruning via Linear Programming","authors":"Simon Jonassen","doi":"10.1145/2809948.2809951","DOIUrl":"https://doi.org/10.1145/2809948.2809951","url":null,"abstract":"Dynamic index pruning techniques are commonly used to speed up query processing in Web search engines. In this work, we propose a linear programming technique which can further improve the performance of the state-of-the-art dynamic index pruning techniques. The experiments we conducted demonstrate that the proposed technique achieves a reduction in disk access, index decompression, and scoring costs compared to the well-known Max-Score technique.","PeriodicalId":142249,"journal":{"name":"Proceedings of the 2015 Workshop on Large-Scale and Distributed System for Information Retrieval","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128853646","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Distributed Algorithm for Relationship Queries on Large Graphs","authors":"P. Agarwal, Maya Ramanath, Gautam M. Shroff","doi":"10.1145/2809948.2809949","DOIUrl":"https://doi.org/10.1145/2809948.2809949","url":null,"abstract":"Massive-sized graph-structured data is now ubiquitous, e.g., social networks, databases, knowledge-bases, web-graphs, etc. An important class of queries on graph-structured data is \"relationship queries\". Essentially, given a set of entities (corresponding to nodes in the graph), the task is to find a ranked list of interesting interconnections among them. While this problem has been studied for many years, the solutions proposed in the literature so far focus on the non-distributed setting. Clearly, such solutions will not scale with large graphs having billions of nodes and edges that are becoming commonplace. In this paper, we present an algorithm for keyword search on large graphs, which is based on the distributed parallel processing paradigm. We also analyze why our algorithm generates optimal answers. Finally, we report on preliminary empirical results of relationship queries on a subset of the Linked-Open Data graph.","PeriodicalId":142249,"journal":{"name":"Proceedings of the 2015 Workshop on Large-Scale and Distributed System for Information Retrieval","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114344956","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proceedings of the 2015 Workshop on Large-Scale and Distributed System for Information Retrieval","authors":"I. S. Altingovde, B. B. Cambazoglu, N. Tonellotto","doi":"10.1145/2809948","DOIUrl":"https://doi.org/10.1145/2809948","url":null,"abstract":"The publication date is one day earlier than the EST date to provide the proceedings to attendees in Australia on the first day of the conference. The LSDS-IR workshop series aims to attract researchers from the industry and academia to present and discuss problems, ideas, and recent research results related to the performance of large-scale and distributed information retrieval systems. The workshop plays an important role in the information retrieval community as a venue where early work addressing the workshop's topics is discussed and matured. The LSDS-IR'15 workshop continues the efforts of the following workshops organized in the past: P2PIR (collocated with SIGIR'05, CIKM'06, and CIKM'07), HDIR (collocated with SIGIR'08), and LSDS-IR (collocated with SIGIR'07, CIKM'08, SIGIR'09, SIGIR'10, CIKM'11, WSDM'13, WSDM'14). As in the previous years, the workshop provides space for researchers to discuss the scalability and efficiency issues in large-scale and distributed information retrieval systems and to define new directions for the field. This year's LSDS-IR workshop has attracted five submissions from Europe (Sweden, Germany, Russia), Asia (India), and South America (Chile). Three of these submissions were accepted for presentation as long papers, and one submission was accepted for presentation as a short paper. The workshop program also includes the following two invited talks: \"Large-Scale Real-Time Data Management for Engagement and Monetization\", Simon Jonassen (Cxense), \"Count or Not to Count: Counting Challenges for Big Spatial Data Analytics\", Egemen Tanin (University of Melbourne).","PeriodicalId":142249,"journal":{"name":"Proceedings of the 2015 Workshop on Large-Scale and Distributed System for Information Retrieval","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123634110","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}