{"title":"A term dependency-based approach for query terms ranking","authors":"Chia-Jung Lee, Ruey-Cheng Chen, Shao-Hang Kao, Pu-Jen Cheng","doi":"10.1145/1645953.1646114","DOIUrl":"https://doi.org/10.1145/1645953.1646114","url":null,"abstract":"Formulating appropriate and effective queries has been regarded as a challenging issue, since a large number of candidate words or phrases could be chosen as query terms to convey users' information needs. In this paper, we propose an approach to rank a set of given query terms according to their effectiveness, wherein top ranked terms will be selected as an effective query. Our ranking approach exploits and benefits from the underlying relationship between the query terms, and thereby the effective terms can be properly combined into the query. Two regression models which capture a rich set of linguistic and statistical properties are used in our approach. Experiments on NTCIR-4 ad-hoc retrieval tasks demonstrate that the proposed approach can significantly improve retrieval performance, and can be well applied to other problems such as query expansion and querying by text segments.","PeriodicalId":286251,"journal":{"name":"Proceedings of the 18th ACM conference on Information and knowledge management","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117213851","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Spatio-temporal association rule mining framework for real-time sensor network applications","authors":"H. Chok, L. Gruenwald","doi":"10.1145/1645953.1646224","DOIUrl":"https://doi.org/10.1145/1645953.1646224","url":null,"abstract":"In this paper, we present a data mining framework to estimate missing or corrupted data in sensor network applications - a frequently occurring phenomenon in this domain. The framework is naturally germane to the spatio-temporal analysis of relational data stream evolution. Our method utilizes association rules to capture spatio-temporal correlations in multivariate, dynamically evolving, and unbounded sensor data streams. Existing approaches that tackled this problem do not account for the multi-dimensionality of the node data and their relationship; furthermore they entail simplistic and/or premature assumptions on the temporal and spatial factors to overcome the complexity of the streaming environment. Our technique, called Mining Autonomously Spatio-Temporal Environmental Rules (MASTER), comprehensively formulates the problem of mining patterns in sensor data streams, and yet remains provably adaptive to bounded time and space costs while probabilistically assuring a bounded estimation error. Simulation experiments show MASTER's efficiency in terms of overhead as well as the quality of estimation.","PeriodicalId":286251,"journal":{"name":"Proceedings of the 18th ACM conference on Information and knowledge management","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117222757","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-aspect opinion polling from textual reviews","authors":"Jingbo Zhu, Huizhen Wang, Benjamin Ka-Yin T'sou, Muhua Zhu","doi":"10.1145/1645953.1646233","DOIUrl":"https://doi.org/10.1145/1645953.1646233","url":null,"abstract":"This paper presents an unsupervised approach to aspect-based opinion polling from raw textual reviews without explicit ratings. The key contribution of this paper is three-fold. First, a multi-aspect bootstrapping algorithm is proposed to learn from unlabeled data aspect-related terms of each aspect to be used for aspect identification. Second, an unsupervised segmentation model is proposed to address the challenge of identifying multiple single-aspect units in a multi-aspect sentence. Finally, an aspect-based opinion polling algorithm is presented. Experiments on real Chinese restaurant reviews show that our opinion polling method can achieve 75.5% precision performance.","PeriodicalId":286251,"journal":{"name":"Proceedings of the 18th ACM conference on Information and knowledge management","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132675559","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mining data streams with periodically changing distributions","authors":"Yingying Tao, M. Tamer Özsu","doi":"10.1145/1645953.1646065","DOIUrl":"https://doi.org/10.1145/1645953.1646065","url":null,"abstract":"Dynamic data streams are those whose underlying distribution changes over time. They occur in a number of application domains, and mining them is important for these applications. Coupled with the unboundedness and high arrival rates of data streams, the dynamism of the underlying distribution makes data mining challenging. In this paper, we focus on a large class of dynamic streams that exhibit periodicity in distribution changes. We propose a framework, called DMM, for mining this class of streams that includes a new change detection technique and a novel match-and-reuse approach. Once a distribution change is detected, we compare the new distribution with a set of historically observed distribution patterns and use the mining results from the past if a match is detected. Since, for two highly similar distributions, their mining results should also present high similarity, by matching and reusing existing mining results, the overall stream mining efficiency is improved while the accuracy is maintained. Our experimental results confirm this conjecture.","PeriodicalId":286251,"journal":{"name":"Proceedings of the 18th ACM conference on Information and knowledge management","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131868458","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Pure spreading activation is pointless","authors":"M. Berthold, U. Brandes, Tobias Kötter, Martin Mader, U. Nagel, Kilian Thiel","doi":"10.1145/1645953.1646264","DOIUrl":"https://doi.org/10.1145/1645953.1646264","url":null,"abstract":"Almost every application of spreading activation is accompanied by its own set of often heuristic restrictions on the dynamics. We show that in constraint-free scenarios spreading activation would actually yield query-independent results, so that the specific choice of restrictions is not only a pragmatic computational issue, but crucially determines the outcome.","PeriodicalId":286251,"journal":{"name":"Proceedings of the 18th ACM conference on Information and knowledge management","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132225018","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Space-economical partial gram indices for exact substring matching","authors":"N. Tang, Lefteris Sidirourgos, P. Boncz","doi":"10.1145/1645953.1645992","DOIUrl":"https://doi.org/10.1145/1645953.1645992","url":null,"abstract":"Exact substring matching queries on large data collections can be answered using q-gram indices, that store for each occurring q-byte pattern an (ordered) posting list with the positions of all occurrences. Such gram indices are known to provide fast query response time and to allow the index to be created quickly even on huge disk-based datasets. Their main drawback is relatively large storage space, that is a constant multiple (typically >2) of the original data size, even when compression is used. In this work, we study methods to conserve the scalable creation time and efficient exact substring query properties of gram indices, while reducing storage space. To this end, we first propose a partial gram index based on a reduction from the problem of omitting indexed q-grams to the set cover problem. While this method is successful in reducing the size of the index, it generates false positives at query time, reducing efficiency. We then increase the accuracy of partial grams by splitting posting lists of frequent grams in a frequency-tuned set of signatures that take the bytes surrounding the grams into account. The resulting qs-gram scheme is tested on huge collections (up to 426GB) and is shown to achieve an almost 1:1 data:index size, and query performance even faster than normal gram methods, thanks to the reduced size and access cost.","PeriodicalId":286251,"journal":{"name":"Proceedings of the 18th ACM conference on Information and knowledge management","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134319884","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Feature selection for ranking using boosted trees","authors":"Feng Pan, Tim Converse, David Ahn, F. Salvetti, Gianluca Donato","doi":"10.1145/1645953.1646292","DOIUrl":"https://doi.org/10.1145/1645953.1646292","url":null,"abstract":"Modern search engines have to be fast to satisfy users, so there are hard back-end latency requirements. The set of features useful for search ranking functions, though, continues to grow, making feature computation a latency bottleneck. As a result, not all available features can be used for ranking, and in fact, much of the time, only a small percentage of these features can be used. Thus, it is crucial to have a feature selection mechanism that can find a subset of features that both meets latency requirements and achieves high relevance. To this end, we explore different feature selection methods using boosted regression trees, including both greedy approaches (selecting the features with highest relative importance as computed by boosted trees; discounting importance by feature similarity) and a randomized approach. We evaluate and compare these approaches using data from a commercial search engine. The experimental results show that the proposed randomized feature selection with feature-importance-based backward elimination outperforms greedy approaches and achieves a comparable relevance with 30 features to a full-feature model trained with 419 features and the same modeling parameters.","PeriodicalId":286251,"journal":{"name":"Proceedings of the 18th ACM conference on Information and knowledge management","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134343426","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: KM information extraction II","authors":"R. Wong","doi":"10.1145/3261222","DOIUrl":"https://doi.org/10.1145/3261222","url":null,"abstract":"","PeriodicalId":286251,"journal":{"name":"Proceedings of the 18th ACM conference on Information and knowledge management","volume":"177 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133497800","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Injecting purpose and trust into data anonymisation","authors":"Xiaoxun Sun, Hua Wang, Jiuyong Li","doi":"10.1145/1645953.1646166","DOIUrl":"https://doi.org/10.1145/1645953.1646166","url":null,"abstract":"Most existing works on data anonymisation target the optimization of the anonymisation metrics to balance the data utility and privacy, whereas they ignore the effects of a requester's trust level and application purposes during the data anonymisation. Our aim in this paper is to propose a much finer level anonymisation scheme with regard to the data requester's trust value and specific application purpose. We prioritize the attributes for anonymisation based on how important and critical they are to the specified application purposes and propose a trust evaluation strategy to quantify the data requester's reliability, and further build the projection between the trust value and the degree of data anonymisation, which is intended to determine to what extent the data should be anonymized. The decomposition algorithm is developed to find the desired anonymous solution, which guarantees the uniqueness and correctness.","PeriodicalId":286251,"journal":{"name":"Proceedings of the 18th ACM conference on Information and knowledge management","volume":"157 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133409944","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bitmap indexes for relational XML twig query processing","authors":"Kyong-Ha Lee, Bongki Moon","doi":"10.1145/1645953.1646014","DOIUrl":"https://doi.org/10.1145/1645953.1646014","url":null,"abstract":"Due to an increasing volume of XML data, it is considered prudent to store XML data on an industry-strength database system instead of relying on a domain specific application or a file system. For shredded XML data stored in the relational tables, however, it may not be straightforward to apply existing algorithms for twig query processing, because most of the algorithms require XML data to be accessed in a form of streams of elements grouped by their tags and sorted in a particular order. In order to support XML query processing within the common framework of relational database systems, we first propose several bitmap indexes for supporting holistic twig joins on XML data stored in the relational tables. Since bitmap indexes are well supported in most of the commercial and open-source database systems, the proposed bitmap indexes and twig query processing algorithms can be incorporated into the relational query processing framework with more ease. The proposed query processing algorithms are efficient in terms of both time and space, since the compressed bitmap indexes stay compressed during query processing. In addition, we propose a hybrid index which computes twig query solutions with only bit-vectors, without accessing labeled XML elements stored in the relational tables.","PeriodicalId":286251,"journal":{"name":"Proceedings of the 18th ACM conference on Information and knowledge management","volume":"120 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133460866","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}