{"title":"Determining Bias to Search Engines from Robots.txt","authors":"Yang Sun, Ziming Zhuang, Isaac G. Councill, C. Lee Giles","doi":"10.1109/WI.2007.45","DOIUrl":"https://doi.org/10.1109/WI.2007.45","url":null,"abstract":"Search engines largely rely on robots (i.e., crawlers or spiders) to collect information from the Web. Such crawling activities can be regulated from the server side by deploying the Robots Exclusion Protocol in a file called robots.txt. Ethical robots will follow the rules specified in robots.txt. Websites can explicitly specify an access preference for each robot by name. Such biases may lead to a \"rich get richer\" situation, in which a few popular search engines ultimately dominate the Web because they have preferred access to resources that are inaccessible to others. This issue is seldom addressed, although the robots.txt convention has become a de facto standard for robot regulation and search engines have become an indispensable tool for information access. We propose a metric to evaluate the degree of bias to which specific robots are subjected. We have investigated 7,593 websites covering education, government, news, and business domains, and collected 2,925 distinct robots.txt files. Results of content and statistical analysis of the data confirm that the robots of popular search engines and information portals, such as Google, Yahoo, and MSN, are generally favored by most of the websites we have sampled. 
The results also show a strong correlation between the search engine market share and the bias toward particular search engine robots.","PeriodicalId":192501,"journal":{"name":"IEEE/WIC/ACM International Conference on Web Intelligence (WI'07)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127665656","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Concordance-Based Entity-Oriented Search","authors":"Mikhail Bautin, S. Skiena","doi":"10.1109/WI.2007.37","DOIUrl":"https://doi.org/10.1109/WI.2007.37","url":null,"abstract":"We consider the problem of finding the relevant named entities in response to a search query over a given text corpus. Entity search can readily be used to augment conventional web search engines for a variety of applications. To assess the significance of entity search, we analyzed the AOL dataset of 36 million web search queries with respect to two different sets of entities: namely (a) 2.3 million distinct entities extracted from a news text corpus and (b) 2.9 million Wikipedia article titles. The results clearly indicate that search engines should be aware of entities, for under various criteria of matching between 18-39% of all web search queries can be recognized as specifically searching for entities, while 73-87% of all queries contain entities. Our entity search engine creates a concordance document for each entity, consisting of all the sentences in the corpus containing that entity. We then index and search these documents using open-source search software. This gives a ranked list of entities as the result of search. Visit http://www.textmap.com for a demonstration of our entity search engine over a large news corpus. We evaluate our system by comparing the results of each query to the list of entities that have highest statistical juxtaposition scores with the queried entity. Juxtaposition score is a measure of how strongly two entities are related in terms of a probabilistic upper bound. 
The results show excellent performance, particularly over well-characterized classes of entities such as people.","PeriodicalId":192501,"journal":{"name":"IEEE/WIC/ACM International Conference on Web Intelligence (WI'07)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132754928","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Extending Description Logic for Reasoning about Ontology Evolution","authors":"Chuming Chen, M. Matthews","doi":"10.1109/WI.2007.53","DOIUrl":"https://doi.org/10.1109/WI.2007.53","url":null,"abstract":"Ontologies play a key role in achieving global automatic information integration and sharing on the Semantic Web. They allow intelligent applications to exchange information through a shared and formal conceptualization of an application domain. Understanding ontology evolution can help both ontology developers and users evaluate the potential consequences of ontology changes and act accordingly. Our contribution is to propose a temporal paradigm for ontology evolution and to extend Description Logic with Temporal Logic operators to formally characterize and reason about ontology evolution. We investigate related reasoning problems and algorithms.","PeriodicalId":192501,"journal":{"name":"IEEE/WIC/ACM International Conference on Web Intelligence (WI'07)","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114074299","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"You Can't Always Get What You Want: Achieving Differentiated Service Levels with Pricing Agents in a Storage Grid","authors":"H. H. Huang, A. Grimshaw, John F. Karpovich","doi":"10.1109/WI.2007.117","DOIUrl":"https://doi.org/10.1109/WI.2007.117","url":null,"abstract":"We have designed a new storage grid called Storage@desk to harness unused storage available on desktop machines and turn it into a useful resource for clients. Given the complexity of managing client-specific QoS requirements, and the dynamism inherent in supply and demand for resources, even a highly experienced system administrator cannot effectively manage resource allocation. In this paper, we present a market-based resource allocation model where pricing agents help resource providers adjust the prices as demand fluctuates. With derivative-following pricing, an agent requires no knowledge of competitors or consumers, which reduces communication overheads and avoids bottlenecks in the system. Individual clients need a variety of service levels and compete for scarce resources. Under the budget constraints, the consumers can't always get what they want. The budgets serve as an incentive for the consumers to react to the price signals. 
We simulate our model using real world trace data and the results show that, using this model, the system allows the consumers to achieve QoS goals under sufficient budgets and degrade in accordance with relative budget amounts.","PeriodicalId":192501,"journal":{"name":"IEEE/WIC/ACM International Conference on Web Intelligence (WI'07)","volume":"25 6","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114234203","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using Novel IR Measures to Learn Optimal Cluster Structures for Web Information Retrieval","authors":"Martin Mehlitz, Jérôme Kunegis, S. Albayrak","doi":"10.1109/WI.2007.107","DOIUrl":"https://doi.org/10.1109/WI.2007.107","url":null,"abstract":"The Internet is a vast resource of information. Unfortunately, finding and accessing this information is often a very cumbersome task even with existing information platforms. Searching on the WWW suffers from the fact that almost every word is ambiguous to a certain degree in the information-rich environment of the Internet. Clustering search results is a way to solve this problem. This paper demonstrates how to employ novel Information Retrieval measures to derive optimal parametrizations for a cluster algorithm.","PeriodicalId":192501,"journal":{"name":"IEEE/WIC/ACM International Conference on Web Intelligence (WI'07)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123850692","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Comparison of Dimensionality Reduction Techniques for Web Structure Mining","authors":"N. F. Chikhi, B. Rothenburger, Nathalie Aussenac-Gilles","doi":"10.1109/WI.2007.6","DOIUrl":"https://doi.org/10.1109/WI.2007.6","url":null,"abstract":"In many domains, dimensionality reduction techniques have been shown to be very effective for elucidating the underlying semantics of data. Thus, in this paper we investigate the use of various dimensionality reduction techniques (DRTs) to extract the implicit structures hidden in the Web hyperlink connectivity. We apply and compare four DRTs, namely, principal component analysis (PCA), non-negative matrix factorization (NMF), independent component analysis (ICA) and random projection (RP). Experiments conducted on three datasets allow us to assert the following: NMF outperforms PCA and ICA in terms of stability and interpretability of the discovered structures; the well-known WebKb dataset, used in a large number of works on the analysis of hyperlink connectivity, seems ill-adapted to this task, and we suggest using the recent Wikipedia dataset instead, which is better suited.","PeriodicalId":192501,"journal":{"name":"IEEE/WIC/ACM International Conference on Web Intelligence (WI'07)","volume":"178 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125816758","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Perseus -- A Personalized Reputation System","authors":"P. Nurmi","doi":"10.1109/WI.2007.144","DOIUrl":"https://doi.org/10.1109/WI.2007.144","url":null,"abstract":"We propose Perseus, a personalized reputation system. In Perseus, reputations comprise three aspects: how much I personally trust another individual, how trustworthy others think the individual is, and how much I trust the opinions of others. Perseus is adaptive in the sense that user feedback is used to modify the way the different aspects are considered. We also present simulation experiments, which indicate that Perseus is robust and able to survive under extreme conditions of misbehavior. In addition, Perseus encourages individuals to rate the other party and give fair ratings. We also compare Perseus against other well-known reputation systems.","PeriodicalId":192501,"journal":{"name":"IEEE/WIC/ACM International Conference on Web Intelligence (WI'07)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115904335","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Common Design-Features Ontology for Product Data Semantics Interoperability","authors":"Samer Abdul Ghafour, P. Ghodous, B. Shariat, E. Perna","doi":"10.1109/WI.2007.5","DOIUrl":"https://doi.org/10.1109/WI.2007.5","url":null,"abstract":"In a collaborative design environment, various software tools are utilized to enhance the product development. This entails a meaningful representation and exchange of product data semantics across these different systems. Semantic interoperability of product information refers to enabling the exchange of design intelligence, including construction history, parameters, features, and constraints. This is a crucial difference compared to current standards such as STEP that deliver \"dumb\" geometry, where no design intent is associated. To enable semantics data exchange, we propose an ontology-based approach, which consists of developing a \"Common Design Features Ontology\", called CDFO. Interoperability among ontologies is fulfilled by defining several mapping rules. We use a descriptive logic-based language, notably OWL DL, to formally represent our ontology.","PeriodicalId":192501,"journal":{"name":"IEEE/WIC/ACM International Conference on Web Intelligence (WI'07)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130409503","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Correct your text with Google","authors":"Stéphanie Jacquemont, F. Jacquenet, M. Sebban","doi":"10.1109/WI.2007.41","DOIUrl":"https://doi.org/10.1109/WI.2007.41","url":null,"abstract":"With the increasing amount of text files that are produced nowadays, spell checkers have become essential tools for everyday tasks of millions of end users. Over the years, several tools have been designed that show decent performance. Of course, grammatical checkers may improve corrections of texts; nevertheless, this requires large resources. We think that a step towards improving basic spell checking is to use the Web as a corpus and to take into account the context of words that are identified as potential misspellings. We propose to use the Google search engine and some machine learning techniques, in order to design a flexible and dynamic spell checker that may evolve over time with new linguistic features.","PeriodicalId":192501,"journal":{"name":"IEEE/WIC/ACM International Conference on Web Intelligence (WI'07)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125326172","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Didactic-based Model of Scenarios for Designing an Adaptive and Context-Aware Learning System","authors":"Jean-Louis Tetchueng, Serge Garlatti, S. Laubé","doi":"10.1109/WI.2007.118","DOIUrl":"https://doi.org/10.1109/WI.2007.118","url":null,"abstract":"Nowadays, technology-enhanced learning systems must have the ability to deal with the context and to allow dynamic adaptation based on pedagogical theories and knowledge models. The main issue is to design a generic scenario to deal with the broadest range of learning situations. From a generic scenario, the learning system will compute on the fly a scenario adapted to the current learner and its situation. Our main contribution is a semantic and didactic-based model of scenarios to design an adaptive and context-aware learning system. The scenario model is acquired from: i) the know-how and real practices of teachers; ii) the theory in didactic anthropology of knowledge of Chevallard [1]; iii) a hierarchical task model.","PeriodicalId":192501,"journal":{"name":"IEEE/WIC/ACM International Conference on Web Intelligence (WI'07)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124794965","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}