Proceedings of the 18th International Workshop on Web and Databases: Latest Articles

Truth Finding with Attribute Partitioning
Proceedings of the 18th International Workshop on Web and Databases Pub Date : 2015-05-31 DOI: 10.1145/2767109.2767118
M. Ba, Roxana Horincar, P. Senellart, Huayu Wu
Abstract: Truth finding is the problem of determining which of the statements made by contradictory sources is correct, in the absence of prior information on the trustworthiness of the sources. A number of approaches to truth finding have been proposed, from simple majority voting to elaborate iterative algorithms that estimate the quality of sources by corroborating their statements. In this paper, we consider the case where there is an inherent structure in the statements made by sources about real-world objects, one that implies different quality levels of a given source on different groups of attributes of an object. We do not assume this structure to be given, but instead find it automatically, by exploring and weighting the partitions of the sets of attributes of an object, and applying a reference truth finding algorithm on each subset of the optimal partition. Our experimental results on synthetic and real-world datasets show that we obtain better precision at truth finding than baselines in cases where data has an inherent structure.
Citations: 8
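The "simple majority voting" baseline mentioned in the abstract above can be sketched as follows. This is only an illustrative sketch, not the authors' algorithm: the function name `majority_vote`, the claim tuple layout `(source, object, attribute, value)`, and the toy claims are all assumptions introduced here.

```python
from collections import Counter, defaultdict


def majority_vote(claims):
    """Pick, for each (object, attribute) pair, the value claimed most often.

    `claims` is an iterable of hypothetical (source, obj, attribute, value)
    tuples. Every source counts equally; no source quality is estimated.
    """
    votes = defaultdict(Counter)
    for _source, obj, attribute, value in claims:
        votes[(obj, attribute)][value] += 1
    # most_common(1) yields the single highest-count value for each pair.
    return {key: counter.most_common(1)[0][0] for key, counter in votes.items()}


# Toy, made-up claims from three sources about one object.
claims = [
    ("s1", "o1", "a1", "x"),
    ("s2", "o1", "a1", "x"),
    ("s3", "o1", "a1", "y"),
    ("s1", "o1", "a2", "p"),
    ("s2", "o1", "a2", "q"),
    ("s3", "o1", "a2", "q"),
]
truth = majority_vote(claims)
```

In this toy data, source s1 is right on attribute a1 but outvoted on a2, which hints at why estimating source quality per group of attributes, as the paper proposes, could matter.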
Discovering Subsumption Relationships for Web-Based Ontologies
Proceedings of the 18th International Workshop on Web and Databases Pub Date : 2015-05-31 DOI: 10.1145/2767109.2767111
Dana Movshovitz-Attias, Steven Euijong Whang, Natasha Noy, A. Halevy
Abstract: As search engines are becoming smarter at interpreting user queries and providing meaningful responses, they rely on ontologies to understand the meaning of entities. Creating ontologies manually is a laborious process, and resulting ontologies may not reflect the way users think about the world, as many concepts used in queries are noisy, and not easily amenable to formal modeling. There has been considerable effort in generating ontologies from Web text and query streams, which may be more reflective of how users query and write content. In this paper, we describe the LATTE system that automatically generates a subconcept–superconcept hierarchy, which is critical for using ontologies to answer queries. LATTE combines signals based on word-vector representations of concepts and dependency parse trees; however, LATTE derives most of its power from an ontology of attributes extracted from the Web that indicates the aspects of concepts that users find important. LATTE achieves an F1 score of 74%, which is comparable to expert agreement on a similar task. We additionally demonstrate the usefulness of LATTE in detecting high-quality concepts from an existing resource of IsA links.
Citations: 11
Addressing Instance Ambiguity in Web Harvesting
Proceedings of the 18th International Workshop on Web and Databases Pub Date : 2015-05-31 DOI: 10.1145/2767109.2767114
Zhixu Li, Xiangliang Zhang, Hai Huang, Qing Xie, Jia Zhu, Xiaofang Zhou
Abstract: Web Harvesting enables the enrichment of incomplete data sets by retrieving required information from the Web. However, the ambiguity of instances may greatly decrease the quality of the harvested data, given that any instance in the local data set may become ambiguous when attempting to identify it on the Web. Although plenty of disambiguation methods have been proposed to deal with the ambiguity problems in various settings, none of them are able to handle the instance ambiguity problem in Web Harvesting. In this paper, we propose to do instance disambiguation in Web Harvesting with a novel disambiguation method inspired by the idea of collaborative identity recognition. In particular, we expect to find some common properties in the form of latent shared attribute values among instances in the list, such that these shared attribute values can differentiate instances within the list from the ambiguous ones on the Web. Our extensive experimental evaluation illustrates the utility of collaborative disambiguation for a popular Web Harvesting application, and shows that it substantially improves the accuracy of the harvested data.
Citations: 1
Long-term Optimization of Update Frequencies for Decaying Information
Proceedings of the 18th International Workshop on Web and Databases Pub Date : 2015-05-31 DOI: 10.1145/2767109.2767113
Simon Razniewski, W. Nutt
Abstract: Many kinds of information, such as addresses, crawls of webpages, or academic affiliations, are prone to becoming outdated over time. Therefore, in some applications, updates are performed periodically in order to keep the correctness and usefulness of such information high. As refreshing information usually has a cost, e.g., computation time, network bandwidth or human work time, a problem is to find the right update frequency depending on the benefit gained from the information and on the speed with which the information is expected to get outdated. This is especially important since entities often become outdated at different speeds: e.g., addresses of students change more frequently than addresses of pensioners, and news portals change more frequently than personal homepages. Thus, there is no uniform best update frequency for all entities. Previous work [5] on data freshness has focused on the question of how to best distribute a fixed budget for updates among various entities, which is of interest in the short term, when resources are fixed and cannot be adjusted. In the long term, many businesses are able to adjust their resources in order to optimize their gain. Then, the problem is not one of distributing a fixed number of updates but one of determining the frequency of updates that optimizes the overall gain from the information. In this paper, we investigate how the optimal update frequency for decaying information can be determined. We show that the optimal update frequency is independent for each entity, and how simple iteration can be used to find the optimal update frequency. An implementation of our solution for exponential decay is available online.
Citations: 1
FOREST: Focused Object Retrieval by Exploiting Significant Tag Paths
Proceedings of the 18th International Workshop on Web and Databases Pub Date : 2015-05-31 DOI: 10.1145/2767109.2767112
Marilena Oita, P. Senellart
Abstract: Content-intensive websites, e.g., of blogs or news, present pages that contain Web articles automatically generated by content management systems. Identification and extraction of their main content is critical in many applications, such as indexing or classification. We present a novel unsupervised approach for the extraction of Web articles from dynamically-generated Web pages. Our system, called Forest, combines structural and information-based features to target the main content generated by a Web source, and published in associated Web pages. We extensively evaluate Forest with respect to various baselines and datasets, and report improved results over state-of-the-art techniques in content extraction.
Citations: 1
TriAL-QL: Distributed Processing of Navigational Queries
Proceedings of the 18th International Workshop on Web and Databases Pub Date : 2015-05-31 DOI: 10.1145/2767109.2767115
Martin Przyjaciel-Zablocki, A. Schätzle, Adriano Lange
Abstract: Navigational queries are among the most natural query patterns for RDF data, yet most existing RDF query languages, including SPARQL 1.1 and its derivatives, fail to cover all the varieties inherent to its triple-based model. As a consequence, the development of more expressive RDF languages is of general interest. With TriAL* [14], there exists an expressive algebra which subsumes many previous approaches, while adding novel features that are not expressible in most other RDF query languages based on the standard graph model. However, its algebraic notation is inappropriate for practical usage and it is not supported by any existing RDF triple store. In this paper, we propose TriAL-QL, a language for TriAL* that is easy to write and grasp and that preserves its compositional algebraic structure. We present an implementation based on Impala, a massively parallel SQL query engine on Hadoop, using an optimized semi-naive evaluation for the recursive fragments of TriAL*. This way, we support both data-intensive ETL-like workloads and explorative ad-hoc style queries. To demonstrate the scalability and expressiveness of our approach, we conducted experiments on generated social networks with up to 1.8 billion triples and compared different execution strategies to a Hive-based solution.
Citations: 6
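The abstract above mentions semi-naive evaluation of recursive query fragments. As a rough illustration of that general technique (not of TriAL-QL itself, its syntax, or its Impala-based implementation), the sketch below computes the transitive closure of a small edge set in memory. The function name `transitive_closure_seminaive` and the toy edges are assumptions made for this example.

```python
def transitive_closure_seminaive(edges):
    """Return all (u, v) pairs such that v is reachable from u via `edges`.

    Semi-naive evaluation: each round joins only the pairs derived in the
    previous round (`delta`) with the base edges, rather than re-joining
    the whole result so far.
    """
    # Index base edges by source node to make the join cheap.
    successors = {}
    for u, v in edges:
        successors.setdefault(u, set()).add(v)

    closure = set(edges)
    delta = set(edges)
    while delta:
        # Extend only the newly found paths by one more base edge.
        derived = {(u, w) for (u, v) in delta for w in successors.get(v, ())}
        # Keep just the pairs not seen before; stop when none are new.
        delta = derived - closure
        closure |= delta
    return closure


# Hypothetical toy graph: a -> b -> c -> d.
edges = {("a", "b"), ("b", "c"), ("c", "d")}
reach = transitive_closure_seminaive(edges)
```

Restricting each round to the delta is what distinguishes this from naive fixpoint iteration; a distributed SQL engine would express the same idea with joins over tables rather than Python sets.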
Person-Name Parsing for Linking User Web Profiles
Proceedings of the 18th International Workshop on Web and Databases Pub Date : 2015-05-31 DOI: 10.1145/2767109.2767117
G. Das, Xiang Li, Ang Sun, Hakan Kardes, Xin Wang
Abstract: Person-name parsing involves the identification of constituent parts of a person's name. Due to multiple writing styles ("John Smith" versus "Smith, John"), extra information ("John Smith, PhD", "Rev. John Smith"), and country-specific last-name prefixes ("Jean van de Velde"), parsing fullname strings from user profiles on Web 2.0 applications is not straightforward. To the best of our knowledge, we are the first to address this problem systematically by proposing machine learning approaches for parsing noisy fullname strings. In this paper, we propose several types of features based on token statistics, surface patterns, and specialized dictionaries and apply them within a sequence modeling framework to learn a fullname parser. In particular, we propose the use of "bucket" features based on (name-token, label) distributions in lieu of "term" features frequently used in various Natural Language Processing applications to prevent the growth of learning parameters as a function of the training data size. We experimentally illustrate the generalizability, effectiveness, and efficiency aspects of our proposed features for noisy fullname parsing on fullname strings from the popular professional networking website LinkedIn and commonly used person names in the United States. On these datasets, our fullname parser significantly outperforms both the parser trained using classification approaches and a commercially available name parsing solution.
Citations: 5
Analyzing Crowd Rankings
Proceedings of the 18th International Workshop on Web and Databases Pub Date : 2015-05-31 DOI: 10.1145/2767109.2767110
Julia Stoyanovich, Marie Jacob, Xuemei Gong
Abstract: Ranked data is ubiquitous in real-world applications, arising naturally when users express preferences about products and services, when voters cast ballots in elections, and when funding proposals are evaluated based on their merits or university departments based on their reputation. This paper focuses on crowdsourcing and novel analysis of ranked data. We describe the design of a data collection task in which Amazon MT workers were asked to rank movies. We present results of data analysis, correlating our ranked dataset with IMDb, where movies are rated on a discrete scale rather than ranked. We develop an intuitive measure of worker quality appropriate for this task, where no gold standard answer exists. We propose a model of local structure in ranked datasets, reflecting that subsets of the workers agree in their ranking over subsets of the items, develop a data mining algorithm that identifies such structure, and evaluate it on our dataset. Our dataset is publicly available at https://github.com/stoyanovich/CrowdRank.
Citations: 14
The elephant in the room: getting value from Big Data
Proceedings of the 18th International Workshop on Web and Databases Pub Date : 2015-05-31 DOI: 10.1145/2767109.2770014
S. Abiteboul, X. Dong, Oren Etzioni, D. Srivastava, G. Weikum, Julia Stoyanovich, Fabian M. Suchanek
Abstract: Big Data, and its 4 Vs – volume, velocity, variety, and veracity – have been at the forefront of societal, scientific and engineering discourse. Arguably the most important 5th V, value, is not talked about as much. How can we make sure that our data is not just big, but also valuable? WebDB 2015 has as its theme “Freshness, Correctness, Quality of Information and Knowledge on the Web”. The workshop attracted 31 submissions, of which the best 9 were selected for presentation at the workshop, and for publication in the proceedings. To set the stage, we have interviewed several prominent members of the data management community, soliciting their opinions on how we can ensure that data is not just available in quantity, but also in quality. In this interview Serge Abiteboul, Oren Etzioni, Divesh Srivastava with Luna Dong, and Gerhard Weikum shared with us their motivation for doing research in the area of data quality, and discussed their current work and their view on the future of the field. This interview appeared as a SIGMOD Blog article.
Citations: 12
IBEX: Harvesting Entities from the Web Using Unique Identifiers
Proceedings of the 18th International Workshop on Web and Databases Pub Date : 2015-05-04 DOI: 10.1145/2767109.2767116
Aliaksandr Talaika, J. Biega, Antoine Amarilli, Fabian M. Suchanek
Abstract: In this paper we study the prevalence of unique entity identifiers on the Web. These are, e.g., ISBNs (for books), GTINs (for commercial products), DOIs (for documents), email addresses, and others. We show how these identifiers can be harvested systematically from Web pages, and how they can be associated with human-readable names for the entities at large scale. Starting with a simple extraction of identifiers and names from Web pages, we show how we can use the properties of unique identifiers to filter out noise and clean up the extraction result on the entire corpus. The end result is a database of millions of uniquely identified entities of different types, with an accuracy of 73–96% and a very high coverage compared to existing knowledge bases. We use this database to compute novel statistics on the presence of products, people, and other entities on the Web.
Citations: 16