Searching for Truth in a Database of Statistics

Proceedings of the 21st International Workshop on the Web and Databases. Pub Date: 2018-06-10. DOI: 10.1145/3201463.3201467

Tien-Duc Cao, I. Manolescu, Xavier Tannier


Citations: 12

Abstract


Searching for Truth in a Database of Statistics

The proliferation of falsehood and misinformation, in particular through the Web, has led to increasing energy being invested in journalistic fact-checking. Fact-checking journalists typically check the accuracy of a claim against some trusted data source. Statistical databases such as those compiled by state agencies are often used as trusted data sources, as they contain valuable, high-quality information. However, their usability is limited when they are shared in a format such as HTML or spreadsheets: this makes it hard to find the most relevant dataset for checking a specific claim, or to quickly extract from a dataset the best answer to a given query. We present a novel algorithm enabling the exploitation of such statistical tables, by (i) identifying the statistical datasets most relevant for a given fact-checking query, and (ii) extracting from each dataset the best specific (precise) query answer it may contain. We have implemented our approach and experimented on the complete corpus of statistics obtained from INSEE, the French national statistics institute. Our experiments and comparisons demonstrate the effectiveness of our proposed method.
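The two steps named in the abstract, (i) ranking datasets for a query and (ii) pulling one precise answer from the chosen dataset, can be illustrated with a minimal sketch. This is not the authors' algorithm, which the abstract does not detail: it assumes a plain TF-IDF-style keyword overlap for ranking and a header-matching rule for cell extraction. The function names (`rank_datasets`, `best_cell`), the dictionary layout, and all table values are hypothetical toy choices, not INSEE data.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; real systems do more)."""
    return text.lower().split()


def rank_datasets(query, datasets):
    """Step (i): order dataset names by a TF-IDF-weighted keyword overlap
    between the query and each dataset's title and headers."""
    docs = {
        name: Counter(tokenize(" ".join([d["title"], *d["rows"], *d["cols"]])))
        for name, d in datasets.items()
    }
    n = len(docs)
    query_terms = set(tokenize(query))

    def idf(term):
        # Smoothed inverse document frequency over the dataset metadata.
        df = sum(1 for counts in docs.values() if term in counts)
        return math.log((1 + n) / (1 + df)) + 1.0

    scores = {
        name: sum(counts[t] * idf(t) for t in query_terms)
        for name, counts in docs.items()
    }
    return sorted(scores, key=lambda name: scores[name], reverse=True)


def best_cell(query, dataset):
    """Step (ii): pick the row and column whose headers share the most
    tokens with the query, and return that single cell."""
    query_terms = set(tokenize(query))

    def overlap(header):
        return len(query_terms & set(tokenize(header)))

    rows, cols = dataset["rows"], dataset["cols"]
    i = max(range(len(rows)), key=lambda k: overlap(rows[k]))
    j = max(range(len(cols)), key=lambda k: overlap(cols[k]))
    return rows[i], cols[j], dataset["cells"][i][j]


# Invented toy tables; the numbers are placeholders, not real statistics.
DATASETS = {
    "unemployment": {
        "title": "unemployment rate by region",
        "rows": ["north", "south"],
        "cols": ["2015", "2016"],
        "cells": [[10.0, 11.0], [12.0, 13.0]],
    },
    "population": {
        "title": "population by region",
        "rows": ["north", "south"],
        "cols": ["2015", "2016"],
        "cells": [[100, 101], [200, 201]],
    },
}

query = "unemployment rate south 2016"
ranked = rank_datasets(query, DATASETS)
answer = best_cell(query, DATASETS[ranked[0]])
print(ranked[0], answer)
```

Separating ranking from extraction mirrors the (i)/(ii) split in the abstract: the first step narrows a large corpus to a few candidate tables, and the second returns a single cell rather than a whole table.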


Source journal

Proceedings of the 21st International Workshop on the Web and Databases

Self-citation rate

0.00%

Number of publications