Multi-Hypothesis CSV Parsing

Proceedings of the 29th International Conference on Scientific and Statistical Database Management. Pub Date: 2017-06-27. DOI: 10.1145/3085504.3085520

Till Döhmen, H. Mühleisen, P. Boncz


Citations: 23

Abstract



Comma Separated Value (CSV) files are commonly used to represent data. CSV is a very simple format, yet we show that it gives rise to a surprisingly large number of ambiguities in its parsing and interpretation. We summarize the state of the art in CSV parsers, which typically make a linear series of parsing and interpretation decisions, such that any wrong decision at an earlier stage can negatively affect all downstream decisions. Since computation time is much less scarce than human time, we propose to turn CSV parsing into a ranking problem. Our quality-oriented multi-hypothesis CSV parsing approach generates several concurrent hypotheses about dialect, table structure, etc. and ranks these hypotheses based on quality features of the resulting table. This approach makes it possible to create an advanced CSV parser that makes many different decisions, yet keeps the overall parser code a simple plug-in infrastructure. The complex interactions between these decisions are taken care of by searching the hypothesis space rather than by having to program these many interactions in code. We show that our approach leads to better parsing results than the state of the art and facilitates the parsing of large corpora of heterogeneous CSV files.
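The idea of treating parsing as a ranking problem can be illustrated with a toy sketch. This is not the authors' parser: the candidate dialects, the function names (`parse`, `quality`, `rank_hypotheses`), and the three quality features below are assumptions chosen only to show the shape of the approach, using Python's standard `csv` module.

```python
import csv
import io
from itertools import product


def parse(text, delimiter, quotechar):
    """Parse text under one dialect hypothesis; blank lines are dropped."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter, quotechar=quotechar)
    return [row for row in reader if row]


def quality(rows):
    """Toy quality score in [0, 1] (hypothetical features, not from the paper)."""
    if not rows:
        return 0.0
    widths = [len(r) for r in rows]
    typical = max(set(widths), key=widths.count)
    if typical < 2:
        # A single-column table usually means the delimiter guess was wrong.
        return 0.0
    cells = [c for r in rows for c in r]
    consistency = widths.count(typical) / len(widths)       # rows share one width
    non_empty = sum(1 for c in cells if c.strip()) / len(cells)
    # Leftover quote characters suggest the quoting hypothesis was wrong.
    clean = sum(1 for c in cells if '"' not in c and "'" not in c) / len(cells)
    return consistency * non_empty * clean


def rank_hypotheses(text, delimiters=",;\t|", quotechars="\"'"):
    """Try every (delimiter, quotechar) pair and sort by quality, best first."""
    scored = []
    for d, q in product(delimiters, quotechars):
        try:
            rows = parse(text, d, q)
        except csv.Error:
            continue  # this hypothesis cannot parse the input at all
        scored.append((quality(rows), d, q))
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored


sample = 'name;city;note\nAnn;Oslo;"likes a, b"\nBob;Rome;"x"\n'
best_score, best_delimiter, best_quotechar = rank_hypotheses(sample)[0]
```

Every hypothesis is parsed independently, so a wrong guess about the delimiter cannot contaminate a later decision; it simply scores poorly. Adding another decision (for example, header detection) would mean adding another dimension to the hypothesis space and another feature to the score, rather than new branching logic.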

