The Story Behind the Lines: Line Charts as a Gateway to Dataset Discovery

arXiv - CS - Databases | Pub Date: 2024-08-18 | DOI: arxiv-2408.09506

Daomin Ji, Hui Luo, Zhifeng Bao, J. Shane Culpepper

Abstract

Line charts are a valuable tool for data analysis and exploration, distilling essential insights from a dataset. However, access to the underlying dataset behind a line chart is rarely readily available. In this paper, we explore a novel dataset discovery problem, dataset discovery via line charts, focusing on the use of line charts as queries to discover datasets within a large data repository that are capable of generating similar line charts. To solve this problem, we propose a novel approach called Fine-grained Cross-modal Relevance Learning Model (FCM), which aims to estimate the relevance between a line chart and a candidate dataset. To achieve this goal, FCM first employs a visual element extractor to extract informative visual elements, i.e., lines and y-ticks, from a line chart. Then, two novel segment-level encoders are adopted to learn representations for a line chart and a dataset, preserving fine-grained information, followed by a cross-modal matcher to match the learned representations in a fine-grained way. Furthermore, we extend FCM to support line chart queries generated based on data aggregation. Last, we propose a benchmark tailored for this problem since no such dataset exists. Extensive evaluation on the new benchmark verifies the effectiveness of our proposed method. Specifically, our proposed approach surpasses the best baseline by 30.1% and 41.0% in terms of prec@50 and ndcg@50, respectively.
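The abstract reports results as prec@50 and ndcg@50 without defining them. As a minimal sketch, assuming the common textbook definitions (relevance labels given in ranked order, a linear gain, a log2 rank discount, and an ideal ranking built from the same retrieved list), the two metrics could be computed as below. The function names are hypothetical, and the paper's exact formulation may differ, for example in the gain function or in how the ideal ranking is formed.

```python
import math
from typing import Sequence


def precision_at_k(relevance: Sequence[float], k: int) -> float:
    """Fraction of the top-k ranked results with a relevance label > 0."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for rel in relevance[:k] if rel > 0) / k


def dcg_at_k(relevance: Sequence[float], k: int) -> float:
    """Discounted cumulative gain: each label is divided by log2(position + 1),
    where position is 1-based (enumerate is 0-based, hence rank + 2)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevance[:k]))


def ndcg_at_k(relevance: Sequence[float], k: int) -> float:
    """DCG of the given ranking divided by the DCG of the best possible
    reordering of the same labels (an assumption of this sketch)."""
    if k <= 0:
        raise ValueError("k must be positive")
    ideal = dcg_at_k(sorted(relevance, reverse=True), k)
    return dcg_at_k(relevance, k) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical labels for the datasets returned for one line-chart query,
    # listed in ranked order (1 = relevant, 0 = not relevant).
    ranked_labels = [1, 0, 1, 1, 0]
    print(precision_at_k(ranked_labels, 5))
    print(ndcg_at_k(ranked_labels, 5))
```

With k = 50, these functions would score the first 50 datasets returned for a query; averaging over all queries gives a benchmark-level figure.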
