Daomin Ji, Hui Luo, Zhifeng Bao, J. Shane Culpepper
arXiv - CS - Databases, 2024-08-18, https://doi.org/arxiv-2408.09506
The Story Behind the Lines: Line Charts as a Gateway to Dataset Discovery
Line charts are a valuable tool for data analysis and exploration, distilling
essential insights from a dataset. However, the dataset underlying a line
chart is rarely available. In this paper, we explore a
novel dataset discovery problem, dataset discovery via line charts, focusing on
the use of line charts as queries to discover datasets within a large data
repository that are capable of generating similar line charts. To solve this
problem, we propose a novel approach called Fine-grained Cross-modal Relevance
Learning Model (FCM), which aims to estimate the relevance between a line chart
and a candidate dataset. To achieve this goal, FCM first employs a visual
element extractor to extract informative visual elements, i.e., lines and
y-ticks, from a line chart. Then, two novel segment-level encoders are adopted
to learn representations for a line chart and a dataset, preserving
fine-grained information, followed by a cross-modal matcher to match the
learned representations in a fine-grained way. Furthermore, we extend FCM to
support line chart queries generated from aggregated data. Finally, we
propose a benchmark tailored to this problem, since no such benchmark exists.
Extensive evaluation on the new benchmark verifies the effectiveness of our
method: it surpasses the best baseline by 30.1% and 41.0% in terms of prec@50
and ndcg@50, respectively.
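
For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how prec@k and ndcg@k are commonly computed for a single query. It is not taken from the paper. The function names, the convention that a label greater than zero counts as relevant, and the choice to normalise against the ideal ordering of the returned list itself are all assumptions made for illustration; the paper's exact evaluation protocol may differ.

```python
import math
from typing import Sequence


def precision_at_k(relevance: Sequence[float], k: int) -> float:
    """Fraction of the top-k ranked items that are relevant.

    `relevance` holds one label per retrieved dataset, in ranked order.
    Assumption: any label > 0 counts as relevant.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for r in relevance[:k] if r > 0) / k


def dcg_at_k(relevance: Sequence[float], k: int) -> float:
    """Discounted cumulative gain with the standard log2 position discount."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(relevance[:k]))


def ndcg_at_k(relevance: Sequence[float], k: int) -> float:
    """DCG normalised by the DCG of the ideal (descending) ordering.

    Assumption: the ideal ordering is built from the labels of the
    returned list itself, not from the whole repository.
    """
    ideal = dcg_at_k(sorted(relevance, reverse=True), k)
    return dcg_at_k(relevance, k) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical relevance labels for the top five datasets of one query.
    ranked = [1, 0, 1, 1, 0]
    print(precision_at_k(ranked, 3))
    print(ndcg_at_k(ranked, 3))
```

With k = 50, these functions would correspond to prec@50 and ndcg@50 for one query; a benchmark score would typically average them over all queries.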