Self-Enhancing Video Data Management System for Compositional Events with Large Language Models [Technical Report]

Enhao Zhang, Nicole Sullivan, Brandon Haynes, Ranjay Krishna, Magdalena Balazinska

arXiv - CS - Databases · 2024-08-05 · https://doi.org/arxiv-2408.02243
Abstract
Complex video queries can be answered by decomposing them into modular subtasks. However, existing video data management systems assume the existence of predefined modules for each subtask. We introduce VOCAL-UDF, a novel self-enhancing system that supports compositional queries over videos without the need for predefined modules. VOCAL-UDF automatically identifies and constructs missing modules and encapsulates them as user-defined functions (UDFs), thus expanding its querying capabilities. To achieve this, we formulate a unified UDF model that leverages large language models (LLMs) to aid in new UDF generation. VOCAL-UDF handles a wide range of concepts by supporting both program-based UDFs (i.e., Python functions generated by LLMs) and distilled-model UDFs (lightweight vision models distilled from strong pretrained models). To resolve the inherent ambiguity in user intent, VOCAL-UDF generates multiple candidate UDFs and uses active learning to efficiently select the best one. With this self-enhancing capability, VOCAL-UDF significantly improves query performance across three video datasets.
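To make the program-based UDF idea concrete, the sketch below shows what an LLM-generated Python UDF for a spatial concept might look like, along with a simple routine that picks among candidate UDFs by agreement with user-labeled examples. All function names, the bounding-box representation, and the selection scheme are hypothetical illustrations, not taken from the paper; in particular, VOCAL-UDF uses active learning to choose which examples to label, whereas this sketch assumes the labels are already given.

```python
# Hypothetical sketch of program-based UDFs and candidate selection.
# None of these names come from VOCAL-UDF; they only illustrate the idea.

def is_left_of(obj_a: dict, obj_b: dict) -> bool:
    """Candidate UDF 1: A is left of B if A's right edge is left of B's left edge."""
    return obj_a["x2"] <= obj_b["x1"]

def is_left_of_center(obj_a: dict, obj_b: dict) -> bool:
    """Candidate UDF 2: compare bounding-box horizontal centers instead."""
    center = lambda o: (o["x1"] + o["x2"]) / 2
    return center(obj_a) < center(obj_b)

def select_best_udf(candidates, labeled_examples):
    """Return the candidate that agrees most with the labeled examples.

    Each labeled example is (obj_a, obj_b, expected_bool)."""
    def accuracy(udf):
        hits = sum(udf(a, b) == label for a, b, label in labeled_examples)
        return hits / len(labeled_examples)
    return max(candidates, key=accuracy)

# Toy bounding boxes: {"x1": left edge, "x2": right edge}
examples = [
    ({"x1": 0, "x2": 10}, {"x1": 20, "x2": 30}, True),    # clearly left
    ({"x1": 0, "x2": 25}, {"x1": 20, "x2": 30}, True),    # overlapping, but left of center
    ({"x1": 40, "x2": 50}, {"x1": 20, "x2": 30}, False),  # to the right
]
best = select_best_udf([is_left_of, is_left_of_center], examples)
```

On these examples the strict-edge candidate misclassifies the overlapping pair, so the center-based candidate is selected; the same disagreement-resolution principle is what makes generating multiple candidates and labeling a few examples worthwhile.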