Kv2vec: A Distributed Representation Method for Key-value Pairs from Metadata Attributes
Authors: Chenxu Niu, Wei Zhang, S. Byna, Yong Chen
Venue: 2022 IEEE High Performance Extreme Computing Conference (HPEC)
Publication date: 2022-09-19
DOI: https://doi.org/10.1109/HPEC55821.2022.9926389
Citations: 0
Abstract
Distributed representation methods for words have been developed for years, and numerous methods exist, such as word2vec, GloVe, and fastText. However, these methods are not designed for key-value pairs, an important data pattern that is widely used in many scenarios. For example, the metadata attributes of scientific files consist of a collection of key-value pairs. In this research, we propose kv2vec, a method that captures relationships between keys and values and represents key-value pairs as dense vectors. The fundamental idea of kv2vec is to use recurrent neural networks (RNNs) with long short-term memory (LSTM) hidden units to convert each key-value pair into a distributed vector representation. This new method overcomes the weaknesses of existing embedding models in representing key-value pairs as vectors. Moreover, it can be integrated into dataset search solutions by querying the metadata attributes of self-describing file formats, which are widely used in HPC systems. We evaluate kv2vec on multiple real-world datasets, and the results show that it outperforms existing models.
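
To make the core idea concrete, the following is a minimal sketch of how an LSTM can turn one key-value pair into a fixed-length vector. It is not the authors' implementation. The abstract does not specify tokenization, dimensions, or the training objective, so everything here is an assumption for illustration: character-level tokens, a tab separator between key and value, an 8-dimensional hidden state, and random untrained weights. The name TinyLSTMEncoder is hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyLSTMEncoder:
    """Hypothetical, untrained character-level LSTM encoder (illustration only)."""

    def __init__(self, hidden_size=8, embed_size=8, seed=0):
        rng = random.Random(seed)
        self.hidden_size = hidden_size
        self.embed_size = embed_size
        self.seed = seed
        self.embeddings = {}  # character -> embedding vector, filled on demand
        in_size = embed_size + hidden_size
        # One weight matrix and bias per gate:
        # i = input, f = forget, o = output, c = candidate cell state.
        self.W = {
            g: [[rng.uniform(-0.1, 0.1) for _ in range(in_size)]
                for _ in range(hidden_size)]
            for g in "ifoc"
        }
        self.b = {g: [0.0] * hidden_size for g in "ifoc"}

    def _embed(self, ch):
        # Assumption: a fixed random embedding per character, derived from
        # its code point so that it does not depend on encounter order.
        if ch not in self.embeddings:
            rng = random.Random(self.seed * 1_000_003 + ord(ch))
            self.embeddings[ch] = [rng.uniform(-1.0, 1.0)
                                   for _ in range(self.embed_size)]
        return self.embeddings[ch]

    def _gate(self, name, x, activation):
        # Affine transform of the concatenated input, then the activation.
        return [activation(sum(w * v for w, v in zip(row, x)) + bias)
                for row, bias in zip(self.W[name], self.b[name])]

    def encode(self, key, value):
        """Run the LSTM over the characters of key, a tab, then value."""
        h = [0.0] * self.hidden_size  # hidden state
        c = [0.0] * self.hidden_size  # cell state
        for ch in f"{key}\t{value}":
            x = self._embed(ch) + h  # concatenate input and previous hidden state
            i = self._gate("i", x, sigmoid)
            f = self._gate("f", x, sigmoid)
            o = self._gate("o", x, sigmoid)
            g = self._gate("c", x, math.tanh)
            c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
            h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        return h  # final hidden state serves as the pair's vector


encoder = TinyLSTMEncoder(hidden_size=8)
vec = encoder.encode("instrument", "spectrometer")
print(len(vec))  # 8
```

Because the recurrence reads the key before the value, swapping the two yields a different vector, which is one way a sequence model can distinguish the roles of key and value. With untrained weights the vectors carry no semantic meaning; a real system would learn the weights from data.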