Record2Vec: Unsupervised Representation Learning for Structured Records
Adelene Y. L. Sim, Andrew Borthwick
2018 IEEE International Conference on Data Mining (ICDM), published 2018-11-01
DOI: 10.1109/ICDM.2018.00165 (https://doi.org/10.1109/ICDM.2018.00165)
Citations: 9
Abstract
Structured records, that is, data with a fixed number of descriptive fields (or attributes), are often represented by one-hot encoded or term frequency-inverse document frequency (TF-IDF) weighted vectors. These vectors are typically long and sparse, which makes them an inefficient representation of structured records. Here, we introduce Record2Vec, a framework for generating dense embeddings of structured records by training on associations between attributes within record instances. We build our embedding on a simple premise: the attributes of a structured record are associated with one another, so the embedding of an attribute can be trained from the other attributes in the same record (its context), much as word embeddings are trained from their surrounding context. Because this embedding technique is general and does not assume the availability of any labeled data, it is extendable across different domains and fields. We demonstrate its utility in the context of clustering, record matching, movie rating and movie genre prediction.
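The premise that an attribute's embedding can be trained from the other attributes of the same record can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a skip-gram-style objective with a full softmax, plain SGD, and a record embedding obtained by averaging attribute embeddings. The toy records, field names, function names, and hyperparameters are all hypothetical.

```python
import numpy as np

# Hypothetical toy records; the fields and values are illustrative only.
records = [
    {"genre": "comedy", "decade": "1990s", "country": "US"},
    {"genre": "comedy", "decade": "1990s", "country": "UK"},
    {"genre": "horror", "decade": "1980s", "country": "US"},
    {"genre": "horror", "decade": "1980s", "country": "JP"},
]

def build_vocab(records):
    """Map each (field, value) pair to an integer id."""
    vocab = {}
    for rec in records:
        for item in rec.items():
            vocab.setdefault(item, len(vocab))
    return vocab

def train_attribute_embeddings(records, dim=8, epochs=50, lr=0.1, seed=0):
    """Assumed objective: each attribute predicts the other attributes
    of the same record through a full softmax (skip-gram style)."""
    rng = np.random.default_rng(seed)
    vocab = build_vocab(records)
    n = len(vocab)
    w_in = rng.normal(scale=0.1, size=(n, dim))   # attribute embeddings
    w_out = rng.normal(scale=0.1, size=(n, dim))  # context (output) weights

    # Every ordered pair of distinct attributes in a record is one example.
    pairs = []
    for rec in records:
        ids = [vocab[item] for item in rec.items()]
        pairs += [(a, b) for a in ids for b in ids if a != b]

    for _ in range(epochs):
        for k in rng.permutation(len(pairs)):
            target, context = pairs[k]
            h = w_in[target].copy()
            scores = w_out @ h
            probs = np.exp(scores - scores.max())  # numerically stable softmax
            probs /= probs.sum()
            probs[context] -= 1.0      # cross-entropy gradient w.r.t. scores
            grad_h = w_out.T @ probs   # computed before w_out is updated
            w_out -= lr * np.outer(probs, h)
            w_in[target] -= lr * grad_h
    return vocab, w_in

def embed_record(record, vocab, w_in):
    """Assumed pooling: average the embeddings of the known attributes."""
    ids = [vocab[item] for item in record.items() if item in vocab]
    if not ids:
        return np.zeros(w_in.shape[1])
    return w_in[ids].mean(axis=0)

vocab, w_in = train_attribute_embeddings(records)
record_vector = embed_record(records[0], vocab, w_in)
```

The resulting `record_vector` is a dense vector of length `dim`, in contrast to a one-hot encoding whose length grows with the number of distinct attribute values. A full softmax is only practical for small vocabularies; negative sampling would be a common substitute at scale.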