Mona Nashaat, Aindrila Ghosh, James Miller, Shaikh Quader
{"title":"TabReformer:用于错误数据检测的无监督表示学习","authors":"Mona Nashaat, Aindrila Ghosh, James Miller, Shaikh Quader","doi":"10.1145/3447541","DOIUrl":null,"url":null,"abstract":"Error detection is a crucial preliminary phase in any data analytics pipeline. Existing error detection techniques typically target specific types of errors. Moreover, most of these detection models either require user-defined rules or ample hand-labeled training examples. Therefore, in this article, we present TabReformer, a model that learns bidirectional encoder representations for tabular data. The proposed model consists of two main phases. In the first phase, TabReformer follows encoder architecture with multiple self-attention layers to model the dependencies between cells and capture tuple-level representations. Also, the model utilizes a Gaussian Error Linear Unit activation function with the Masked Data Model objective to achieve deeper probabilistic understanding. In the second phase, the model parameters are fine-tuned for the task of erroneous data detection. The model applies a data augmentation module to generate more erroneous examples to represent the minority class. The experimental evaluation considers a wide range of databases with different types of errors and distributions. The empirical results show that our solution can enhance the recall values by 32.95% on average compared with state-of-the-art techniques while reducing the manual effort by up to 48.86%.","PeriodicalId":93404,"journal":{"name":"ACM/IMS transactions on data science","volume":"2 1","pages":"1 - 29"},"PeriodicalIF":0.0000,"publicationDate":"2021-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1145/3447541","citationCount":"1","resultStr":"{\"title\":\"TabReformer: Unsupervised Representation Learning for Erroneous Data Detection\",\"authors\":\"Mona Nashaat, Aindrila Ghosh, James Miller, Shaikh Quader\",\"doi\":\"10.1145/3447541\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Error detection is a crucial preliminary phase in any data analytics pipeline. Existing error detection techniques typically target specific types of errors. Moreover, most of these detection models either require user-defined rules or ample hand-labeled training examples. Therefore, in this article, we present TabReformer, a model that learns bidirectional encoder representations for tabular data. The proposed model consists of two main phases. In the first phase, TabReformer follows encoder architecture with multiple self-attention layers to model the dependencies between cells and capture tuple-level representations. Also, the model utilizes a Gaussian Error Linear Unit activation function with the Masked Data Model objective to achieve deeper probabilistic understanding. In the second phase, the model parameters are fine-tuned for the task of erroneous data detection. The model applies a data augmentation module to generate more erroneous examples to represent the minority class. The experimental evaluation considers a wide range of databases with different types of errors and distributions. The empirical results show that our solution can enhance the recall values by 32.95% on average compared with state-of-the-art techniques while reducing the manual effort by up to 48.86%.\",\"PeriodicalId\":93404,\"journal\":{\"name\":\"ACM/IMS transactions on data science\",\"volume\":\"2 1\",\"pages\":\"1 - 29\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-05-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://sci-hub-pdf.com/10.1145/3447541\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ACM/IMS transactions on data science\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3447541\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM/IMS transactions on data science","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3447541","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
TabReformer: Unsupervised Representation Learning for Erroneous Data Detection
Error detection is a crucial preliminary phase in any data analytics pipeline. Existing error detection techniques typically target specific types of errors. Moreover, most of these detection models either require user-defined rules or ample hand-labeled training examples. Therefore, in this article, we present TabReformer, a model that learns bidirectional encoder representations for tabular data. The proposed model consists of two main phases. In the first phase, TabReformer follows encoder architecture with multiple self-attention layers to model the dependencies between cells and capture tuple-level representations. Also, the model utilizes a Gaussian Error Linear Unit activation function with the Masked Data Model objective to achieve deeper probabilistic understanding. In the second phase, the model parameters are fine-tuned for the task of erroneous data detection. The model applies a data augmentation module to generate more erroneous examples to represent the minority class. The experimental evaluation considers a wide range of databases with different types of errors and distributions. The empirical results show that our solution can enhance the recall values by 32.95% on average compared with state-of-the-art techniques while reducing the manual effort by up to 48.86%.