Two-phase Parallel Learning to Identify Similar Structures Among Relational Databases
Debora G. Reis, Rommel N. Carvalho, Ricardo Silva Carvalho, M. Ladeira
2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 1020-1023
Published: 2017-12-01
DOI: https://doi.org/10.1109/ICMLA.2017.00-17
Citations: 3
Abstract
The need for efficient techniques for handling large databases grows with the number of such databases. We propose a new two-phase parallel learning approach to quickly identify similar structures among relational databases. Each phase represents a level of relational metadata aggregation. To test the approach, we ran an experiment on several large databases of the Ministry of Social Development of Brazil to classify which relational databases have a similar structure of tables and columns, based on their metadata. Similarity was measured with Levenshtein distance and cosine similarity. Generalized Linear Model, Random Forest, and Gradient Boosting Machine (GBM) techniques were applied to develop the model. Each model was executed with both sequential and parallel processing, and their performance was compared. In the results, the parallel execution of GBM was at least ten times faster than the sequential execution. The results encourage further applications of propositional parallel learning in relational databases.
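The abstract names the two similarity measures but does not say how they are applied to the metadata. As a minimal sketch, assuming Levenshtein distance is computed between individual table or column names and cosine similarity between bags of column names, the two measures could look as follows in Python. The function names and the sample column names are hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import sqrt
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def cosine_similarity(left: Sequence[str], right: Sequence[str]) -> float:
    """Cosine similarity between two bags of names (assumed representation)."""
    lv, rv = Counter(left), Counter(right)
    dot = sum(lv[k] * rv[k] for k in lv.keys() & rv.keys())
    norm = (sqrt(sum(v * v for v in lv.values()))
            * sqrt(sum(v * v for v in rv.values())))
    return dot / norm if norm else 0.0


# Hypothetical column lists for two tables in different databases.
table_a = ["id", "name", "cpf"]
table_b = ["id", "name", "city"]

print(levenshtein("beneficiary_id", "beneficiary_cd"))
print(cosine_similarity(table_a, table_b))
```

Scores of this kind could serve as input features for the classifiers listed in the abstract; how the authors actually aggregate them in each of the two phases is not described here.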