Large Database Schema Matching using Data Mining Techniques

2018 IEEE International Conference on Data Mining Workshops (ICDMW). Pub Date: 2018-11-01. DOI: 10.1109/ICDMW.2018.00083

Debora G. Reis, M. Ladeira, M. Holanda, M. Victorino


Citations: 3

Abstract

Large Database Schema Matching using Data Mining Techniques

With the expanding diversity of database technologies and database sizes, it is becoming increasingly hard to identify similar relational databases among many large databases stored in different Database Management Systems (DBMS). Therefore, we propose to use data mining techniques to automatically identify relational databases with similar structures by comparing their metadata, which is composed of physical details of the databases. The amount of metadata is proportional to the size of the schema structure, and the number of possible pairwise comparisons is quadratic in the number of schemas analyzed. Looking for the most efficient technique, we propose to calculate schema similarity by evaluating the distance from every schema to just one schema, which serves as a starting point. Schemas with smaller distances are more similar than schemas with larger distances. We compare this proposal against two other approaches. The first approach compares every schema against all other schemas, excluding the inverse of each comparison. The second approach compares schemas within a group of schemas of similar size. To validate our proposal, an experiment is performed with 354 real schemas ranging in size from 2 to 20 thousand metadata entries, totaling more than 26 thousand tables and 238 thousand columns. Those schemas came from 5 different DBMS. The extracted metadata is transformed and formatted for comparing pairs of schemas. The textual features are compared using Cosine Distance and the numerical features are compared using Euclidean Distance. Then, hierarchical clustering is used to facilitate the visualization of the schemas that most closely resemble one another. Results showed that our approach was the most efficient, because it compared all schemas and identified the most structurally similar ones in less than 2 minutes. The extracted metadata was used to create the first version of the metadata repository and an initial version of a data catalog, which contributed to the knowledge of existing data. Using this procedure, duplicated schemas were discovered and then discontinued, resulting in a cost savings of 10% while freeing up infrastructure resources. This solution is flexible: it supports a variety of schema sizes and DBMS.
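The single-reference comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the schema names, the `names` and `counts` fields, the toy data, and the choice to rank by textual distance first and numerical distance second are all assumptions made for illustration, and the hierarchical clustering step is omitted. The sketch computes Cosine Distance over bags of metadata tokens and Euclidean Distance over numeric features, then ranks every schema against one reference schema.

```python
import math
from collections import Counter


def cosine_distance(tokens_a, tokens_b):
    """Return 1 - cosine similarity between two bags of tokens."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty bag as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def euclidean_distance(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)))


def rank_by_reference(schemas, reference):
    """Compare every schema to one reference schema (n - 1 comparisons).

    Returns (name, text_distance, numeric_distance) tuples, closest first.
    Sorting by text distance, then numeric distance, is an assumption.
    """
    ref = schemas[reference]
    scored = []
    for name, meta in schemas.items():
        if name == reference:
            continue
        text_d = cosine_distance(ref["names"], meta["names"])
        num_d = euclidean_distance(ref["counts"], meta["counts"])
        scored.append((name, text_d, num_d))
    return sorted(scored, key=lambda row: (row[1], row[2]))


def comparison_counts(n):
    """Comparisons needed: single reference vs. all pairs without inverses."""
    return n - 1, n * (n - 1) // 2


# Hypothetical toy metadata: column-name tokens and [tables, columns] counts.
schemas = {
    "hr_prod": {"names": ["employee", "id", "name", "salary"], "counts": [1, 4]},
    "hr_copy": {"names": ["employee", "id", "name", "salary"], "counts": [1, 4]},
    "sales": {"names": ["order", "id", "amount"], "counts": [1, 3]},
}

ranked = rank_by_reference(schemas, "hr_prod")
for name, text_d, num_d in ranked:
    print(f"{name}: cosine={text_d:.3f} euclidean={num_d:.3f}")

# For the 354 schemas in the experiment: 353 comparisons vs. 62481 pairs.
print(comparison_counts(354))
```

In this toy data the duplicated schema `hr_copy` ranks first with both distances equal to zero, which mirrors how duplicated schemas could surface at the top of such a ranking. The comparison counts show why a single reference point scales linearly while the all-pairs approach grows quadratically.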
