L. Khan, J. Partyka, Satyen Abrol, B. Thuraisingham
2010 IEEE International Conference on Data Mining Workshops, 2010-12-13. DOI: 10.1109/ICDMW.2010.204 (https://doi.org/10.1109/ICDMW.2010.204)
Geospatial Schema Matching with High-Quality Cluster Assurance and Location Mining from Social Network
In this talk, we present how semantics can improve the quality of the data mining process. We focus first on geospatial schema matching with high-quality cluster assurance, and then on location mining from social networks. With regard to the first problem, resolving semantic heterogeneity across distinct data sources remains a highly relevant problem in the GIS domain, and one that requires innovative solutions. Our approach, called GSim, semantically aligns tables from different GIS databases by first choosing attributes for comparison. We then examine the instances of those attributes and calculate a similarity value between them, called Entropy-Based Distribution (EBD), by combining two separate methods. Our primary method discerns the geographic types of the instances of the compared attributes. If geographic type matching is not possible, we apply a generic schema matching method that uses normalized Google distance together with a clustering process. GSim proceeds by deriving clusters from attribute instances based on their content and, where possible, their geographic types as gleaned from a gazetteer. However, clustering algorithms may produce inconsistent results because cluster quality varies. We apply novel metrics that measure cluster distance and purity to ensure high-quality, homogeneous clusters. The end result is a wholly geospatial similarity value, expressed as EBD. We show that our approach is more effective than the traditional N-gram approach on multi-jurisdictional datasets. With regard to the second problem, we predict a user's location from his or her social network (e.g., Twitter) using the theoretical framework of semi-supervised learning; in particular, we employ a label propagation algorithm. For privacy and security reasons, most people on social networking sites such as Twitter are unwilling to specify their locations explicitly.
On the city locations returned by the algorithm, the system performs agglomerative clustering based on geospatial proximity and the locations' individual scores, and returns clusters of locations with higher confidence. We perform extensive experiments to show the validity of our system in terms of both accuracy and running time. Experimental results show that our approach outperforms the content-based geo-tagging approach in both accuracy and running time.
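For the schema-matching half of the talk, the generic fallback relies on normalized Google distance (NGD). The abstract does not say how GSim obtains or combines the counts, so the following is only a minimal sketch of the standard NGD formula. The function name and the hit-count parameters are assumptions for illustration, not part of GSim.

```python
import math


def normalized_google_distance(fx: float, fy: float, fxy: float, n: float) -> float:
    """Standard NGD computed from hit counts (all inputs are hypothetical).

    fx, fy: number of pages containing each term
    fxy:    number of pages containing both terms
    n:      total number of pages indexed
    """
    if fx <= 0 or fy <= 0:
        raise ValueError("each term must occur at least once")
    if fxy <= 0:
        # Terms that never co-occur are treated as infinitely distant.
        return math.inf
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))
```

A value near 0 means the two terms almost always appear together. Larger values mean they are less related, which is what makes the measure usable as a distance for clustering attribute instances.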
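For the location-mining half of the talk, the core idea of label propagation can be sketched on a toy friendship graph. This is a simplified illustration under stated assumptions, not the authors' system: the function names, the uniform edge weights, and the fixed iteration count are all hypothetical, and the agglomerative clustering step is omitted.

```python
from collections import defaultdict


def propagate_locations(edges, known, iterations=20):
    """Spread city labels over an undirected friendship graph.

    edges: iterable of (user, user) pairs
    known: dict mapping some users to a known city
    Returns a dict user -> {city: score} for every user in the graph.
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    # Labeled users start with all mass on their city; the rest start empty.
    scores = {u: ({known[u]: 1.0} if u in known else {}) for u in neighbors}

    for _ in range(iterations):
        updated = {}
        for user, friends in neighbors.items():
            if user in known:
                updated[user] = scores[user]  # keep known labels fixed
                continue
            total = defaultdict(float)
            for friend in friends:
                for city, score in scores[friend].items():
                    total[city] += score / len(friends)  # average over friends
            updated[user] = dict(total)
        scores = updated
    return scores


def best_city(scores, user):
    """Return the highest-scoring city for a user, or None if no label reached them."""
    dist = scores[user]
    return max(dist, key=dist.get) if dist else None
```

With edges `ann-bob`, `bob-cat`, `cat-dan`, `bob-eve` and known cities for `ann`, `eve` (both Dallas) and `dan` (Austin), `bob` ends up closest to Dallas and `cat` to Austin. The per-city scores are the kind of confidence values a later clustering step could group by geographic proximity.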