How aspects of similar datasets can impact distributional models
Authors: Isabella Maria Alonso Gomes, N. T. Roman
Published in: Anais do XIX Encontro Nacional de Inteligência Artificial e Computacional (ENIAC 2022), 2022-11-28
DOI: https://doi.org/10.5753/eniac.2022.227085
Distributional models have become popular because the abstractions built around them allow immediate use, with good results and little implementation effort compared to precursor models. Given their presumed high level of generalization, one would expect similarly good results on data sets that share the same nature and purpose. However, this is not always the case. In this work, we present the results of applying BERTimbau to two related data sets built for the task of Semantic Similarity identification, with the goal of detecting redundancy in text. The results show considerable differences in accuracy between the two data sets. We explore aspects of the data sets that could explain these differences.