Improving The Performance of Semantic Text Similarity Tasks on Short Text Pairs

Mohamed Taher Gamal, Passent El-Kafrawy
{"title":"Improving The Performance of Semantic Text Similarity Tasks on Short Text Pairs","authors":"Mohamed Taher Gamal, Passent El-Kafrawy","doi":"10.1109/ESOLEC54569.2022.10009072","DOIUrl":null,"url":null,"abstract":"Training semantic similarity model to detect duplicate text pairs is a challenging task as almost all of datasets are imbalanced, by data nature positive samples are fewer than negative samples, this issue can easily lead to model bias. Using traditional pairwise loss functions like pairwise binary cross entropy or Contrastive loss on imbalanced data may lead to model bias, however triplet loss showed improved performance compared to other loss functions. In triplet loss-based models data is fed to the model as follow: anchor sentence, positive sentence and negative sentence. The original data is permutated to follow the input structure. The default structure of training samples data is 363,861 training samples (90% of the data) distributed as 134,336 positive samples and 229,524 negative samples. The triplet structured data helped to generate much larger amount of balanced training samples 456,219. The test results showed higher accuracy and f1 scores in testing. We fine-tunned RoBERTa pre trained model using Triplet loss approach, testing showed better results. The best model scored 89.51 F1 score, and 91.45 Accuracy compared to 86.74 F1 score and 87.45 Accuracy in the second-best Contrastive loss-based BERT model.","PeriodicalId":179850,"journal":{"name":"2022 20th International Conference on Language Engineering (ESOLEC)","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2022-10-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 20th International Conference on Language Engineering (ESOLEC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ESOLEC54569.2022.10009072","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Training a semantic similarity model to detect duplicate text pairs is challenging because almost all such datasets are imbalanced: by the nature of the data, positive samples are fewer than negative samples, and this imbalance can easily bias the model. Using traditional pairwise loss functions, such as pairwise binary cross-entropy or contrastive loss, on imbalanced data may lead to model bias, whereas triplet loss showed improved performance over the other loss functions. In triplet-loss-based models, each training example is fed to the model as an anchor sentence, a positive sentence, and a negative sentence, so the original data is permuted to follow this input structure. The default training set contains 363,861 samples (90% of the data), distributed as 134,336 positive and 229,524 negative samples; restructuring the data into triplets produced a much larger balanced set of 456,219 training samples. We fine-tuned a pre-trained RoBERTa model using the triplet loss approach, and testing showed higher accuracy and F1 scores. The best model scored 89.51 F1 and 91.45 accuracy, compared to 86.74 F1 and 87.45 accuracy for the second-best model, a contrastive-loss-based BERT.
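To make the described pipeline concrete, the following is a minimal sketch in Python of one plausible reading of it, using the sentence-transformers library. The build_triplets helper, the roberta-base checkpoint, and all hyperparameters are illustrative assumptions; the paper does not publish its exact permutation procedure or training configuration.

# Minimal sketch (not the authors' code): build (anchor, positive, negative)
# triplets from labeled sentence pairs and fine-tune RoBERTa with triplet loss.
from collections import defaultdict

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models


def build_triplets(pairs):
    # Join positive and negative pairs that share an anchor sentence; this is
    # one straightforward way to "permute" pairwise data into triplets.
    positives, negatives = defaultdict(list), defaultdict(list)
    for sent_a, sent_b, label in pairs:
        (positives if label == 1 else negatives)[sent_a].append(sent_b)
    return [
        InputExample(texts=[anchor, pos, neg])
        for anchor, pos_list in positives.items()
        for pos in pos_list
        for neg in negatives.get(anchor, [])
    ]


# RoBERTa encoder with mean pooling on top, a common sentence-embedding setup.
word_embedding = models.Transformer("roberta-base", max_seq_length=64)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding, pooling])

# Hypothetical (sentence_a, sentence_b, label) rows; label 1 = duplicate pair.
train_pairs = [
    ("how do i learn python", "best way to learn python", 1),
    ("how do i learn python", "what is the capital of egypt", 0),
]
train_loader = DataLoader(build_triplets(train_pairs), shuffle=True, batch_size=32)

# Triplet loss pulls the anchor toward the positive and pushes it at least a
# margin away from the negative (Euclidean distance, margin 5 by default).
train_loss = losses.TripletLoss(model=model)

model.fit(train_objectives=[(train_loader, train_loss)], epochs=1, warmup_steps=100)

Joining positive and negative pairs on a shared anchor multiplies the number of usable training examples, which is consistent with the growth from 363,861 pairwise samples to 456,219 balanced triplets reported above.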