L. Nguyen, Ban Phuoc Dao, Duc-Vu Nguyen, N. Nguyen
Vietnamese Context-Sensitive Malicious Spelling Error Correction
2020 7th NAFOSTED Conference on Information and Computer Science (NICS), published 2020-11-26
DOI: 10.1109/NICS51282.2020.9335909 (https://doi.org/10.1109/NICS51282.2020.9335909)
Spelling errors that users deliberately introduce into specific keywords have seriously degraded the performance of social media control systems. In this paper, we show the severe effect of these misspellings and propose a context-based spelling correction approach for the targeted words that uses word embeddings. The data used in this work are Vietnamese spam emails and hate speech. We also introduce a new and effective way to extract real misspellings in order to create reasonable synthetic data for our experiments. Our correction system performs favorably compared to Google on both synthetic and real data.
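As a rough illustration of what context-based correction with word embeddings can look like, the sketch below picks, among targeted keywords within a small edit distance of a suspicious token, the one whose embedding is closest to the average embedding of the surrounding words. This is not the paper's implementation: the abstract does not describe the algorithm in detail, and the toy vectors, the `TARGET_WORDS` list, the edit-distance threshold, and every function name here are assumptions made for the example.

```python
import math

# Hypothetical 2-d embeddings, chosen by hand for illustration only.
EMBEDDINGS = {
    "free": (0.9, 0.1),
    "tree": (0.1, 0.9),
    "offer": (0.8, 0.2),
    "discount": (0.85, 0.15),
    "forest": (0.2, 0.8),
    "leaf": (0.15, 0.85),
}

# Hypothetical list of keywords that a filter cares about.
TARGET_WORDS = ["free", "tree"]


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def context_vector(tokens):
    """Mean embedding of the known tokens, or None if none are known."""
    vecs = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vecs:
        return None
    return tuple(sum(v[k] for v in vecs) / len(vecs) for k in range(len(vecs[0])))


def correct(tokens, index, max_distance=1):
    """Return a replacement for tokens[index], or the token itself."""
    word = tokens[index]
    candidates = [w for w in TARGET_WORDS if edit_distance(word, w) <= max_distance]
    if not candidates:
        return word
    ctx = context_vector(tokens[:index] + tokens[index + 1:])
    if ctx is None:
        # No usable context: fall back to the closest candidate by spelling.
        return min(candidates, key=lambda w: edit_distance(word, w))
    return max(candidates, key=lambda w: cosine(EMBEDDINGS[w], ctx))


print(correct(["xree", "offer", "discount"], 0))
print(correct(["xree", "forest", "leaf"], 0))
```

The token `xree` is one edit away from both `free` and `tree`, so spelling alone cannot decide between them; only the surrounding words do. A real system would use embeddings trained on Vietnamese text and a candidate generator suited to Vietnamese orthography.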