Corpus Database Management Design for Chinese-Portuguese Bidirectional Parallel Corpora
Lap-Man Hoi, Wei Ke, S. Im
2023 IEEE 3rd International Conference on Computer Communication and Artificial Intelligence (CCAI), published 2023-05-26
DOI: https://doi.org/10.1109/CCAI57533.2023.10201319
As deep learning techniques continue to mature, machine translation (MT) is gaining popularity among translators. However, the accuracy of machine translation depends not only on the size of the parallel corpus but also on its quality. The management of these massive parallel corpora is often neglected because suitable tools are lacking. As a result, many conflicting and confusing parallel corpora are used together in training, which adversely affects the MT engines. This study therefore proposes a novel parallel corpus database design aimed at assisting data management efforts. In a series of experimental tests, the proposed database design effectively generated domain-specific MT models with better BiLingual Evaluation Understudy (BLEU) scores than other models. Furthermore, the design makes it possible to analyze, validate, and evaluate the quality of parallel corpora within the database engine.
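The abstract does not describe the schema itself, so the following is only a minimal sketch of how a database could support the two tasks it mentions: flagging conflicting sentence pairs and selecting a domain-specific training subset. The table name `sentence_pair`, its columns, the helper functions, and the sample sentences are all assumptions made for illustration, not the authors' design.

```python
import sqlite3


def build_corpus_db(pairs):
    """Load (domain, zh, pt) tuples into an in-memory SQLite database.

    Hypothetical schema: one row per Chinese-Portuguese sentence pair,
    tagged with a domain label. Exact duplicates are silently skipped.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE sentence_pair (
            id     INTEGER PRIMARY KEY,
            domain TEXT NOT NULL,
            zh     TEXT NOT NULL,
            pt     TEXT NOT NULL,
            UNIQUE (domain, zh, pt)
        )
        """
    )
    conn.executemany(
        "INSERT OR IGNORE INTO sentence_pair (domain, zh, pt) VALUES (?, ?, ?)",
        pairs,
    )
    conn.commit()
    return conn


def conflicting_sources(conn):
    """Return (zh, n_targets) for Chinese sentences with more than one
    distinct Portuguese translation anywhere in the corpus."""
    return conn.execute(
        """
        SELECT zh, COUNT(DISTINCT pt) AS n_targets
        FROM sentence_pair
        GROUP BY zh
        HAVING COUNT(DISTINCT pt) > 1
        ORDER BY zh
        """
    ).fetchall()


def domain_subset(conn, domain):
    """Return the (zh, pt) pairs for one domain, in insertion order."""
    return conn.execute(
        "SELECT zh, pt FROM sentence_pair WHERE domain = ? ORDER BY id",
        (domain,),
    ).fetchall()


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [
        ("legal", "合同", "contrato"),
        ("legal", "合同", "contrato"),   # exact duplicate, ignored on insert
        ("general", "合同", "acordo"),   # same source, different target
        ("general", "你好", "olá"),
    ]
    db = build_corpus_db(sample)
    print(conflicting_sources(db))
    print(domain_subset(db, "legal"))
```

In this sketch a source sentence with several distinct targets is only flagged, not removed, since a different translation in a different domain may be legitimate. Filtering by domain before training is one way such a store could yield domain-specific models.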