A Conceptual Framework For Malay-English Mixed-language Question Answering System

H. T. Lim, S. Huspi, R. Ibrahim
Published in: 2021 International Congress of Advanced Technology and Engineering (ICOTEN)
Publication date: 2021-07-04
DOI: 10.1109/ICOTEN52080.2021.9493503 (https://doi.org/10.1109/ICOTEN52080.2021.9493503)

Abstract

Mixed language, the combination of two or more languages in spoken or written form, has become a common way of communicating. It is widely used on social media and forums to ease communication and widen the range of expression. Current question answering (QA) systems support only monolingual queries, which prevents multilingual users from interacting with them naturally. Interest in multilingual QA systems has grown in recent years, and the most common solution combines translation models with machine learning algorithms for question classification. However, when a single sentence contains words from more than one language, these systems struggle to distinguish code-switched sentences from monolingual ones, and mistranslated questions produced by machine translation carry only limited language context. Informal mixed-language text that breaks the linguistic rules of the individual languages is a further challenge for automated QA systems, which must translate each question and extract an answer for it. In addition, the lack of public mixed-language resources such as a chunker, a POS tagger, and a WordNet, especially for Malay-English, lowers the performance of translation and classification models. Machine learning algorithms for question classification also require a large amount of structured training data to perform well, which is impractical in the Malay-English mixed-language domain because such datasets are still scarce. To address these problems, we propose a framework that combines enhanced translation models with deep learning, using Convolutional Neural Networks (CNN) to classify Malay-English mixed-language questions so that the best answer can be generated.
The first part studies the machine translation model, for which word-level language identification and text normalization of Malay-English mixed-language questions will be developed. The second part focuses on the deep learning algorithm, exploring CNN as the classification model that helps select the best answer for the translated questions. This paper thus presents a framework consisting of an enhanced translation model for Malay-English and an end-to-end Malay-English mixed-language question answering system. The research is expected to contribute to multilingual forum platforms and to intelligent Q&A systems (chatbots).
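To make the first stage more concrete, the following is a minimal sketch of word-level language identification preceded by text normalization. It is not the authors' implementation: the abstract does not specify the method, so a simple dictionary lookup is assumed here purely for illustration. The lexicons, the normalization table, and all function names (`tokenize`, `normalize`, `identify_language`, `tag_question`, `is_code_switched`) are hypothetical.

```python
import re

# Hypothetical toy lexicons (assumption); a real system would need far
# larger resources than these few entries.
MALAY_WORDS = {"macam", "mana", "nak", "beli", "boleh", "tak", "saya", "apa"}
ENGLISH_WORDS = {"how", "to", "buy", "ticket", "online", "what", "where"}

# Hypothetical table mapping informal spellings to standard forms (assumption).
NORMALIZE = {"mcm": "macam", "nk": "nak", "blh": "boleh", "x": "tak"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def normalize(token):
    """Replace a known informal spelling with its standard form."""
    return NORMALIZE.get(token, token)


def identify_language(token):
    """Tag one token as Malay ('ms'), English ('en'), or unknown ('unk')."""
    in_malay = token in MALAY_WORDS
    in_english = token in ENGLISH_WORDS
    if in_malay and not in_english:
        return "ms"
    if in_english and not in_malay:
        return "en"
    return "unk"  # ambiguous or out-of-vocabulary


def tag_question(text):
    """Normalize every token, then attach a word-level language tag."""
    return [(tok, identify_language(tok)) for tok in map(normalize, tokenize(text))]


def is_code_switched(tagged):
    """A question counts as code-switched if it contains both languages."""
    languages = {lang for _, lang in tagged}
    return {"ms", "en"} <= languages


if __name__ == "__main__":
    tagged = tag_question("Mcm mana nk beli ticket online?")
    print(tagged)
    print(is_code_switched(tagged))
```

In a pipeline of the kind the abstract outlines, tags like these could tell the translation step which words need translating before the question reaches the CNN classifier. A lookup table cannot resolve words shared by both languages, which is one reason a learned identifier would likely be needed in practice.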