Semantic classification of Indonesian consumer health questions.

Impact factor 2.0 · Zone 3 (Engineering & Technology) · JCR Q3 (Mathematical & Computational Biology)

Journal of Biomedical Semantics Pub Date : 2025-07-28 DOI:10.1186/s13326-025-00334-5

Raniah Nur Hanami, Rahmad Mahendra, Alfan Farizki Wicaksono

Journal of Biomedical Semantics 16(1), 13 (2025). Journal Article. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12302743/pdf/

Abstract

Purpose: Online consumer health forums serve as a way for the public to connect with medical professionals. While these medical forums offer a valuable service, online Question Answering (QA) forums can struggle to deliver timely answers due to the limited number of available healthcare professionals. One way to solve this problem is by developing an automatic QA system that can provide patients with quicker answers. One key component of such a system could be a module for classifying the semantic type of a question. This would allow the system to understand the patient's intent and route them towards the relevant information.

Methods: This paper proposes a novel two-step approach to address the challenge of semantic type classification in Indonesian consumer health questions. We acknowledge the scarcity of Indonesian health domain data, a hurdle for machine learning models. To address this gap, we first introduce a novel corpus of annotated Indonesian consumer health questions. Second, we utilize this newly created corpus to build and evaluate a data-driven predictive model for classifying question semantic types. To enhance the trustworthiness and interpretability of the model's predictions, we employ an explainable model framework, LIME. This framework facilitates a deeper understanding of the role played by word-based features in the model's decision-making process. Additionally, it empowers us to conduct a comprehensive bias analysis, allowing for the detection of "semantic bias", where words with no inherent association with a specific semantic type disproportionately influence the model's predictions.
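The idea of measuring how much a single word drives a prediction can be illustrated with a much simpler stand-in for LIME. The sketch below is not LIME itself (LIME fits a local surrogate model over many random perturbations); it only drops one word at a time and records the change in the predicted probability. The function names, the toy classifier, its weights, and the example question are all hypothetical and chosen for illustration only.

```python
def word_influence(question, predict_proba):
    """Leave-one-word-out influence: how far the class probability falls
    when each word is removed. A simplified stand-in for LIME, not LIME."""
    tokens = question.split()
    base = predict_proba(" ".join(tokens))
    influence = {}
    for i, token in enumerate(tokens):
        reduced = " ".join(tokens[:i] + tokens[i + 1:])
        # Repeated words overwrite earlier entries; acceptable for a sketch.
        influence[token] = base - predict_proba(reduced)
    return influence


def toy_diagnosis_proba(text):
    """Hypothetical classifier: probability that a question is DIAGNOSIS.
    The weights are invented; a real model would be trained on the corpus."""
    weights = {"kanker": 0.5, "gejala": 0.3}
    words = text.split()
    return min(1.0, 0.1 + sum(w for word, w in weights.items() if word in words))


if __name__ == "__main__":
    scores = word_influence("apa gejala kanker", toy_diagnosis_proba)
    for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(f"{word}: {score:.2f}")
```

In this toy setup the disease name "kanker" has the largest influence on the DIAGNOSIS score even though a disease name alone says little about what kind of question is being asked, which is the pattern the abstract calls "semantic bias".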

Results: The annotation process revealed moderate agreement between expert annotators. In addition, not all words with high LIME probability could be considered true characteristics of a question type. This suggests a potential bias in the data used and the machine learning models themselves. Notably, XGBoost, Naïve Bayes, and MLP models exhibited a tendency to predict questions containing the words "kanker" (cancer) and "depresi" (depression) as belonging to the DIAGNOSIS category. In terms of prediction performance, Perceptron and XGBoost emerged as the top-performing models, achieving the highest weighted average F1 scores across all input scenarios and weighting factors. Naïve Bayes performed best after balancing the data with Borderline SMOTE, indicating its promise for handling imbalanced datasets.
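For readers unfamiliar with the metric, the weighted-average F1 score mentioned above is commonly computed as the per-class F1 weighted by each class's share of the true labels. The following is a minimal pure-Python sketch of that standard definition, not the authors' evaluation code; the label "TREATMENT" is a hypothetical second class used only to make the example concrete.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Weighted-average F1: per-class F1, weighted by class support in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    pairs = list(zip(y_true, y_pred))
    score = 0.0
    for label in support:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * support[label] / total
    return score


if __name__ == "__main__":
    # Hypothetical labels; only DIAGNOSIS is named in the abstract.
    y_true = ["DIAGNOSIS", "DIAGNOSIS", "DIAGNOSIS", "TREATMENT"]
    y_pred = ["DIAGNOSIS", "DIAGNOSIS", "TREATMENT", "TREATMENT"]
    print(round(weighted_f1(y_true, y_pred), 4))
```

Because each class is weighted by its frequency, a model can reach a high weighted F1 while doing poorly on rare classes, which is one reason resampling methods such as Borderline SMOTE are evaluated alongside it.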

Conclusion: We constructed a corpus of query semantics in the domain of Indonesian consumer health, containing 964 questions annotated with their corresponding semantic types. This corpus served as the foundation for building a predictive model. We further investigated the impact of disease-biased words on model performance. These words exhibited high LIME scores, yet lacked association with a specific semantic type. We trained models using datasets with and without these biased words and found no significant difference in model performance between the two scenarios, suggesting that the models might possess an ability to mitigate the influence of such bias during the learning process.
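The with/without comparison described above requires a second copy of the data in which the biased words have been removed. A minimal sketch of that preprocessing step follows, assuming whitespace tokenization and lowercasing; the abstract does not describe the authors' actual preprocessing, and the function name and example question are hypothetical.

```python
def remove_biased_words(question, biased_words):
    """Return the lowercased question with any token in biased_words dropped.
    Trailing punctuation is ignored when matching a token."""
    tokens = question.lower().split()
    kept = [t for t in tokens if t.strip(".,?!") not in biased_words]
    return " ".join(kept)


if __name__ == "__main__":
    # "kanker" and "depresi" are the two words named in the abstract.
    biased = {"kanker", "depresi"}
    print(remove_biased_words("Apakah kanker bisa sembuh?", biased))
```

Training the same model once on the original questions and once on the filtered ones, then comparing scores, gives the kind of two-scenario comparison the conclusion reports.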

Source journal: Journal of Biomedical Semantics (Mathematical & Computational Biology)

CiteScore: 4.20

Self-citation rate: 5.30%

Review time: 30 weeks

About the journal: Journal of Biomedical Semantics addresses issues of semantic enrichment and semantic processing in the biomedical domain. The scope of the journal covers two main areas. Infrastructure for biomedical semantics: focusing on semantic resources and repositories, meta-data management and resource description, knowledge representation and semantic frameworks, the Biomedical Semantic Web, and semantic interoperability. Semantic mining, annotation, and analysis: focusing on approaches and applications of semantic resources, and tools for investigation, reasoning, prediction, and discoveries in biomedicine.