Evaluation of Machine Learning Methods for Relation Extraction Between Drug Adverse Effects and Medications in Russian Texts of Internet User Reviews

A. Sboev, A. Selivanov, R. Rybka, I. Moloshnikov, Gleb Rylkov
Published in: Proceedings of The 5th International Workshop on Deep Learning in Computational Physics, PoS(DLCP2021), 2021-12-01
DOI: 10.22323/1.410.0006 (https://doi.org/10.22323/1.410.0006)
Citations: 1

Abstract

This study considers the automatic extraction of relations between mentions of medications and adverse drug reactions in Russian-language drug reviews. Such text analysis may be useful for pharmacovigilance and the reprofiling of medicines. Its application to Russian-language reviews has not yet been studied because of the lack of Russian corpora with relation annotation. The study aims to address this gap. It is based on an original dataset gathered by our group, which consists of annotated relations between entities from the Russian Drug Review Corpus, a collection of Internet users' reviews of medications in Russian. Computational experiments were carried out on the developed corpora using classical machine learning methods as well as a more advanced neural network model based on Transformer layers, XLM-RoBERTa-sag. The classical methods applied are the support vector machine, logistic regression, the Naive Bayes classifier, and gradient boosting. For these methods, the text was represented as the concatenation of the entities' TF-IDF vectors of character n-grams. The following hyperparameters were selected on the basis of a set of experiments: the n-gram size and the limits on the frequency of occurrence of n-grams (n-grams that were too rare or too frequent were excluded from the feature vector). For XLM-RoBERTa-sag, the input data is represented in the usual way for this type of model (language models based on the Transformer topology). The following input text representations were considered in the experiments: the whole text; the text of the target entity pair; the text of the target entity pair with the words between them; and the text of the target entity pair together with the whole input text. The last of these is the one that maximizes accuracy.
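The four input representations listed above can be made concrete with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' preprocessing: the names `span_of` and `build_inputs`, the character-offset span format, the `" [SEP] "` joiner, and the example review with the placeholder drug name `DrugX` are all hypothetical. The abstract does not say how the pieces are joined or marked for the model.

```python
from typing import Dict, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def span_of(text: str, phrase: str) -> Span:
    """Locate the first occurrence of `phrase` in `text` (hypothetical helper)."""
    start = text.find(phrase)
    if start < 0:
        raise ValueError(f"{phrase!r} not found in text")
    return (start, start + len(phrase))


def build_inputs(text: str, ent_a: Span, ent_b: Span,
                 sep: str = " [SEP] ") -> Dict[str, str]:
    """Build the four candidate input strings for one entity pair.

    `sep` is a placeholder joiner chosen for illustration only.
    """
    a_text = text[ent_a[0]:ent_a[1]]
    b_text = text[ent_b[0]:ent_b[1]]
    pair = a_text + sep + b_text
    # Slice from the start of the earlier entity to the end of the later one,
    # which keeps both entities and every word between them.
    left = min(ent_a[0], ent_b[0])
    right = max(ent_a[1], ent_b[1])
    return {
        "whole_text": text,
        "pair": pair,
        "pair_with_between": text[left:right],
        "pair_plus_whole_text": pair + sep + text,
    }


# Invented example review; "DrugX" is a placeholder, not a real medication.
review = "After taking DrugX I got a strong headache."
drug = span_of(review, "DrugX")
reaction = span_of(review, "headache")
variants = build_inputs(review, drug, reaction)
for name, value in variants.items():
    print(f"{name}: {value}")
```

Each variant would then be tokenized and passed to the classifier. According to the abstract, the last variant (entity pair together with the whole text) is the one that maximizes accuracy.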
The XLM-RoBERTa-sag model is shown to achieve a macro-averaged F1 score of 95%, which is the state-of-the-art result for recognizing relations between mentions of adverse drug reactions and medications in Russian-language online reviews. The Naive Bayes classifier with a multivariate normal distribution achieves the best result among the classical machine learning methods: 75%, which exceeds the result of random label generation by 21%.
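For reference, the macro-averaged F1 is the unweighted mean of the per-class F1 scores, so every class counts equally regardless of how many examples it has. The sketch below computes it in plain Python. The function name `macro_f1`, the label names, and the toy predictions are invented for illustration and have no connection to the paper's data or code.

```python
from typing import Hashable, Sequence


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in either sequence."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one example is required")
    labels = set(y_true) | set(y_pred)
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data: four "related" pairs, one "unrelated"; the model predicts
# the majority label for everything.
y_true = ["related", "related", "related", "related", "unrelated"]
y_pred = ["related"] * 5
score = macro_f1(y_true, y_pred)
print(round(score, 3))
```

In this toy case, predicting the majority label everywhere is correct on four of five examples, yet the macro-averaged F1 is only 4/9 (about 0.44), because the F1 for the missed class is zero. This is why the metric suits relation extraction tasks with unbalanced classes.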