A. Selivanov, A. Gryaznov, R. Rybka, A. Sboev, S. Sboeva, Yuliya Klueva
{"title":"基于多语言模型的药理学重要信息文本关系提取","authors":"A. Selivanov, A. Gryaznov, R. Rybka, A. Sboev, S. Sboeva, Yuliya Klueva","doi":"10.22323/1.429.0014","DOIUrl":null,"url":null,"abstract":"In this paper we estimate the accuracy of the relation extraction from texts containing pharmacologically significant information on base of the expanded version of RDRS corpus, which contains texts of internet reviews on medications in Russian. The accuracy of relation extraction is estimated and compared for two multilingual language models: XLM-RoBERTa-large and XLM-RoBERTa-large-sag. Earlier research proved XLM-RoBERTa-large-sag to be the most efficient language model for the previous version of the RDRS dataset for relation extraction using a ground-truth named entities annotation. In the current work we use two-step relation extraction approach: automated named entity recognition and extraction of relations between predicted entities. The implemented approach has given an opportunity to estimate the accuracy of the proposed solution to the relation extraction problem, as well as to estimate the accuracy at each step of the analysis. As a result, it is shown, that multilingual XLM-RoBERTa-large-sag model achieves relation extraction macro-averaged f1-score equals to 86.4% on the ground-truth named entities, 60.1% on the predicted named entities on the new version of the RDRS corpus contained more than 3800 annotated texts. Consequently, implemented approach based on the","PeriodicalId":262901,"journal":{"name":"Proceedings of The 6th International Workshop on Deep Learning in Computational Physics — PoS(DLCP2022)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Relation Extraction from Texts Containing Pharmacologically Significant Information on base of Multilingual Language Models\",\"authors\":\"A. Selivanov, A. Gryaznov, R. Rybka, A. Sboev, S. Sboeva, Yuliya Klueva\",\"doi\":\"10.22323/1.429.0014\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this paper we estimate the accuracy of the relation extraction from texts containing pharmacologically significant information on base of the expanded version of RDRS corpus, which contains texts of internet reviews on medications in Russian. The accuracy of relation extraction is estimated and compared for two multilingual language models: XLM-RoBERTa-large and XLM-RoBERTa-large-sag. Earlier research proved XLM-RoBERTa-large-sag to be the most efficient language model for the previous version of the RDRS dataset for relation extraction using a ground-truth named entities annotation. In the current work we use two-step relation extraction approach: automated named entity recognition and extraction of relations between predicted entities. The implemented approach has given an opportunity to estimate the accuracy of the proposed solution to the relation extraction problem, as well as to estimate the accuracy at each step of the analysis. As a result, it is shown, that multilingual XLM-RoBERTa-large-sag model achieves relation extraction macro-averaged f1-score equals to 86.4% on the ground-truth named entities, 60.1% on the predicted named entities on the new version of the RDRS corpus contained more than 3800 annotated texts. 
Consequently, implemented approach based on the\",\"PeriodicalId\":262901,\"journal\":{\"name\":\"Proceedings of The 6th International Workshop on Deep Learning in Computational Physics — PoS(DLCP2022)\",\"volume\":\"11 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-11-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of The 6th International Workshop on Deep Learning in Computational Physics — PoS(DLCP2022)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.22323/1.429.0014\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of The 6th International Workshop on Deep Learning in Computational Physics — PoS(DLCP2022)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.22323/1.429.0014","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Relation Extraction from Texts Containing Pharmacologically Significant Information on base of Multilingual Language Models
In this paper we estimate the accuracy of relation extraction from texts containing pharmacologically significant information, based on the expanded version of the RDRS corpus, which contains internet reviews of medications in Russian. The accuracy of relation extraction is estimated and compared for two multilingual language models: XLM-RoBERTa-large and XLM-RoBERTa-large-sag. Earlier research showed XLM-RoBERTa-large-sag to be the most efficient language model for relation extraction on the previous version of the RDRS dataset when using ground-truth named entity annotations. In the current work we use a two-step relation extraction approach: automated named entity recognition followed by extraction of relations between the predicted entities. The implemented approach makes it possible to estimate the accuracy of the proposed solution to the relation extraction problem, as well as the accuracy at each step of the analysis. As a result, it is shown that the multilingual XLM-RoBERTa-large-sag model achieves a macro-averaged relation extraction F1-score of 86.4% on the ground-truth named entities and 60.1% on the predicted named entities on the new version of the RDRS corpus, which contains more than 3800 annotated texts. Consequently, the implemented approach based on the …
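The abstract describes a two-step pipeline: a named entity recognition model first predicts entity spans in a review, and a second classifier then labels relations between pairs of those entities. The sketch below is a minimal illustration of such a pipeline with Hugging Face transformers, together with a macro-averaged F1 evaluation; the checkpoint names, the entity-marker scheme, and the relation labels are hypothetical placeholders, not the authors' released code or models.

# Illustrative two-step relation extraction sketch (not the authors' code).
# Step 1: NER with a fine-tuned XLM-RoBERTa token-classification model.
# Step 2: relation classification over pairs of predicted entities.
from typing import Dict, List, Tuple

from transformers import pipeline
from sklearn.metrics import f1_score

# Hypothetical fine-tuned checkpoints; the paper's actual models are not named here.
ner = pipeline(
    "token-classification",
    model="xlm-roberta-large-finetuned-rdrs-ner",      # placeholder name
    aggregation_strategy="simple",
)
rel_clf = pipeline(
    "text-classification",
    model="xlm-roberta-large-sag-finetuned-rdrs-re",   # placeholder name
)

def mark_pair(text: str, e1: Dict, e2: Dict) -> str:
    """Wrap two entity spans in [E1]...[/E1] and [E2]...[/E2] markers."""
    # Insert markers starting from the rightmost span so earlier offsets stay valid.
    for ent, tag in sorted([(e1, "E1"), (e2, "E2")],
                           key=lambda p: p[0]["start"], reverse=True):
        text = (text[:ent["start"]] + f"[{tag}]"
                + text[ent["start"]:ent["end"]] + f"[/{tag}]"
                + text[ent["end"]:])
    return text

def extract_relations(text: str) -> List[Tuple[Dict, Dict, str]]:
    """Predict entities, then classify the relation for every entity pair."""
    entities = ner(text)
    relations = []
    for i, e1 in enumerate(entities):
        for e2 in entities[i + 1:]:
            label = rel_clf(mark_pair(text, e1, e2))[0]["label"]
            if label != "no_relation":
                relations.append((e1, e2, label))
    return relations

# Macro-averaged F1 over relation labels, the metric behind the reported
# 86.4% (ground-truth entities) and 60.1% (predicted entities) scores.
# Labels here are illustrative placeholders aligned on gold entity pairs.
gold = ["ADR-Drugname", "no_relation", "Drugname-SourceInfodrug"]
pred = ["ADR-Drugname", "no_relation", "no_relation"]
print(f1_score(gold, pred, average="macro"))

In the ground-truth setting, step 1 is bypassed and the gold entity spans are fed directly to the relation classifier; the gap between the two reported scores therefore reflects how errors from the named entity recognition step propagate into relation classification.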