Named Entity Recognition for Biomedical Patent Text using Bi-LSTM Variants
Farag Saad. Proceedings of the 21st International Conference on Information Integration and Web-based Applications & Services, 2019-12-02. https://doi.org/10.1145/3366030.3366104
Recent years have seen a substantial increase in biomedical publications (patents and scientific articles), with new ones appearing daily. This has led to increased interest in extracting meaningful information (e.g., named entities) from these publications. Traditional NER approaches demand considerable engineering skill and domain expertise to design the rules and features on which their accuracy depends. In addition, the structure and linguistic complexity of patent text often make constructing such rules and features challenging. In this paper, we investigate the performance of several Bi-LSTM variants on the NER task, using features generated automatically from unlabelled gene and protein patent corpora. The proposed model is able to capture the context representation of an input sequence and to assign the related label to each token globally. The CHARS-Bi-LSTM-EMA variant yielded the best performance and significantly outperformed the state-of-the-art approach.
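The abstract does not say how labels are assigned globally. As an illustration only, the sketch below assumes one common choice in sequence taggers: Viterbi decoding over per-token label scores (such as a Bi-LSTM might emit) combined with label-to-label transition scores. The function name viterbi_decode, the BIO label names, and all score values are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def viterbi_decode(
    emissions: List[Dict[str, float]],
    transitions: Dict[str, Dict[str, float]],
    labels: List[str],
) -> List[str]:
    """Return the highest-scoring label sequence for one sentence.

    emissions[i][y]   : score of label y at token i (assumed model output)
    transitions[x][y] : score of moving from label x to label y (assumed)
    """
    if not emissions:
        return []
    # best[i][y] = best score of any label path ending in y at token i
    best = [{y: emissions[0][y] for y in labels}]
    back = []  # back[i][y] = best previous label when token i+1 gets y
    for scores in emissions[1:]:
        prev = best[-1]
        cur, ptr = {}, {}
        for y in labels:
            # Choose the predecessor that maximises the path score into y.
            p = max(labels, key=lambda x: prev[x] + transitions[x][y])
            cur[y] = prev[p] + transitions[p][y] + scores[y]
            ptr[y] = p
        best.append(cur)
        back.append(ptr)
    # Trace back from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


# Hypothetical scores for the three tokens "the", "BRCA1", "protein".
labels = ["O", "B-GENE", "I-GENE"]
emissions = [
    {"O": 2.0, "B-GENE": 0.0, "I-GENE": 0.0},
    {"O": 0.0, "B-GENE": 1.0, "I-GENE": 1.5},
    {"O": 1.0, "B-GENE": 0.0, "I-GENE": 0.5},
]
# All transitions are neutral except O -> I-GENE, which is invalid in BIO.
transitions = {x: {y: 0.0 for y in labels} for x in labels}
transitions["O"]["I-GENE"] = -10.0

print(viterbi_decode(emissions, transitions, labels))
```

Labelling each token independently would pick I-GENE for the second token (its highest single score) and produce the invalid sequence O, I-GENE, O. Decoding over the whole sequence accounts for the transition penalty and selects B-GENE instead.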