{
  "title": "Phrase Based Topic Modeling for Semantic Information Processing in Biomedicine.",
  "authors": "Zhiguo Yu, Todd R Johnson, Ramakanth Kavuluru",
  "doi": "10.1109/ICMLA.2013.89",
  "journal": {
    "name": "Proceedings of the ... International Conference on Machine Learning and Applications. International Conference on Machine Learning and Applications",
    "volume": "2013",
    "pages": "440-445"
  },
  "publicationDate": "2013-12-01",
  "publicationTypes": "Journal Article",
  "EPubDate": "2014/4/10 0:00:00",
  "PubModel": "Epub",
  "citationCount": "12",
  "platform": "Semanticscholar",
  "ListUrlMain": "https://doi.org/10.1109/ICMLA.2013.89"
}
Phrase Based Topic Modeling for Semantic Information Processing in Biomedicine.
Given that unstructured data is increasing exponentially every day, extracting and understanding the information, themes, and relationships from large collections of documents is increasingly important to researchers in many disciplines, including biomedicine. Latent Dirichlet Allocation (LDA) is an unsupervised topic modeling technique, based on the "bag-of-words" assumption, that has been applied extensively to unveil hidden semantic themes within large sets of textual documents. Recently, it was extended using the "bag-of-n-grams" paradigm to account for word order. In this paper, we present an alternative phrase-based LDA model that moves from a bag-of-words or bag-of-n-grams paradigm to a "bag-of-key-phrases" setting by applying a key phrase extraction technique, the C-value method, to further explore latent themes. We evaluate our approach with a phrase intrusion user study and demonstrate that our model can help LDA generate better and more interpretable topics than those generated using the bag-of-n-grams approach. Given that topic models are essentially statistical tools, an important problem in topic modeling is visualizing and interacting with the models to understand and extract new information from a collection. To evaluate our phrase-based modeling approach in this context, we incorporate it into an open source interactive topic browser. Qualitative evaluations of this browser with biomedical experts demonstrate that our approach can help biomedical researchers gain a better and faster understanding of their document collections.
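To make the "bag-of-key-phrases" idea concrete, the following is a minimal Python sketch of the two steps the abstract names: scoring candidate phrases with C-value, then representing a document by the key phrases it contains. The abstract does not give the formula or the matching strategy, so several things here are assumptions: the scoring follows the commonly cited form of C-value (log2 of phrase length times frequency, discounted by the mean frequency of longer candidates that contain the phrase), the linguistic filter that normally produces candidates is skipped, the function names `c_value`, `is_subphrase`, and `key_phrase_bag` are hypothetical, the greedy longest-match lookup is an illustrative choice rather than the authors' method, and the frequencies are toy values.

```python
import math
from collections import Counter


def is_subphrase(a, b):
    """Return True if token tuple `a` occurs contiguously inside token tuple `b`."""
    n = len(a)
    return any(b[i:i + n] == a for i in range(len(b) - n + 1))


def c_value(candidate_freqs):
    """Score candidates with the commonly cited C-value formula (assumed form).

    candidate_freqs maps a candidate phrase (tuple of tokens) to its corpus
    frequency. Note that log2(1) = 0, so single-word candidates score 0.
    """
    scores = {}
    for a, f_a in candidate_freqs.items():
        # Longer candidates that contain `a` as a nested phrase.
        containers = [b for b in candidate_freqs
                      if len(b) > len(a) and is_subphrase(a, b)]
        length_weight = math.log2(len(a))
        if containers:
            nested_mean = sum(candidate_freqs[b] for b in containers) / len(containers)
            scores[a] = length_weight * (f_a - nested_mean)
        else:
            scores[a] = length_weight * f_a
    return scores


def key_phrase_bag(tokens, key_phrases):
    """Count key phrases in a token list, taking the longest match at each position.

    Greedy longest-match is an assumption made for this sketch.
    """
    phrases = set(key_phrases)
    max_len = max((len(p) for p in phrases), default=0)
    bag = Counter()
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = tuple(tokens[i:i + n])
            if span in phrases:
                bag[" ".join(span)] += 1
                i += n
                break
        else:
            i += 1  # no key phrase starts here; skip this token
    return bag


# Toy frequencies, for illustration only.
freqs = {("soft", "contact", "lens"): 5, ("contact", "lens"): 8}
scores = c_value(freqs)
# "contact lens" is nested in one longer candidate: 1 * (8 - 5) = 3.0
print(scores[("contact", "lens")])

doc = "patients with soft contact lens and contact lens wear".split()
print(key_phrase_bag(doc, freqs.keys()))
```

The resulting per-document phrase counts take the place of word counts as input to a standard LDA implementation, so each topic becomes a distribution over key phrases rather than over single words. Some C-value variants use log2(length + 1) so that single-word terms are not scored zero; which variant the paper uses is not stated in the abstract.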