Title: Boundary Detection by Determining the Difference of Classification Probabilities of Sequences: Topic Segmentation of Clinical Notes
Authors: W. Ruan, Won-sook Lee
Published in: 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2018-12-01
DOI: 10.1109/BIBM.2018.8621195 (https://doi.org/10.1109/BIBM.2018.8621195)
Citations: 2
Abstract
Topic segmentation of clinical notes is an important problem in information retrieval, and solving it could support the diagnostic process. In this study, we propose a topic segmentation method for clinical notes that detects boundaries from differences in the classification probabilities of sequences. Using 1127 plain-text clinical notes collected from I2B2, we experiment on 5 topics: medications, history, hospital course, laboratories, and physical exams. Naive Bayes and Linear SVM models with selected bag-of-words (BOW) features are used to train Topic Score Predictors that assign each sequence a 5-dimensional vector $v_{i}$, in which each element represents the probability that the sequence belongs to the corresponding class. By analyzing the sequence of vectors $\rho = [v_{1}, v_{2}, \cdots, v_{i}]$, boundaries are detected at the locations where the topic scores change rapidly. The widely used Windiff, $P_{k}$, and $F_{1}$ score metrics are used to evaluate the system. The segmenter based on Naive Bayes outperforms the one based on the SVM model, achieving 0.1468 for Windiff, 0.1221 for $P_{k}$, and an averaged $F_{1}$ score over 0.90.
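The boundary detection step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how a "rapid change" in topic scores is measured, so the L1 distance between adjacent score vectors, the threshold value, the toy scores, and the function names `score_change`, `detect_boundaries`, and `p_k` are all assumptions made for illustration, not the authors' implementation. The `p_k` function follows the commonly used definition of the metric, with a probe distance `k` chosen arbitrarily for the toy data.

```python
from typing import List, Sequence


def score_change(a: Sequence[float], b: Sequence[float]) -> float:
    """L1 distance between two topic-score vectors (assumed change measure)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def detect_boundaries(rho: List[Sequence[float]], threshold: float = 1.0) -> List[int]:
    """Return every index i where a boundary falls between rho[i-1] and rho[i].

    The threshold is a hypothetical tuning parameter, not taken from the paper.
    """
    return [
        i
        for i in range(1, len(rho))
        if score_change(rho[i - 1], rho[i]) > threshold
    ]


def p_k(reference: List[int], hypothesis: List[int], n: int, k: int) -> float:
    """P_k: share of probes (i, i+k) on which the two segmentations disagree
    about whether both probe ends lie in the same segment (lower is better)."""
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < n")

    def same_segment(boundaries: List[int], i: int, j: int) -> bool:
        # Units i and j share a segment if no boundary lies in (i, j].
        return not any(i < b <= j for b in boundaries)

    errors = sum(
        same_segment(reference, i, i + k) != same_segment(hypothesis, i, i + k)
        for i in range(n - k)
    )
    return errors / (n - k)


# Toy topic scores for six sequences over five topics (made-up values).
rho = [
    [0.90, 0.05, 0.02, 0.02, 0.01],
    [0.85, 0.10, 0.02, 0.02, 0.01],
    [0.10, 0.80, 0.05, 0.03, 0.02],
    [0.05, 0.85, 0.05, 0.03, 0.02],
    [0.02, 0.03, 0.05, 0.10, 0.80],
    [0.02, 0.03, 0.05, 0.05, 0.85],
]

boundaries = detect_boundaries(rho)
print("detected boundaries:", boundaries)
print("P_k against reference [2, 4]:", p_k([2, 4], boundaries, n=len(rho), k=1))
```

On this toy input the dominant topic switches before the third and fifth sequences, so boundaries are placed at indices 2 and 4. In practice the score vectors would come from a trained probabilistic classifier, and the threshold would be tuned on held-out notes.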