Alexander Kinsora, Kate Barron, Q. Mei, V.G. Vinod Vydiswaran
2017 IEEE International Conference on Healthcare Informatics (ICHI), published 2017-08-01. DOI: 10.1109/ICHI.2017.93 (https://doi.org/10.1109/ICHI.2017.93)
Creating a Labeled Dataset for Medical Misinformation in Health Forums
The dissemination of medical misinformation online presents a challenge to human health. Machine learning techniques can reduce the cognitive load of deciding whether a given user comment is likely to contain misinformation, but a paucity of labeled medical misinformation data makes supervised approaches difficult. To address this gap, we present a new labeled dataset of misinformative and non-misinformative comments, developed from questions and comments posted on a health discussion forum. Building it required extracting candidate misinformative entries from the corpus using information retrieval techniques, developing a codex and labeling strategy for the dataset, and creating features for use in machine learning tasks. Using Recursive Feature Elimination to identify the nine features most descriptive for classifying a comment as misinformative or non-misinformative, we achieved a classification accuracy of 90.1% on a dataset in which 85.8% of comments are non-misinformative. We believe this dataset and analysis will aid the machine learning community in developing online misinformation classification systems for user-generated content such as medical forum posts.
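One way to read the two figures in the abstract together is to compare the reported accuracy with a majority-class baseline, that is, a classifier that labels every comment non-misinformative. The abstract does not frame the result this way, so the comparison below is our own reading; the helper names are hypothetical, and only the two percentages come from the source.

```python
def majority_baseline_accuracy(majority_share: float) -> float:
    """Accuracy of a classifier that always predicts the larger class."""
    return max(majority_share, 1.0 - majority_share)


def relative_error_reduction(model_accuracy: float, baseline_accuracy: float) -> float:
    """Fraction of the baseline's errors that the model eliminates."""
    baseline_error = 1.0 - baseline_accuracy
    model_error = 1.0 - model_accuracy
    return (baseline_error - model_error) / baseline_error


# The only two numbers taken from the abstract.
NON_MISINFORMATIVE_SHARE = 0.858
REPORTED_ACCURACY = 0.901

# Assumption: the baseline is "always predict non-misinformative".
baseline = majority_baseline_accuracy(NON_MISINFORMATIVE_SHARE)
gain_points = (REPORTED_ACCURACY - baseline) * 100
reduction = relative_error_reduction(REPORTED_ACCURACY, baseline)

print(f"Majority-class baseline: {baseline:.1%}")
print(f"Absolute gain: {gain_points:.1f} percentage points")
print(f"Relative error reduction: {reduction:.1%}")
```

Under this reading, the nine-feature classifier improves on the trivial baseline by about 4.3 percentage points, which corresponds to removing roughly 30% of the baseline's errors. Accuracy alone does not show how those gains are split between the two classes, so this is a rough gauge, not a full evaluation.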