Analysis of 2019 Ohio Disease Intervention Specialist (DIS) Records for Syphilis Cases Using Clustering Algorithms.

IF 2.4 · Zone 4, Medicine · JCR Q3, INFECTIOUS DISEASES
Payal Chakraborty, Xia Ning, Mary McNeill, David M Kline, Abigail B Shoben, William C Miller, Abigail Norris Turner
Sexually Transmitted Diseases. Journal Article. Published 2024-10-31. DOI: 10.1097/OLQ.0000000000002091 (https://doi.org/10.1097/OLQ.0000000000002091)
Citations: 0

Abstract

Background: Developments in natural language processing (NLP) and unsupervised machine learning methodologies (e.g., clustering) have given researchers new tools to analyze both structured and unstructured health data. We applied these methods to 2019 Ohio disease intervention specialist (DIS) syphilis records to determine whether they can uncover novel patterns of co-occurrence among individual characteristics, risk factors, and clinical characteristics of syphilis that have not yet been reported in the literature.

Methods: The 2019 DIS syphilis records (n=1,996) contain both structured data (categorical and numerical variables) and unstructured notes. In the structured data, we examined case demographics, syphilis risk factors, and clinical characteristics of syphilis. For the unstructured text, we applied TF-IDF (term frequency multiplied by inverse document frequency) weights, a common way to convert text into numerical representations. We performed agglomerative clustering with cosine similarity using the CLUTO software.
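The text pipeline described in the Methods can be illustrated with a minimal sketch, written in plain Python rather than CLUTO. It weights each term by term frequency times log inverse document frequency, normalizes the vectors so that a dot product equals cosine similarity, and then merges the two most similar clusters repeatedly. The average-linkage merge rule, the function names, and the four toy notes are assumptions made for illustration; the abstract does not state which merging criterion was used, and the notes are not drawn from the DIS records.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one L2-normalized TF-IDF dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors (a dot product)."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def agglomerate(vectors, k):
    """Merge the most similar pair of clusters until k clusters remain.

    Average linkage is an assumption of this sketch.
    """
    clusters = [[i] for i in range(len(vectors))]

    def avg_sim(a, b):
        total = sum(cosine(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > k:
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        a, b = max(pairs, key=lambda p: avg_sim(clusters[p[0]], clusters[p[1]]))
        clusters[a] += clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical toy notes, invented for this example.
notes = [
    "partner anonymous met online",
    "anonymous partner met bar",
    "no interview patient unreachable",
    "patient unreachable no contact",
]
vectors = tfidf_vectors([note.split() for note in notes])
print(agglomerate(vectors, 2))
```

In this toy example the first two notes share vocabulary about partners and the last two share vocabulary about unreachable patients, so the two groups end up in separate clusters. The second group loosely mirrors the abstract's observation that some clusters reflect missing information rather than behavior.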

Results: The cluster analysis yielded six clusters of syphilis cases based on patterns in the structured and unstructured data. The average internal similarities were much higher than the average external similarities, indicating that the clusters were well formed. Three of the clusters were driven by patterns of missing data; the other three were driven by sexual behaviors and partnerships. Notably, of those three, one consisted of individuals who reported oral sex with male or anonymous partners while intoxicated, and one consisted mainly of males who have sex with females.
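The comparison of internal and external similarity can be sketched as follows. Here "internal" is taken to mean the average pairwise cosine similarity among members of a cluster, and "external" the average cosine similarity between members and non-members. These definitions, the function names, and the toy vectors are assumptions for illustration; CLUTO's own computation of these statistics may differ in detail.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two dense vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean(values):
    """Arithmetic mean of an iterable; 0.0 when it is empty."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def internal_external(vectors, labels):
    """Map each cluster label to (internal, external) mean cosine similarity."""
    result = {}
    for label in sorted(set(labels)):
        inside = [v for v, l in zip(vectors, labels) if l == label]
        outside = [v for v, l in zip(vectors, labels) if l != label]
        # Internal: all unordered pairs within the cluster
        # (0.0 for a singleton cluster, which has no pairs).
        isim = mean(cosine(u, v) for u, v in combinations(inside, 2))
        # External: every member paired with every non-member.
        esim = mean(cosine(u, v) for u in inside for v in outside)
        result[label] = (isim, esim)
    return result


# Hypothetical feature vectors and cluster assignments.
vectors = [(1.0, 0.1, 0.0), (0.9, 0.2, 0.0), (0.0, 0.1, 1.0), (0.0, 0.2, 0.9)]
labels = [0, 0, 1, 1]
scores = internal_external(vectors, labels)
for label, (isim, esim) in scores.items():
    print(f"cluster {label}: internal={isim:.3f} external={esim:.3f}")
```

A large gap between the two values indicates that members resemble each other far more than they resemble other cases. As the abstract's conclusion notes, such a gap shows only that the clusters are mathematically cohesive, not that they are epidemiologically informative.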

Conclusions: Our analysis produced clusters that were well formed mathematically but did not reveal novel epidemiological information about syphilis risk factors or transmission.

Source journal
Sexually Transmitted Diseases (Medicine: Infectious Diseases)
CiteScore: 4.00
Self-citation rate: 16.10%
Articles published: 289
Review time: 3-8 weeks
About the journal: Sexually Transmitted Diseases, the official journal of the American Sexually Transmitted Diseases Association, publishes peer-reviewed, original articles on clinical, laboratory, immunologic, epidemiologic, behavioral, public health, and historical topics pertaining to sexually transmitted diseases and related fields. Reports from the CDC and NIH provide up-to-the-minute information. A highly respected editorial board is composed of prominent scientists who are leaders in this rapidly changing field. Included in each issue are studies and developments from around the world.