Detection and evaluation of clusters within sequential data.

IF 4.3; CAS Zone 3 (Computer Science); JCR Q2 (Computer Science, Artificial Intelligence)

Data Mining and Knowledge Discovery; Pub Date: 2025-01-01; Epub Date: 2025-08-14; DOI: 10.1007/s10618-025-01140-4

Alexander Van Werde, Albert Senen-Cerda, Gianluca Kosmella, Jaron Sanders

Data Mining and Knowledge Discovery, volume 39, issue 6, page 69. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12354125/pdf/

Citations: 0

Abstract

Sequential data is ubiquitous: it is routinely gathered to gain insights into complex processes such as behavioral, biological, or physical processes. Challengingly, such data not only has dependencies within the observed sequences, but the observations are also often high-dimensional, sparse, and noisy. These difficulties all obscure the inner workings of the complex process under study. One solution is to calculate a low-dimensional representation that describes (characteristics of) the complex process. This representation can then serve as a proxy to gain insight into the original process. However, uncovering such a low-dimensional representation within sequential data is nontrivial due to the dependencies, and an algorithm specifically made for sequences is needed to guarantee estimator consistency. Fortunately, recent theoretical advancements on Block Markov Chains have resulted in new clustering algorithms that can provably do just this in synthetic sequential data. This paper presents a first field study of these new algorithms in real-world sequential data: a wide empirical study of clustering within a range of data sequences. We investigate broadly whether, given sparse high-dimensional sequential data of real-life complex processes, useful low-dimensional representations can in fact be extracted using these algorithms. Concretely, we examine data sequences containing GPS coordinates describing animal movement, strands of human DNA, texts from English writing, and daily yields in a financial market. The low-dimensional representations we uncover are shown not only to successfully encode the sequential structure of the data, but also to enable new insights into the underlying complex processes.

Supplementary information: The online version contains supplementary material available at 10.1007/s10618-025-01140-4.
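
As a rough illustration of the Block Markov Chain setting mentioned in the abstract, the sketch below simulates a small chain and estimates its cluster-level transition matrix, which plays the role of the low-dimensional representation. It is not the clustering algorithm studied in the paper: the cluster assignment is assumed to be known here, and the names `simulate_bmc` and `cluster_transition_matrix`, the six-state layout, and the probabilities in `p` are all hypothetical choices made for this example.

```python
import random
from collections import Counter


def simulate_bmc(cluster_of, p, length, seed=0):
    """Simulate a Block Markov Chain path.

    The next cluster is drawn from row p[current cluster]; the next state
    is then chosen uniformly among the states of that cluster.
    """
    rng = random.Random(seed)
    k = len(p)
    # List the states belonging to each cluster.
    members = [[s for s, c in enumerate(cluster_of) if c == j] for j in range(k)]
    state = rng.randrange(len(cluster_of))
    path = [state]
    for _ in range(length - 1):
        next_cluster = rng.choices(range(k), weights=p[cluster_of[state]])[0]
        state = rng.choice(members[next_cluster])
        path.append(state)
    return path


def cluster_transition_matrix(path, cluster_of, k):
    """Estimate the k x k cluster-level transition matrix from one path."""
    # Count observed transitions between clusters along the path.
    counts = Counter((cluster_of[a], cluster_of[b]) for a, b in zip(path, path[1:]))
    matrix = []
    for i in range(k):
        row_total = sum(counts[(i, j)] for j in range(k))
        matrix.append(
            [counts[(i, j)] / row_total if row_total else 0.0 for j in range(k)]
        )
    return matrix


# Hypothetical toy setup: 6 states in 2 clusters, made-up probabilities.
cluster_of = [0, 0, 0, 1, 1, 1]
p = [[0.9, 0.1], [0.2, 0.8]]
path = simulate_bmc(cluster_of, p, length=20000)
p_hat = cluster_transition_matrix(path, cluster_of, k=2)
print(p_hat)
```

With a long enough path, each row of `p_hat` should lie close to the corresponding row of `p`. In a real application the cluster assignment is unknown and has to be detected from the sequence first, which is the harder problem the paper addresses.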

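Source journal: Data Mining and Knowledge Discovery (Engineering and Technology: Computer Science, Artificial Intelligence)

CiteScore: 10.40

Self-citation rate: 4.20%

Articles published:

Review time: 10 months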

Journal description: Advances in data gathering, storage, and distribution have created a need for computational tools and techniques to aid in data analysis. Data Mining and Knowledge Discovery in Databases (KDD) is a rapidly growing area of research and application that builds on techniques and theories from many fields, including statistics, databases, pattern recognition and learning, data visualization, uncertainty modelling, data warehousing and OLAP, optimization, and high performance computing.