PaSTiLa: Scalable Parallel Algorithm for Unsupervised Labeling of Long Time Series

IF 0.8 Q2 MATHEMATICS

Lobachevskii Journal of Mathematics. Pub Date: 2024-07-19. DOI: 10.1134/s1995080224600766

M. L. Zymbler, A. I. Goglachev


Citations: 0

Abstract

Summarization aims to discover a small set of typical subsequences (patterns) in a given long time series that together represent the whole series. Once such patterns are found, the series can be labeled without supervision by assigning each subsequence a tag that corresponds to its most similar pattern. In previous research, we developed the PSF (Parallel Snippet-Finder) algorithm for time series summarization on a GPU, where a snippet is a subsequence of a given length that is similar to many other subsequences with respect to the bespoke distance measure MPdist. However, PSF requires that the snippet length be predefined by a domain expert. In this article, we introduce the novel parallel algorithm PaSTiLa (Parallel Snippet-based Time series Labeling), which discovers snippets and produces the labeling of a given time series on an HPC cluster with GPU nodes. Unlike its predecessor, PaSTiLa automatically selects the snippet length from a specified range using our proposed heuristic criterion. In experiments on labeling quality over time series from the TSSB (Time Series Segmentation Benchmark) dataset, PaSTiLa outperforms state-of-the-art segmentation-based competitors in average \(\textrm{F}_{1}\) score. On long time series (typically more than 8–10 K points), PaSTiLa also runs faster than its rivals. Finally, on million-length time series, our algorithm demonstrates close-to-linear speedup.
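The labeling step described in the abstract (tag each subsequence with its most similar pattern) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the snippets are already known, cuts the series into non-overlapping windows of length `m`, and replaces MPdist with a plain z-normalized Euclidean distance as a stand-in. The names `znorm`, `distance`, and `label_series` and the toy data are hypothetical.

```python
import math


def znorm(seq):
    """Z-normalize a sequence; a constant sequence maps to all zeros."""
    n = len(seq)
    mean = sum(seq) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in seq) / n)
    if std == 0:
        return [0.0] * n
    return [(x - mean) / std for x in seq]


def distance(a, b):
    """Stand-in measure (assumption): Euclidean distance between the
    z-normalized sequences. The paper uses MPdist instead."""
    za, zb = znorm(a), znorm(b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(za, zb)))


def label_series(series, snippets, m):
    """Tag each non-overlapping length-m window with the index of its
    nearest snippet under the stand-in distance."""
    labels = []
    for start in range(0, len(series) - m + 1, m):
        window = series[start:start + m]
        dists = [distance(window, s) for s in snippets]
        labels.append(dists.index(min(dists)))
    return labels


# Toy data (hypothetical): snippet 0 is an oscillation, snippet 1 is a ramp.
m = 4
snippets = [[0, 1, 0, -1], [0, 1, 2, 3]]
series = [0, 2, 0, -2, 1, 2, 3, 4, 0, 1, 0, -1]
labels = label_series(series, snippets, m)
print(labels)
```

Because the windows are z-normalized before comparison, a scaled oscillation and a shifted ramp still match their respective patterns. Discovering the snippets themselves, and choosing the snippet length automatically, are the parts PaSTiLa contributes and are not covered by this sketch.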




Source journal

Lobachevskii Journal of Mathematics (Mathematics)

CiteScore: 1.50

Self-citation rate: 42.90%

Articles published: 127

About the journal: Lobachevskii Journal of Mathematics is an international peer-reviewed journal published in collaboration with the Russian Academy of Sciences and Kazan Federal University. The journal covers mathematical topics associated with the name of the famous Russian mathematician Nikolai Lobachevsky (Lobachevskii). It publishes research articles on geometry and topology, algebra, complex analysis, functional analysis, differential equations and mathematical physics, probability theory and stochastic processes, computational mathematics, mathematical modeling, numerical methods and program complexes, computer science, optimal control, and theory of algorithms, as well as applied mathematics. The journal welcomes English-language manuscripts from all countries.