Super Monotonic Alignment Search
Junhyeok Lee, Hyeongju Kim
arXiv - EE - Audio and Speech Processing, 2024-09-12 (arXiv:2409.07704)
Abstract
Monotonic alignment search (MAS), introduced by Glow-TTS, is one of the most
popular algorithms in TTS for estimating unknown alignments between text and
speech. Because the algorithm searches for the most probable alignment with
dynamic programming, caching all paths, its time complexity is
$O(T \times S)$. The authors of Glow-TTS run this algorithm on CPU and note
that it is difficult to parallelize; however, we found that MAS can be
parallelized along the text-length dimension, and that CPU execution spends an
inordinate amount of time on inter-device copies. We therefore implemented a
Triton kernel and a PyTorch JIT script that accelerate MAS on GPU without any
inter-device copy. As a result, the Super-MAS Triton kernel is up to 72 times
faster in the extreme-length case. The code is available at
\url{https://github.com/supertone-inc/super-monotonic-align}.
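The dynamic program the abstract refers to can be sketched as follows. This is a minimal NumPy sketch of Glow-TTS-style MAS, not the paper's Triton implementation; the function name and the toy likelihood matrix are ours. It also illustrates the parallelization opportunity the abstract mentions: the recurrence is sequential over mel frames $j$, but for a fixed $j$ all text indices $i$ update independently.

```python
import numpy as np

def monotonic_alignment_search(log_p):
    """Vanilla MAS on a [T_text, S_mel] log-likelihood matrix (sketch).

    Returns a hard alignment of the same shape with exactly one active
    text index per mel frame, moving monotonically forward in text.
    """
    T, S = log_p.shape
    neg_inf = -1e9
    Q = np.full((T, S), neg_inf)
    Q[0, 0] = log_p[0, 0]
    # DP recurrence: Q[i, j] = log_p[i, j] + max(Q[i, j-1], Q[i-1, j-1]).
    # The loop over j (mel frames) is inherently sequential, but for a
    # fixed j every text index i updates independently -- the text-length
    # parallelism that a GPU kernel can exploit.
    for j in range(1, S):
        stay = Q[:, j - 1]                                  # keep same token
        move = np.concatenate(([neg_inf], Q[:-1, j - 1]))   # advance one token
        Q[:, j] = log_p[:, j] + np.maximum(stay, move)
    # Backtrack the most probable monotonic path from (T-1, S-1).
    path = np.zeros((T, S), dtype=np.int64)
    i = T - 1
    for j in range(S - 1, 0, -1):
        path[i, j] = 1
        if i > 0 and (i == j or Q[i - 1, j - 1] >= Q[i, j - 1]):
            i -= 1
    path[i, 0] = 1
    return path
```

The two Python-level loops over `j` are what make a pure-CPU implementation slow at long sequence lengths; moving the per-frame update into a fused GPU kernel removes both the Python overhead and the device-to-host copies.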