{"title":"SpeechAlign: Aligning Speech Generation to Human Preferences","authors":"Dong Zhang, Zhaowei Li, Shimin Li, Xin Zhang, Pengyu Wang, Yaqian Zhou, Xipeng Qiu","doi":"arxiv-2404.05600","DOIUrl":null,"url":null,"abstract":"Speech language models have significantly advanced in generating realistic\nspeech, with neural codec language models standing out. However, the\nintegration of human feedback to align speech outputs to human preferences is\noften neglected. This paper addresses this gap by first analyzing the\ndistribution gap in codec language models, highlighting how it leads to\ndiscrepancies between the training and inference phases, which negatively\naffects performance. Then we explore leveraging learning from human feedback to\nbridge the distribution gap. We introduce SpeechAlign, an iterative\nself-improvement strategy that aligns speech language models to human\npreferences. SpeechAlign involves constructing a preference codec dataset\ncontrasting golden codec tokens against synthetic tokens, followed by\npreference optimization to improve the codec language model. This cycle of\nimprovement is carried out iteratively to steadily convert weak models to\nstrong ones. Through both subjective and objective evaluations, we show that\nSpeechAlign can bridge the distribution gap and facilitating continuous\nself-improvement of the speech language model. Moreover, SpeechAlign exhibits\nrobust generalization capabilities and works for smaller models. Code and\nmodels will be available at https://github.com/0nutation/SpeechGPT.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Sound","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2404.05600","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Speech language models have advanced significantly in generating realistic speech, with neural codec language models standing out. However, the integration of human feedback to align speech outputs with human preferences is often neglected. This paper addresses this gap by first analyzing the distribution gap in codec language models, highlighting how it leads to discrepancies between the training and inference phases that negatively affect performance. We then explore leveraging learning from human feedback to bridge the distribution gap. We introduce SpeechAlign, an iterative self-improvement strategy that aligns speech language models with human preferences. SpeechAlign constructs a preference codec dataset that contrasts golden codec tokens against synthetic tokens, then applies preference optimization to improve the codec language model. This cycle of improvement is carried out iteratively to steadily convert weak models into strong ones. Through both subjective and objective evaluations, we show that SpeechAlign can bridge the distribution gap and facilitate continuous self-improvement of the speech language model. Moreover, SpeechAlign exhibits robust generalization capabilities and works for smaller models. Code and models will be available at https://github.com/0nutation/SpeechGPT.
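The abstract describes a cycle of (1) building preference pairs that contrast golden codec tokens with the model's own synthetic codec tokens and (2) running preference optimization on those pairs, repeated over rounds. As a rough illustration of step (2), the sketch below assumes a DPO-style objective; the function name preference_loss, the beta value, and the toy tensors are placeholders for illustration, not the paper's implementation.

```python
# Minimal, runnable sketch of a DPO-style preference-optimization step over
# (chosen = golden codec tokens, rejected = synthetic codec tokens) pairs.
# This illustrates the general technique; it is not the authors' code.
import torch
import torch.nn.functional as F


def preference_loss(policy_chosen_logp: torch.Tensor,
                    policy_rejected_logp: torch.Tensor,
                    ref_chosen_logp: torch.Tensor,
                    ref_rejected_logp: torch.Tensor,
                    beta: float = 0.1) -> torch.Tensor:
    """Push the policy to score golden codec tokens above synthetic ones,
    measured relative to a frozen reference model."""
    chosen_margin = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_margin = beta * (policy_rejected_logp - ref_rejected_logp)
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()


# Toy usage: sequence log-probabilities for a batch of 4 preference pairs
# (random stand-ins for values a codec language model would produce).
policy_chosen = torch.randn(4, requires_grad=True)
policy_rejected = torch.randn(4, requires_grad=True)
ref_chosen = torch.randn(4)
ref_rejected = torch.randn(4)

loss = preference_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected)
loss.backward()  # gradients flow only into the policy's log-probabilities
print(float(loss))
```

In the iterative cycle the abstract outlines, each round would regenerate synthetic tokens with the newly optimized model to rebuild the preference dataset before the next optimization pass.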