Task-Adaptive Tokenization: Enhancing Long-Form Text Generation Efficacy in Mental Health and Beyond

Siyang Liu, Naihao Deng, Sahand Sabour, Yilin Jia, Minlie Huang, Rada Mihalcea

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 15264-15281, 2023. DOI: 10.18653/v1/2023.emnlp-main.944
Abstract
We propose task-adaptive tokenization as a way to adapt the generation pipeline to the specifics of a downstream task and enhance long-form generation in mental health. Inspired by insights from cognitive science, our task-adaptive tokenizer samples variable segmentations from multiple outcomes, with sampling probabilities optimized based on task-specific data. We introduce a strategy for building a specialized vocabulary, along with a vocabulary merging protocol that integrates task-specific tokens into the pre-trained model's tokenization step. Through extensive experiments on psychological question-answering tasks in both Chinese and English, we find that our task-adaptive tokenization approach brings a significant improvement in generation performance while using up to 60% fewer tokens. Preliminary experiments point to promising results when using our tokenization approach with very large language models.
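The two mechanisms the abstract describes (sampling variable segmentations with probabilities fit to task-specific data, and merging the resulting vocabulary into a pre-trained model's tokenizer) can be approximated with off-the-shelf tools. The sketch below is an illustration under assumptions, not the authors' implementation: it trains a unigram SentencePiece model on a hypothetical task corpus (task_corpus.txt), samples alternative segmentations via subword regularization, and folds novel pieces into a pre-trained Hugging Face tokenizer with add_tokens. The corpus path, vocabulary size, and choice of gpt2 as the base model are all placeholder assumptions.

```python
# Illustrative sketch of task-adaptive tokenization (not the paper's code).
# Assumes: a task-specific corpus in task_corpus.txt, plus the
# `sentencepiece` and `transformers` packages.
import sentencepiece as spm
from transformers import AutoTokenizer, AutoModelForCausalLM

# 1) Build a task-specific unigram vocabulary. A unigram LM tokenizer
#    assigns each piece a probability, which is what makes sampling
#    multiple segmentations of the same text possible.
spm.SentencePieceTrainer.train(
    input="task_corpus.txt",   # hypothetical task-specific data
    model_prefix="task_tok",
    vocab_size=8000,           # assumed size; tune per task
    model_type="unigram",
)
sp = spm.SentencePieceProcessor(model_file="task_tok.model")

# 2) Sample variable segmentations (subword regularization):
#    nbest_size=-1 samples over all candidate segmentations;
#    alpha controls how peaked the sampling distribution is.
text = "patients often describe persistent feelings of hopelessness"
for _ in range(3):
    print(sp.encode(text, out_type=str,
                    enable_sampling=True, nbest_size=-1, alpha=0.1))

# 3) Merge task-specific pieces into a pre-trained tokenizer's vocabulary
#    and resize the model's embeddings to match. This stands in for the
#    paper's vocabulary merging protocol; a faithful merge would also
#    reconcile piece-prefix conventions (SentencePiece's "▁" vs GPT-2's "Ġ").
base = AutoTokenizer.from_pretrained("gpt2")
task_pieces = [sp.id_to_piece(i) for i in range(sp.get_piece_size())]
new_pieces = [p for p in task_pieces if p not in base.get_vocab()]
num_added = base.add_tokens(new_pieces)

model = AutoModelForCausalLM.from_pretrained("gpt2")
model.resize_token_embeddings(len(base))
print(f"added {num_added} task-specific tokens")
```

The sampling in step 2 is what lets the tokenizer expose multiple segmentation outcomes during training, while step 3 shows one plausible way task-specific tokens could reach the pre-trained model; newly added embeddings would still need fine-tuning on task data before they carry useful signal.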