{"title":"Qwen2.5-Math 技术报告:通过自我完善建立数学专家模型","authors":"An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, Keming Lu, Mingfeng Xue, Runji Lin, Tianyu Liu, Xingzhang Ren, Zhenru Zhang","doi":"arxiv-2409.12122","DOIUrl":null,"url":null,"abstract":"In this report, we present a series of math-specific large language models:\nQwen2.5-Math and Qwen2.5-Math-Instruct-1.5B/7B/72B. The core innovation of the\nQwen2.5 series lies in integrating the philosophy of self-improvement\nthroughout the entire pipeline, from pre-training and post-training to\ninference: (1) During the pre-training phase, Qwen2-Math-Instruct is utilized\nto generate large-scale, high-quality mathematical data. (2) In the\npost-training phase, we develop a reward model (RM) by conducting massive\nsampling from Qwen2-Math-Instruct. This RM is then applied to the iterative\nevolution of data in supervised fine-tuning (SFT). With a stronger SFT model,\nit's possible to iteratively train and update the RM, which in turn guides the\nnext round of SFT data iteration. On the final SFT model, we employ the\nultimate RM for reinforcement learning, resulting in the Qwen2.5-Math-Instruct.\n(3) Furthermore, during the inference stage, the RM is used to guide sampling,\noptimizing the model's performance. Qwen2.5-Math-Instruct supports both Chinese and English, and possess advanced\nmathematical reasoning capabilities, including Chain-of-Thought (CoT) and\nTool-Integrated Reasoning (TIR). We evaluate our models on 10 mathematics\ndatasets in both English and Chinese, such as GSM8K, MATH, GaoKao, AMC23, and\nAIME24, covering a range of difficulties from grade school level to math\ncompetition problems.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"9 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement\",\"authors\":\"An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, Keming Lu, Mingfeng Xue, Runji Lin, Tianyu Liu, Xingzhang Ren, Zhenru Zhang\",\"doi\":\"arxiv-2409.12122\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this report, we present a series of math-specific large language models:\\nQwen2.5-Math and Qwen2.5-Math-Instruct-1.5B/7B/72B. The core innovation of the\\nQwen2.5 series lies in integrating the philosophy of self-improvement\\nthroughout the entire pipeline, from pre-training and post-training to\\ninference: (1) During the pre-training phase, Qwen2-Math-Instruct is utilized\\nto generate large-scale, high-quality mathematical data. (2) In the\\npost-training phase, we develop a reward model (RM) by conducting massive\\nsampling from Qwen2-Math-Instruct. This RM is then applied to the iterative\\nevolution of data in supervised fine-tuning (SFT). With a stronger SFT model,\\nit's possible to iteratively train and update the RM, which in turn guides the\\nnext round of SFT data iteration. On the final SFT model, we employ the\\nultimate RM for reinforcement learning, resulting in the Qwen2.5-Math-Instruct.\\n(3) Furthermore, during the inference stage, the RM is used to guide sampling,\\noptimizing the model's performance. 
Qwen2.5-Math-Instruct supports both Chinese and English, and possess advanced\\nmathematical reasoning capabilities, including Chain-of-Thought (CoT) and\\nTool-Integrated Reasoning (TIR). We evaluate our models on 10 mathematics\\ndatasets in both English and Chinese, such as GSM8K, MATH, GaoKao, AMC23, and\\nAIME24, covering a range of difficulties from grade school level to math\\ncompetition problems.\",\"PeriodicalId\":501030,\"journal\":{\"name\":\"arXiv - CS - Computation and Language\",\"volume\":\"9 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Computation and Language\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.12122\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computation and Language","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.12122","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement
In this report, we present a series of math-specific large language models:
Qwen2.5-Math and Qwen2.5-Math-Instruct-1.5B/7B/72B. The core innovation of the
Qwen2.5 series lies in integrating the philosophy of self-improvement
throughout the entire pipeline, from pre-training and post-training to
inference: (1) During the pre-training phase, Qwen2-Math-Instruct is utilized
to generate large-scale, high-quality mathematical data.
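The abstract does not detail the generation pipeline itself; the following is a minimal sketch of how an instruct model could synthesize math pre-training data, where `generate` is a hypothetical stand-in for a call to Qwen2-Math-Instruct and the prompt and filtering rule are illustrative assumptions.

```python
from typing import Callable, List

def synthesize_math_corpus(seed_topics: List[str],
                           generate: Callable[[str], str],
                           samples_per_topic: int = 4) -> List[str]:
    """Ask the instruct model to write problem/solution pairs for each topic."""
    corpus: List[str] = []
    for topic in seed_topics:
        for _ in range(samples_per_topic):
            prompt = (
                f"Write a challenging {topic} problem and a complete, "
                "step-by-step solution. End with 'Answer: <final answer>'."
            )
            text = generate(prompt)
            # Simple quality gate (assumption): keep only samples that reach a final answer.
            if "Answer:" in text:
                corpus.append(text)
    return corpus
```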
(2) In the post-training phase, we develop a reward model (RM) by conducting massive
sampling from Qwen2-Math-Instruct. This RM is then applied to the iterative
evolution of data in supervised fine-tuning (SFT). With a stronger SFT model,
it is possible to iteratively train and update the RM, which in turn guides the
next round of SFT data iteration. On the final SFT model, we employ the
ultimate RM for reinforcement learning, resulting in Qwen2.5-Math-Instruct.
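The abstract describes this loop only at a high level; the sketch below shows one plausible round of RM-guided SFT data evolution under stated assumptions: `sample_responses`, `rm_score`, `train_sft`, and `train_rm` are hypothetical callables, and keeping the top-scoring candidate per problem is an assumed selection rule, not the report's exact recipe.

```python
from typing import Callable, List, Tuple

def evolve_sft_round(problems: List[str],
                     sample_responses: Callable[[str, int], List[str]],
                     rm_score: Callable[[str, str], float],
                     train_sft: Callable[[List[Tuple[str, str]]], object],
                     train_rm: Callable[[List[Tuple[str, List[str]]]], object],
                     n_samples: int = 8):
    """One self-improvement round: sample, rank with the RM, refit SFT, refit the RM."""
    sft_data: List[Tuple[str, str]] = []
    rm_data: List[Tuple[str, List[str]]] = []
    for problem in problems:
        candidates = sample_responses(problem, n_samples)
        ranked = sorted(candidates, key=lambda r: rm_score(problem, r), reverse=True)
        sft_data.append((problem, ranked[0]))  # highest-scoring response becomes the SFT target
        rm_data.append((problem, ranked))      # ranked candidates become preference data for the RM
    stronger_sft_model = train_sft(sft_data)   # a stronger SFT model...
    updated_rm = train_rm(rm_data)             # ...supports an updated RM for the next round
    return stronger_sft_model, updated_rm
```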
(3) Furthermore, during the inference stage, the RM is used to guide sampling,
optimizing the model's performance.
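The abstract does not specify how the RM guides sampling; best-of-N reranking is one common realization and is sketched below as an assumption, with `sample_responses` and `rm_score` again standing in for the sampler and the reward model.

```python
from typing import Callable, List

def rm_best_of_n(problem: str,
                 sample_responses: Callable[[str, int], List[str]],
                 rm_score: Callable[[str, str], float],
                 n: int = 8) -> str:
    """Sample n candidate solutions and return the one the reward model scores highest."""
    candidates = sample_responses(problem, n)
    return max(candidates, key=lambda r: rm_score(problem, r))
```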
Qwen2.5-Math-Instruct supports both Chinese and English, and possesses advanced
mathematical reasoning capabilities, including Chain-of-Thought (CoT) and
Tool-Integrated Reasoning (TIR).
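TIR interleaves natural-language reasoning with executable code. The following is a minimal, hypothetical sketch of a single tool-integrated step, assuming the model emits Python inside a fenced code block and that `generate` wraps the underlying model call; the exact prompt and output format used by Qwen2.5-Math-Instruct may differ.

```python
import contextlib
import io
import re
from typing import Callable

FENCE = "`" * 3  # written this way only to avoid a literal fence inside this example

def tir_step(problem: str, generate: Callable[[str], str]) -> str:
    """One tool-integrated step: let the model write code, run it, and feed the output back."""
    draft = generate(f"Solve step by step, writing Python where computation helps:\n{problem}")
    match = re.search(FENCE + r"python\n(.*?)" + FENCE, draft, re.DOTALL)
    if match is None:
        return draft  # the model answered with plain chain-of-thought, no tool call
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(match.group(1), {})  # execute the model's code in a fresh namespace
    observation = buffer.getvalue().strip()
    # Append the execution result as an observation and let the model finish the solution.
    return generate(draft + "\n" + FENCE + "output\n" + observation + "\n" + FENCE + "\n")
```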
We evaluate our models on 10 mathematics datasets in both English and Chinese,
such as GSM8K, MATH, GaoKao, AMC23, and AIME24, covering a range of difficulties
from grade-school level to math competition problems.