Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models

Dong, Guanting; Lu, Keming; Li, Chengpeng; Xia, Tingyu; Yu, Bowen; Zhou, Chang; Zhou, Jingren

arXiv:2406.13542 (cs)

[Submitted on 19 Jun 2024 (v1), last revised 18 Jul 2024 (this version, v3)]

Abstract: One core capability of large language models (LLMs) is to follow natural language instructions. However, the issue of automatically constructing high-quality training data to enhance the complex instruction-following abilities of LLMs without manual annotation remains unresolved. In this paper, we introduce AutoIF, the first scalable and reliable method for automatically generating instruction-following training data. AutoIF transforms the validation of instruction-following data quality into code verification, requiring LLMs to generate instructions, the corresponding code to check the correctness of the instruction responses, and unit test samples to verify the code's correctness. Then, execution feedback-based rejection sampling can generate data for Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) training. AutoIF achieves significant improvements across three training algorithms (SFT, Offline DPO, and Online DPO) when applied to the top open-source LLMs, Qwen2 and LLaMA3, in self-alignment and strong-to-weak distillation settings. Our code is publicly available at this https URL.
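
The pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the instruction ("Answer in at most five words"), the verifier `check_max_five_words`, the unit test samples, and every function name below are hypothetical, and the verifier is hand-written here where AutoIF would have an LLM generate it. The sketch only shows the general shape: a verifier is trusted when it agrees with its unit test samples, trusted verifiers filter candidate responses, and the accepted and rejected responses become SFT examples and preference pairs.

```python
from typing import Callable, List, Tuple

Verifier = Callable[[str], bool]


def check_max_five_words(response: str) -> bool:
    """Hypothetical verifier for the instruction 'Answer in at most five words.'"""
    return len(response.split()) <= 5


# Hypothetical unit test samples: (response, expected verdict).
UNIT_TESTS: List[Tuple[str, bool]] = [
    ("Paris is the capital.", True),
    ("The capital city of France is Paris.", False),
]


def verifier_is_trusted(verifier: Verifier, tests: List[Tuple[str, bool]]) -> bool:
    """Keep a verifier only if it agrees with every unit test sample."""
    for response, expected in tests:
        try:
            if verifier(response) != expected:
                return False
        except Exception:
            # Generated code may crash; treat a crash as a failed check.
            return False
    return True


def rejection_sample(
    verifier: Verifier, candidates: List[str]
) -> Tuple[List[str], List[str]]:
    """Split candidate responses into those that pass and fail the verifier."""
    accepted = [c for c in candidates if verifier(c)]
    rejected = [c for c in candidates if not verifier(c)]
    return accepted, rejected


def build_training_data(
    verifier: Verifier, tests: List[Tuple[str, bool]], candidates: List[str]
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Return (SFT examples, (chosen, rejected) preference pairs)."""
    if not verifier_is_trusted(verifier, tests):
        return [], []
    accepted, rejected = rejection_sample(verifier, candidates)
    pairs = [(good, bad) for good in accepted for bad in rejected]
    return accepted, pairs
```

Calling `build_training_data(check_max_five_words, UNIT_TESTS, candidates)` with a list of sampled responses returns the responses that satisfy the constraint, plus pairs that could feed a preference-based objective such as DPO. A verifier that disagrees with any unit test sample yields no data at all, which is the role the abstract assigns to code verification.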

Comments:	Work in progress
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2406.13542 [cs.CL]
	(or arXiv:2406.13542v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2406.13542

Submission history

From: Guanting Dong
[v1] Wed, 19 Jun 2024 13:29:53 UTC (1,958 KB)
[v2] Wed, 17 Jul 2024 14:33:35 UTC (1,959 KB)
[v3] Thu, 18 Jul 2024 09:00:23 UTC (1,959 KB)
