DataStates-LLM: Lazy Asynchronous Checkpointing for Large Language Models

Maurya, Avinash; Underwood, Robert; Rafique, M. Mustafa; Cappello, Franck; Nicolae, Bogdan

doi:10.1145/3625549.3658685

arXiv:2406.10707 (cs)

[Submitted on 15 Jun 2024]

Authors:Avinash Maurya, Robert Underwood, M. Mustafa Rafique, Franck Cappello, Bogdan Nicolae

Abstract: LLMs have seen rapid adoption in all domains. They need to be trained on high-end high-performance computing (HPC) infrastructures and ingest massive amounts of input data. Unsurprisingly, at such a large scale, unexpected events (e.g., component failures, software instability, undesirable learning patterns) are frequent and typically impact training negatively. Thus, LLMs need to be checkpointed frequently so that they can be rolled back to a stable state and subsequently fine-tuned. However, given the large sizes of LLMs, a straightforward checkpointing solution that directly writes the model parameters and optimizer state to persistent storage (e.g., a parallel file system) incurs significant I/O overheads. To address this challenge, in this paper we study how to reduce these I/O overheads to enable fast and scalable checkpointing for LLMs that can be applied at high frequency (up to the granularity of individual iterations) without significant impact on the training process. Specifically, we introduce a lazy asynchronous multi-level approach that takes advantage of the fact that the tensors making up the model and optimizer state shards remain immutable for extended periods of time, which makes it possible to copy their content in the background with minimal interference during the training process. We evaluate our approach at scales of up to 180 GPUs using different model sizes, parallelism settings, and checkpointing frequencies. The results show up to 48$\times$ faster checkpointing and 2.2$\times$ faster end-to-end training runtime compared with state-of-the-art checkpointing approaches.
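
The lazy asynchronous multi-level idea in the abstract can be illustrated with a small sketch. This is not the DataStates-LLM implementation: the class and function names below are hypothetical, plain Python lists stand in for GPU tensors, and a local directory stands in for the parallel file system. It assumes the state is only read during the forward/backward phase and is mutated in place by the update step, so a background thread can copy it to a host-side snapshot (level 1) and then flush that snapshot to a file (level 2) while training continues.

```python
import os
import pickle
import tempfile
import threading


class LazyAsyncCheckpointer:
    """Toy two-level checkpointer (hypothetical): state -> host snapshot -> file."""

    def __init__(self, directory):
        self.directory = directory
        self._snapshot_done = threading.Event()
        self._snapshot_done.set()  # no copy in flight initially
        self._worker = None

    def request(self, state, step):
        """Start a background checkpoint of `state` and return immediately."""
        self.wait_for_flush()  # simplification: at most one checkpoint in flight
        self._snapshot_done.clear()
        self._worker = threading.Thread(target=self._run, args=(state, step))
        self._worker.start()

    def _run(self, state, step):
        # Level 1: copy into a host-side buffer while the state is still immutable.
        snapshot = {name: list(values) for name, values in state.items()}
        self._snapshot_done.set()  # the trainer may mutate `state` from here on
        # Level 2: flush the snapshot to storage, off the training critical path.
        path = os.path.join(self.directory, f"ckpt_{step}.pkl")
        with open(path + ".tmp", "wb") as f:
            pickle.dump(snapshot, f)
        os.replace(path + ".tmp", path)  # rename so readers never see a partial file

    def wait_for_snapshot(self):
        """Call right before mutating the state (e.g. before the update step)."""
        self._snapshot_done.wait()

    def wait_for_flush(self):
        """Block until the in-flight checkpoint, if any, is fully written."""
        if self._worker is not None:
            self._worker.join()
            self._worker = None


def train(num_steps, directory):
    state = {"weights": [0.0, 0.0, 0.0]}
    ckpt = LazyAsyncCheckpointer(directory)
    for step in range(num_steps):
        ckpt.request(state, step)  # checkpoint the state as of the start of this step
        # Stand-in for forward/backward: only reads the state.
        grads = [1.0 for _ in state["weights"]]
        ckpt.wait_for_snapshot()  # blocks only if the copy has not finished yet
        for i, g in enumerate(grads):  # in-place update, like an optimizer step
            state["weights"][i] -= 0.1 * g
    ckpt.wait_for_flush()
    return state


def load(directory, step):
    with open(os.path.join(directory, f"ckpt_{step}.pkl"), "rb") as f:
        return pickle.load(f)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        final_state = train(3, d)
        print(final_state, load(d, 2))
```

The point of the design is that the training loop waits only at the moment it is about to overwrite the state, and only if the background copy is still running. The write to storage never blocks the loop except through the one-checkpoint-in-flight simplification in `request`.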

Comments:	Published at HPDC '24: The 33rd International Symposium on High-Performance Parallel and Distributed Computing. Source code at this https URL
Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Cite as:	arXiv:2406.10707 [cs.DC]
	(or arXiv:2406.10707v1 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2406.10707
Related DOI:	https://doi.org/10.1145/3625549.3658685

Submission history

From: Avinash Maurya
[v1] Sat, 15 Jun 2024 18:30:40 UTC (894 KB)
