


Link to original content: https://unpaywall.org/10.1109/CLUSTER51413.2022.00044
OSTI.GOV — U.S. Department of Energy, Office of Scientific and Technical Information

Title: HVAC: Removing I/O Bottleneck for Large-Scale Deep Learning Applications

Conference ·

Scientific communities are increasingly adopting deep learning (DL) models in their applications to accelerate scientific discovery. However, with the rapid growth in the computing capabilities of HPC supercomputers, large-scale DL applications must spend a significant portion of training time performing I/O against a parallel storage system. Prior research has investigated optimization techniques such as prefetching and caching. Unfortunately, non-trivial challenges prevent adopting these existing solutions on HPC supercomputers for large-scale DL training: poor performance and/or failures at extreme scale, lack of portability and generality in design, complex deployment methodology, and being limited to a specific application or dataset. To address these challenges, we propose High-Velocity AI Cache (HVAC), a distributed read-cache layer that targets and fully exploits node-local or near-node-local storage technology. HVAC seamlessly accelerates read I/O by aggregating node-local or near-node-local storage, avoiding metadata lookups and file locking while preserving portability in the application code. We deploy and evaluate HVAC on 1,024 nodes (with over 6,000 NVIDIA V100 GPUs) of the Summit supercomputer. In particular, we evaluate the scalability, efficiency, accuracy, and load distribution of HVAC compared to GPFS and XFS-on-NVMe. Across four different DL applications, we observe an average 25% performance improvement over GPFS and a 9% slowdown relative to XFS-on-NVMe, which scales linearly and serves as the performance upper bound. We envision HVAC as an important caching library for upcoming HPC supercomputers such as Frontier.

Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC05-00OR22725
OSTI ID:
1902810
Resource Relation:
Conference: IEEE International Conference on Cluster Computing (CLUSTER 2022), Heidelberg, Germany, 6–9 September 2022
Country of Publication:
United States
Language:
English
