TODM: Train Once Deploy Many Efficient Supernet-Based RNN-T Compression For On-device ASR Models

Shangguan, Yuan; Yang, Haichuan; Li, Danni; Wu, Chunyang; Fathullah, Yassir; Wang, Dilin; Dalmia, Ayushi; Krishnamoorthi, Raghuraman; Kalinli, Ozlem; Jia, Junteng; Mahadeokar, Jay; Lei, Xin; Seltzer, Mike; Chandra, Vikas

arXiv:2309.01947 (cs)

[Submitted on 5 Sep 2023 (v1), last revised 27 Nov 2023 (this version, v2)]

Abstract: Automatic Speech Recognition (ASR) models need to be optimized for specific hardware before they can be deployed on devices. This can be done by tuning the model's hyperparameters or exploring variations in its architecture. Re-training and re-validating models after each such change is resource-intensive. This paper presents TODM (Train Once Deploy Many), a new approach to efficiently train many sizes of hardware-friendly on-device ASR models with GPU-hours comparable to those of a single training job. TODM leverages insights from prior work on Supernet, where Recurrent Neural Network Transducer (RNN-T) models share weights within a Supernet. It reduces the layer sizes and widths of the Supernet to obtain subnetworks, yielding smaller models suitable for all hardware types. We introduce a novel combination of three techniques to improve the outcomes of the TODM Supernet: adaptive dropouts, an in-place Alpha-divergence knowledge distillation, and the use of the ScaledAdam optimizer. We validate our approach by comparing Supernet-trained versus individually tuned Multi-Head State Space Model (MH-SSM) RNN-T models on LibriSpeech. Results demonstrate that our TODM Supernet either matches or surpasses the performance of manually tuned models, by up to 3% relative in word error rate (WER), while keeping the cost of training many models at a small constant.
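The weight-sharing idea in the abstract, where subnetworks are obtained by reducing the widths of Supernet layers, can be illustrated with a minimal sketch. This is not the paper's implementation: the `SharedLinear` class, the `sample_subnet_widths` helper, the width choices (0.5, 0.75, 1.0), and the layer sizes are all hypothetical, and the sketch assumes that a narrower subnetwork simply reuses the leading slice of the full-size weight matrix. It omits RNN-T structure, adaptive dropout, and distillation.

```python
import random


class SharedLinear:
    """Toy supernet linear layer: subnetworks reuse the leading slice of its weights."""

    def __init__(self, max_in, max_out, seed=0):
        rng = random.Random(seed)
        self.max_in = max_in
        self.max_out = max_out
        # One full-size weight matrix, shared by every subnetwork.
        self.weight = [[rng.uniform(-0.1, 0.1) for _ in range(max_in)]
                       for _ in range(max_out)]

    def forward(self, x, out_width=None):
        # Use only the first `out_width` output units and the first len(x) inputs.
        out_width = self.max_out if out_width is None else out_width
        if out_width > self.max_out or len(x) > self.max_in:
            raise ValueError("requested slice exceeds supernet size")
        return [sum(w * v for w, v in zip(row[:len(x)], x))
                for row in self.weight[:out_width]]


def sample_subnet_widths(max_width, num_layers, choices=(0.5, 0.75, 1.0), seed=None):
    """Pick one width per layer from a hypothetical search space (an assumption)."""
    rng = random.Random(seed)
    return [max(1, int(max_width * rng.choice(choices))) for _ in range(num_layers)]


# Build a three-layer toy supernet and run one sampled subnetwork through it.
layers = [SharedLinear(8, 8, seed=i) for i in range(3)]
widths = sample_subnet_widths(8, len(layers), seed=42)
x = [1.0] * 8
for layer, width in zip(layers, widths):
    x = layer.forward(x, out_width=width)
print(widths, len(x))
```

Because every subnetwork reads from the same weight matrices, a training step on any sampled subnetwork also updates parameters used by the others, which is the general reason Supernet training can cover many model sizes in one job.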

Comments:	Meta AI; Submitted to ICASSP 2024
Subjects:	Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2309.01947 [cs.CL]
	(or arXiv:2309.01947v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2309.01947

Submission history

From: Yuan Shangguan
[v1] Tue, 5 Sep 2023 04:47:55 UTC (179 KB)
[v2] Mon, 27 Nov 2023 05:03:31 UTC (1,397 KB)
