Stateful Conformer with Cache-based Inference for Streaming Automatic Speech Recognition

Noroozi, Vahid; Majumdar, Somshubra; Kumar, Ankur; Balam, Jagadeesh; Ginsburg, Boris

arXiv:2312.17279 (cs)

[Submitted on 27 Dec 2023 (v1), last revised 2 May 2024 (this version, v3)]

Abstract: In this paper, we propose an efficient and accurate streaming speech recognition model based on the FastConformer architecture. We adapt the FastConformer architecture for streaming applications by (1) constraining both the look-ahead and past contexts in the encoder, and (2) introducing an activation caching mechanism that enables the non-autoregressive encoder to operate autoregressively during inference. The proposed model is designed to eliminate the accuracy disparity between training and inference time that is common in many streaming models. Furthermore, the proposed encoder works with various decoder configurations, including Connectionist Temporal Classification (CTC) and RNN-Transducer (RNNT) decoders. Additionally, we introduce a hybrid CTC/RNNT architecture that uses a shared encoder with both a CTC and an RNNT decoder to boost accuracy and save computation. We evaluate the proposed model on the LibriSpeech dataset and a multi-domain large-scale dataset, and demonstrate that it achieves better accuracy with lower latency and inference time than a conventional buffered streaming model baseline. We also show that training a model with multiple latencies can achieve better accuracy than single-latency models, while supporting multiple latencies with a single model. Our experiments also show that the hybrid architecture not only speeds up the convergence of the CTC decoder but also improves the accuracy of streaming models compared to single-decoder models.
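The two adaptations named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names below are hypothetical, the mask uses a simple fixed left/right window, and a plain causal 1-D convolution stands in for an encoder layer. The point it shows is that, once context is limited and the last few inputs are cached between chunks, chunk-by-chunk inference reproduces the full-utterance output exactly, so there is no train/inference mismatch.

```python
from typing import List, Tuple


def limited_context_mask(num_frames: int, left: int, right: int) -> List[List[bool]]:
    """mask[t][s] is True when frame t may attend to frame s.

    Assumption for illustration: a fixed window of `left` past frames and
    `right` look-ahead frames around each frame.
    """
    return [
        [t - left <= s <= t + right for s in range(num_frames)]
        for t in range(num_frames)
    ]


def causal_conv_full(x: List[float], kernel: List[float]) -> List[float]:
    """Causal 1-D convolution over a whole utterance (the offline view)."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(x)  # left-pad so frame t sees only frames <= t
    return [sum(kernel[j] * padded[t + j] for j in range(k)) for t in range(len(x))]


def causal_conv_step(
    chunk: List[float], cache: List[float], kernel: List[float]
) -> Tuple[List[float], List[float]]:
    """Process one chunk using the last k-1 inputs cached from earlier chunks."""
    k = len(kernel)
    padded = list(cache) + list(chunk)
    out = [sum(kernel[j] * padded[t + j] for j in range(k)) for t in range(len(chunk))]
    new_cache = padded[len(padded) - (k - 1):]  # keep the most recent k-1 inputs
    return out, new_cache


def stream(x: List[float], kernel: List[float], chunk_size: int) -> List[float]:
    """Run the layer chunk by chunk, carrying the cache between steps."""
    cache = [0.0] * (len(kernel) - 1)  # an empty history is equivalent to zero padding
    out: List[float] = []
    for start in range(0, len(x), chunk_size):
        y, cache = causal_conv_step(x[start:start + chunk_size], cache, kernel)
        out.extend(y)
    return out


if __name__ == "__main__":
    signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
    kernel = [0.5, 0.25, 0.25]
    # Streaming with a cache matches the full-utterance result.
    assert stream(signal, kernel, chunk_size=2) == causal_conv_full(signal, kernel)
```

The cache here holds raw inputs of a single layer; in a multi-layer encoder each layer would keep its own cache of intermediate activations, which is the sense in which the abstract speaks of activation caching.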

Comments:	Shorter version accepted to ICASSP 2024
Subjects:	Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2312.17279 [cs.CL]
	(or arXiv:2312.17279v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2312.17279

Submission history

From: Vahid Noroozi
[v1] Wed, 27 Dec 2023 21:04:26 UTC (124 KB)
[v2] Thu, 11 Jan 2024 19:19:16 UTC (145 KB)
[v3] Thu, 2 May 2024 21:38:10 UTC (147 KB)
