Abstract:

In this paper, the authors propose an efficient and accurate streaming speech recognition model based on the FastConformer architecture. They address the issue of accuracy disparity between training and inference time that is often encountered in streaming models. To achieve this, the authors adapted the FastConformer architecture for streaming applications by constraining both the look-ahead and past contexts in the encoder. They also introduced an activation caching mechanism that enables the non-autoregressive encoder to operate autoregressively during inference.
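
To make the two ideas in this paragraph concrete, here is a toy sketch in plain Python. It is not the authors' implementation: the function names, the list-based "frames", and the use of a causal convolution as the cached layer are all assumptions made for illustration. The first function builds an attention mask with limited past and look-ahead context; the other two show how caching the last few inputs lets chunk-by-chunk processing reproduce the full-utterance result exactly.

```python
def limited_context_mask(num_frames, left, right):
    """mask[i][j] is True when frame i may attend to frame j.

    `left` and `right` are hypothetical parameter names for the number of
    past and look-ahead frames each position is allowed to see.
    """
    return [
        [i - left <= j <= i + right for j in range(num_frames)]
        for i in range(num_frames)
    ]


def causal_conv(chunk, kernel, cache=None):
    """Causal 1-D convolution over a chunk of scalar frames.

    `cache` holds the last len(kernel) - 1 inputs seen in earlier chunks
    (zeros before the first chunk). Returns (outputs, new_cache).
    """
    k = len(kernel)
    if cache is None:
        cache = [0.0] * (k - 1)
    padded = cache + list(chunk)
    outputs = [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(chunk))
    ]
    # Keep only the most recent k - 1 inputs for the next chunk.
    new_cache = padded[len(padded) - (k - 1):]
    return outputs, new_cache


def stream(frames, kernel, chunk_size):
    """Process `frames` chunk by chunk, carrying the cache between calls."""
    cache = None
    outputs = []
    for start in range(0, len(frames), chunk_size):
        chunk_out, cache = causal_conv(frames[start:start + chunk_size], kernel, cache)
        outputs.extend(chunk_out)
    return outputs
```

Calling `stream(signal, kernel, chunk_size)` with any chunk size gives the same values as a single `causal_conv(signal, kernel)` call over the whole signal, which is the property that removes the mismatch between training on full utterances and inferring on chunks.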

One interesting aspect of the proposed model is its versatility, as it can work with various decoder configurations including Connectionist Temporal Classification (CTC) and RNN-Transducer (RNNT) decoders. The authors even introduce a hybrid CTC/RNNT architecture that utilizes a shared encoder with both a CTC and RNNT decoder. This hybrid architecture not only boosts accuracy but also saves computation.
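
The shared-encoder idea can be sketched as follows. This is a minimal illustration, not the paper's code: the class name, the callable "heads", and in particular the weighted-sum loss with a `ctc_weight` parameter are assumptions, since the summary above does not say how the two losses are combined. The point it shows is that the encoder runs once per utterance and both decoders reuse its output.

```python
class HybridCtcRnntSketch:
    """Toy hybrid model: one shared encoder feeding two decoder heads.

    `encoder`, `ctc_head`, and `rnnt_head` are placeholder callables;
    each head returns a scalar loss for (encoded, target).
    """

    def __init__(self, encoder, ctc_head, rnnt_head, ctc_weight=0.5):
        self.encoder = encoder
        self.ctc_head = ctc_head
        self.rnnt_head = rnnt_head
        self.ctc_weight = ctc_weight  # hypothetical interpolation weight

    def loss(self, audio, target):
        encoded = self.encoder(audio)  # computed once, shared by both heads
        ctc_loss = self.ctc_head(encoded, target)
        rnnt_loss = self.rnnt_head(encoded, target)
        # Assumed combination rule: a simple weighted sum of the two losses.
        return self.ctc_weight * ctc_loss + (1.0 - self.ctc_weight) * rnnt_loss
```

Compared with training two separate models, the encoder forward pass, typically the most expensive part, is paid for only once.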

To evaluate the effectiveness of their model, the authors conducted experiments using the LibriSpeech dataset and a large-scale multi-domain dataset. The results showed that the proposed model achieved better accuracy with lower latency and shorter inference time than a conventional buffered streaming baseline.

An interesting finding from their experiments is that training a model with multiple latencies can improve accuracy compared to using a single latency model. Additionally, this approach enables support for multiple latencies using just a single model, which can be advantageous in practical applications.
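
One way to picture multi-latency training is to draw a different look-ahead size for each training step, so the same weights learn to work under several latency budgets. The sketch below assumes uniform random sampling from a fixed list of look-ahead sizes; the summary does not describe the actual sampling scheme, and the function name and the example values are hypothetical.

```python
import random


def lookahead_schedule(num_steps, lookahead_choices, seed=0):
    """Pick one look-ahead size (in frames) per training step.

    Uniform sampling is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    return [rng.choice(lookahead_choices) for _ in range(num_steps)]


# Hypothetical look-ahead sizes; at inference time the user would pick
# one of these values to trade accuracy against latency.
schedule = lookahead_schedule(num_steps=8, lookahead_choices=[0, 1, 4, 13])
```

Each value in `schedule` would then set the look-ahead limit used for that training step.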

Furthermore, the authors demonstrated that the hybrid architecture not only speeds up the convergence of the CTC decoder but also improves the accuracy of streaming models when compared to single decoder models.

In conclusion, the FastConformer-based streaming model addresses the accuracy gap between training and inference while supporting CTC, RNNT, and hybrid CTC/RNNT decoders. In the reported experiments it delivered better accuracy, lower latency, and shorter inference time than a buffered streaming baseline, which makes it a promising basis for real-time speech recognition systems in various domains.
