T5 ASR Grammar Corrector

For my master's project, Curious AI, I trialed several VAD and ASR models to find the right fit for its real-time WebRTC streaming architecture.

After a lot of benchmarking and tweaking, I settled on pyannote for VAD and NVIDIA NeMo for ASR.

 

The issue was that, with the two set up for real-time inference using a sliding-window approach, the ASR output was riddled with grammar mistakes, partial tokens, and repeated tokens.

 

I tried several ASR grammar correction models, but they were either too heavy and added too much latency to my pipeline, needed more context than my short real-time window could provide to be accurate, or were so lightweight they did nothing.

 

I needed something that didn't yet exist, so I set out to create my own model.


I quickly learned that building an effective grammar correction model for automatic speech recognition (ASR) outputs isn't easy. The biggest hurdle was the training data. ASR outputs are messy. They drop words, repeat others, mangle grammar, and often give you a stream that feels more like an electronic stutter than a complete sentence. I wanted to create a model that could clean that up in real-time.

 

Here’s how I created the massive training dataset that powers my ASR Grammar Correction models.

 

Generating a Clean Language Foundation with Mistral

Before simulating ASR-style mistakes, I needed a solid base of grammatically correct English. I used a local instance of Mistral to generate over 40 million clean English sentences. These were varied, natural, and contextually rich, spanning different subjects and industries. The output was fluid and carried common human speech habits, rather than reading like a well-spoken audiobook narrator.
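For flavour, here's roughly what that generation loop looked like in shape. The prompt wording, the topic, and the endpoint are placeholders rather than my exact setup, and the local Mistral instance is assumed here to be served by Ollama.

import requests

# Illustrative sketch only: asking a local Mistral instance (assumed to be behind an
# Ollama endpoint) for batches of clean, conversational sentences.
OLLAMA_URL = "http://localhost:11434/api/generate"

PROMPT = (
    "Write 20 natural, grammatically correct English sentences a person might "
    "say out loud about {topic}. Use everyday spoken phrasing. "
    "One sentence per line, no numbering."
)

def generate_clean_sentences(topic: str) -> list[str]:
    resp = requests.post(
        OLLAMA_URL,
        json={"model": "mistral", "prompt": PROMPT.format(topic=topic), "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    text = resp.json()["response"]
    # Keep non-empty lines; deduplication and length filtering happen downstream.
    return [line.strip() for line in text.splitlines() if line.strip()]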

 

This clean corpus became the “target” side of my dataset: what the model should learn to aim for.

 

Simulating ASR Chaos With 90 Million Noisy/Clean Pairs

Clean text alone doesn't teach a model how to fix bad grammar; I had to teach it what “bad” looked like.

I wrote a series of custom noise functions that mimic the kinds of errors common in ASR systems:

 

noise_functions = [
    homophone_mistake,             # swap a word for a homophone, e.g. "their" -> "there"
    break_subject_verb_agreement,  # e.g. "she walks" -> "she walk"
    drop_auxiliaries,              # e.g. "I am going" -> "I going"
    corrupt_verb_tense,            # e.g. "he went" -> "he go"
    corrupt_wh_question,           # mangle who/what/where/when question structure
    asr_contraction_noise,         # split, merge, or mangle contractions
    asr_article_noise,             # drop or swap "a" / "an" / "the"
    asr_preposition_noise,         # drop or swap prepositions
    asr_pronoun_substitution,      # substitute one pronoun for another
    duplicate_random_word,         # repeat a word, mimicking a stutter
]
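To give a sense of what these functions do, here's a minimal sketch of two of them. The auxiliary list and the exact behaviour are illustrative; the real implementations handle casing, punctuation, and edge cases more carefully.

import random

AUXILIARIES = {"am", "is", "are", "was", "were", "has", "have", "had",
               "will", "would", "do", "does", "did"}

def drop_auxiliaries(sentence: str) -> str:
    """Remove auxiliary verbs, e.g. 'I am going home' -> 'I going home'."""
    words = sentence.split()
    kept = [w for w in words if w.lower() not in AUXILIARIES]
    return " ".join(kept) if kept else sentence

def duplicate_random_word(sentence: str) -> str:
    """Repeat one word, mimicking an ASR stutter: 'turn left here' -> 'turn turn left here'."""
    words = sentence.split()
    if len(words) < 2:
        return sentence
    i = random.randrange(len(words))
    return " ".join(words[:i] + [words[i]] + words[i:])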

Using these, I generated 90 million paired samples of noisy input and clean target sentences. This became the core dataset for training my models.
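The pairing step itself was straightforward: sample a few noise functions per clean sentence and write out the noisy input alongside the clean target. A rough sketch, assuming the clean corpus is one sentence per line and the output is JSONL (file names and field names are placeholders):

import json
import random

def make_noisy(clean: str, max_noises: int = 3) -> str:
    """Apply a random subset of the noise functions to one clean sentence."""
    noisy = clean
    for fn in random.sample(noise_functions, k=random.randint(1, max_noises)):
        noisy = fn(noisy)
    return noisy

def build_pairs(clean_path: str, out_path: str) -> None:
    """Stream the clean corpus and emit {'input': noisy, 'target': clean} rows."""
    with open(clean_path) as src, open(out_path, "w") as dst:
        for line in src:
            clean = line.strip()
            if not clean:
                continue
            dst.write(json.dumps({"input": make_noisy(clean), "target": clean}) + "\n")

build_pairs("clean_sentences.txt", "pairs.jsonl")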

 

Training T5 Models on the Dataset

With this massive corpus of noisy/clean pairs, I fine-tuned both T5-small and T5-base models. These models were chosen for their balance of speed and performance — important because my goal was real-time correction, not just academic accuracy.
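The fine-tuning itself was standard sequence-to-sequence training. Below is a condensed sketch using the Hugging Face Seq2SeqTrainer; the task prefix, sequence lengths, and hyperparameters are illustrative placeholders rather than the values I actually trained with, and "pairs.jsonl" refers to the paired data sketched above.

from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

MODEL_NAME = "t5-small"      # repeated with "t5-base"
PREFIX = "fix grammar: "     # assumed task prefix, not necessarily what I used

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

dataset = load_dataset("json", data_files={"train": "pairs.jsonl"})["train"]

def preprocess(batch):
    # Tokenize noisy inputs and clean targets into model inputs and labels.
    model_inputs = tokenizer([PREFIX + x for x in batch["input"]], max_length=64, truncation=True)
    labels = tokenizer(text_target=batch["target"], max_length=64, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="asr-grammar-corrector",
        per_device_train_batch_size=64,
        learning_rate=3e-4,
        num_train_epochs=1,
        fp16=True,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()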

Early results were promising: the models quickly learned to remove common stutters, fix missing words, and even untangle long messy clauses.

But real ASR data has more quirks than just bad grammar.

 

Sliding-Window ASR-Style Noise: Simulating Streamed Input

To go a step further, I generated a second variant of the dataset designed to simulate streaming ASR output, where text arrives in small, overlapping windows. On top of the existing noise functions, I introduced window-boundary artifacts such as duplicated overlaps and truncated word fragments.

This new dataset mimicked how real-time transcription often works in live systems, where every half-second of speech might overlap with the previous. Training on this helped the model better understand temporal context and avoid getting tripped up by duplicated or incomplete inputs.
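As a rough illustration of how streamed input can be simulated from clean text, here's a sketch that cuts a sentence into overlapping word windows and occasionally truncates a boundary word. The window size, overlap, and truncation probability are placeholders, not my actual settings.

import random

def simulate_stream_windows(sentence: str, window: int = 6, overlap: int = 2) -> list[str]:
    """Cut a sentence into overlapping word windows, occasionally truncating
    a boundary word to mimic a partial token from a streaming decoder."""
    words = sentence.split()
    step = max(window - overlap, 1)
    chunks = []
    for start in range(0, len(words), step):
        chunk = words[start:start + window]
        # Occasionally cut the final word short, e.g. "transcription" -> "transcri"
        if random.random() < 0.3 and len(chunk[-1]) > 4:
            chunk[-1] = chunk[-1][: random.randint(3, len(chunk[-1]) - 1)]
        chunks.append(" ".join(chunk))
    return chunks

# A streaming-style noisy input is then the concatenation of these windows,
# paired against the original clean sentence as the target.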

 

Overall I was very happy with the results, which you can see below: a good balance of accuracy and very low latency. While the streaming version loses a little accuracy, it has been near flawless in my master's project.

 

| Model | Type | Precision | Latency (s/sample) | VRAM (MB) | BLEU | ROUGE-L | Accuracy (%)¹ | Token Accuracy (%)² | Size (MB) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| dj-ai-asr-grammar-corrector-t5-base | HF | fp32 | 0.1151 | 24.98 | 78.92 | 90.31 | 44.62 | 90.39 | 5956.76 |
| dj-ai-asr-grammar-corrector-t5-small | HF | fp32 | 0.0648 | 6.27 | 76.47 | 89.54 | 39.59 | 88.76 | 1620.15 |
| dj-ai-asr-grammar-corrector-t5-small-streaming | HF | fp32 | 0.0634 | 14.77 | 76.25 | 89.61 | 39.9 | 88.54 | 1620.65 |

 

You can try the models here: https://huggingface.co/spaces/dayyanj/dj-ai-asr-grammar-corrector-demo
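If you'd rather run them locally, something like the snippet below should work. The repo ID is my assumption based on the demo Space's namespace, and depending on how a given checkpoint was trained it may expect a task prefix on the input.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Assumed repo ID, inferred from the demo Space's namespace; adjust as needed.
MODEL_ID = "dayyanj/dj-ai-asr-grammar-corrector-t5-small"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)

noisy = "i i going to the the store later tonigh"
inputs = tokenizer(noisy, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))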

 

The success of this model has set me on a new path: developing P.A.S.T.A (Perception-Aware Streaming Transcription Architecture), my next-generation speech model.

It aims to combine voice activity detection, audio classification, and streaming transcription in a single model.

P.A.S.T.A is designed to operate as a unified, branch-based neural architecture, enabling modular yet integrated audio perception and understanding. I've already created the dataset for training the VAD and classification heads, and I've started training CNN, DEEPCNN (custom CNN+TDNN), ResNet18, and EfficientNet backbones to see which performs best. I'll post more updates as this progresses.
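To make "branch-based" a little more concrete, here's a rough sketch of the shape I have in mind: a shared audio backbone over spectrogram input feeding separate VAD and classification heads, with the transcription branch omitted for brevity. The ResNet18 backbone and layer sizes are placeholders while I benchmark the candidates above.

import torch
import torch.nn as nn
from torchvision.models import resnet18

class PastaSketch(nn.Module):
    """Rough sketch of a branch-based model: one shared backbone over
    log-mel spectrogram input, with separate VAD and classification heads."""

    def __init__(self, num_classes: int = 10):
        super().__init__()
        backbone = resnet18(weights=None)
        # Accept single-channel spectrograms instead of RGB images.
        backbone.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
        backbone.fc = nn.Identity()                   # expose 512-d features
        self.backbone = backbone
        self.vad_head = nn.Linear(512, 1)             # speech / non-speech logit
        self.cls_head = nn.Linear(512, num_classes)   # audio event classes

    def forward(self, spectrogram: torch.Tensor):
        # spectrogram: (batch, 1, mel_bins, frames)
        features = self.backbone(spectrogram)
        return self.vad_head(features), self.cls_head(features)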