T5 ASR Grammar Corrector
For my master project, Curious AI, I trialed several VAD and ASR models to find the right fit for its live WebRTC real-time streaming architecture.
After a lot of benchmarking and tweaking, I settled on pyannote for VAD and NVIDIA NeMo for ASR.
The issue was that, with the two set up for real-time inference using a sliding-window approach, the ASR output was riddled with grammar mistakes, partial tokens, and repeated tokens.
I tried several ASR grammar correction models, but they were either too heavy and added too much latency to my pipeline, needed more context than my short real-time window could provide to be accurate, or were so lightweight they did nothing.
I needed something that didn't yet exist, so I set out to create my own model.
I quickly learned that building an effective grammar correction model for automatic speech recognition (ASR) outputs isn't easy. The biggest hurdle was the training data. ASR outputs are messy. They drop words, repeat others, mangle grammar, and often give you a stream that feels more like an electronic stutter than a complete sentence. I wanted to create a model that could clean that up in real-time.
Here’s how I created the massive training dataset that powers my ASR Grammar Correction models.
Generating a Clean Language Foundation with Mistral
Before simulating ASR-style mistakes, I needed a solid base of grammatically correct English. I used a local instance of Mistral to generate over 40 million clean English sentences. They were varied, natural, and contextually rich, spanning different subjects and industries. The output was fluid and carried common human speech habits, rather than sounding like a well-spoken audiobook narrator.
This clean corpus became the “target” side of my dataset: what the model should learn to aim for.
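My exact prompting setup isn't the important part, but a minimal sketch of the generation loop looks something like this. It assumes Mistral is served locally through Ollama's REST API; the topic list, prompt wording, and loop count are purely illustrative.

```python
# Minimal sketch of the clean-sentence generation loop (assumes a local
# Ollama-served Mistral at http://localhost:11434; topics and prompt are illustrative).
import random
import requests

TOPICS = ["travel", "cooking", "software", "healthcare", "sports"]  # illustrative

PROMPT_TEMPLATE = (
    "Write 20 short, natural, conversational English sentences about {topic}. "
    "Use everyday speech patterns, one sentence per line, no numbering."
)

def generate_batch(topic: str) -> list[str]:
    """Ask the local Mistral instance for a batch of clean sentences."""
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "mistral", "prompt": PROMPT_TEMPLATE.format(topic=topic), "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    text = resp.json()["response"]
    # Keep non-empty lines as individual target sentences
    return [line.strip() for line in text.splitlines() if line.strip()]

with open("clean_corpus.txt", "a", encoding="utf-8") as f:
    for _ in range(1000):  # scale this up to reach tens of millions of sentences
        for sentence in generate_batch(random.choice(TOPICS)):
            f.write(sentence + "\n")
```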
Simulating ASR Chaos With 90 Million Noisy/Clean Pairs
Clean text alone doesn't teach a model how to fix bad grammar; I had to teach it what “bad” looked like.
I wrote a series of custom noise functions that mimic the kinds of errors common in ASR systems:
Word repetitions and stutters (e.g., "I I went to the store")
Homophone swaps (“there” vs. “their”)
Dropped articles and auxiliaries
Weird token merges and run-ons
Spacing issues, casing errors, disfluencies, the works (in total, 10 noise functions were used to generate common ASR issues):
```python
noise_functions = [
    homophone_mistake,
    break_subject_verb_agreement,
    drop_auxiliaries,
    corrupt_verb_tense,
    corrupt_wh_question,
    asr_contraction_noise,
    asr_article_noise,
    asr_preposition_noise,
    asr_pronoun_substitution,
    duplicate_random_word,
]
```

Using these, I generated 90 million paired samples of noisy input and clean target sentences. This became the core dataset for training my models.
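The individual functions aren't reproduced here, but as a rough illustration, two of them and the pairing step might look like the sketch below. The homophone map, the probabilities, and the make_pair helper are hypothetical stand-ins, not my production code.

```python
# Illustrative sketch of two noise functions and the noisy/clean pairing step.
import random

HOMOPHONES = {"there": "their", "their": "there", "to": "too", "your": "you're"}  # tiny sample map

def homophone_mistake(sentence: str) -> str:
    """Swap one word for a common homophone, if any are present."""
    words = sentence.split()
    candidates = [i for i, w in enumerate(words) if w.lower() in HOMOPHONES]
    if candidates:
        i = random.choice(candidates)
        words[i] = HOMOPHONES[words[i].lower()]
    return " ".join(words)

def duplicate_random_word(sentence: str) -> str:
    """Repeat a random word to mimic an ASR stutter ("I I went to the store")."""
    words = sentence.split()
    if not words:
        return sentence
    i = random.randrange(len(words))
    return " ".join(words[:i] + [words[i]] + words[i:])

def make_pair(clean: str, noise_functions, n_errors: int = 2) -> tuple[str, str]:
    """Apply a few randomly chosen corruptions and return a (noisy, clean) pair."""
    noisy = clean
    for fn in random.sample(noise_functions, k=min(n_errors, len(noise_functions))):
        noisy = fn(noisy)
    return noisy, clean
```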
Training T5 Models on the Dataset
With this massive corpus of noisy/clean pairs, I fine-tuned both T5-small and T5-base models. These models were chosen for their balance of speed and performance — important because my goal was real-time correction, not just academic accuracy.
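For anyone curious what that fine-tuning looks like in practice, here is a condensed sketch using Hugging Face Transformers. The TSV layout, the "fix grammar:" task prefix, and the hyperparameters are assumptions for illustration; the real runs used the full 90-million-pair corpus.

```python
# Condensed T5 fine-tuning sketch (dataset format, prefix, and hyperparameters are assumed).
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Assumed file layout: one noisy/clean pair per line, tab-separated.
dataset = load_dataset("csv", data_files={"train": "pairs.tsv"},
                       delimiter="\t", column_names=["noisy", "clean"])

def preprocess(batch):
    # T5 is a text-to-text model, so a task prefix on the input side helps.
    inputs = tokenizer(["fix grammar: " + s for s in batch["noisy"]],
                       max_length=128, truncation=True)
    labels = tokenizer(text_target=batch["clean"], max_length=128, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = dataset["train"].map(preprocess, batched=True,
                                 remove_columns=["noisy", "clean"])

args = Seq2SeqTrainingArguments(
    output_dir="asr-grammar-t5-small",
    per_device_train_batch_size=64,
    learning_rate=3e-4,
    num_train_epochs=1,
    fp16=True,
    logging_steps=500,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```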
Early results were promising: the models quickly learned to remove common stutters, fix missing words, and even untangle long messy clauses.
But real ASR data has more quirks than just bad grammar.
Sliding Window ASR-style Noise - Simulating Streamed Input
To go a step further, I generated a second variant of the dataset designed to simulate streaming ASR output, where text arrives in small, overlapping windows. I introduced:
Partial tokens and mid-word cuts
Overlapping tokens across chunks
Fragmented clauses with no punctuation
Repetition due to temporal overlap
This new dataset mimicked how real-time transcription often works in live systems, where every half-second of speech might overlap with the previous. Training on this helped the model better understand temporal context and avoid getting tripped up by duplicated or incomplete inputs.
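A simplified version of that streaming-style corruption is sketched below: split a clean sentence into short overlapping windows, occasionally cut the boundary word mid-token, and keep the duplicated overlap. The window size, overlap, and probabilities here are illustrative, not the exact values I used.

```python
# Sketch of streaming-style noise: overlapping windows, mid-word cuts, duplicated overlap.
import random

def sliding_window_noise(sentence: str, window: int = 6, overlap: int = 2) -> str:
    words = sentence.split()
    chunks, start = [], 0
    while start < len(words):
        chunk = words[start:start + window]
        # Occasionally simulate a mid-word cut at the chunk boundary
        if chunk and random.random() < 0.3:
            last = chunk[-1]
            if len(last) > 3:
                chunk[-1] = last[: random.randint(2, len(last) - 1)]
        chunks.append(" ".join(chunk))
        start += window - overlap  # step forward, re-emitting the overlap tokens
    # Concatenate without punctuation, the way raw streamed hypotheses arrive
    return " ".join(chunks).lower()
```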
Overall, I was very happy with the results, which you can see below: a good balance of accuracy and very low latency. While the streaming version suffers slightly in accuracy, in my master project it has been near flawless.
| Model | Type | Precision | Latency (s/sample) | VRAM (MB) | BLEU | ROUGE-L | Accuracy (%)¹ | Token Accuracy (%)² | Size (MB) |
|---|---|---|---|---|---|---|---|---|---|
| dj-ai-asr-grammar-corrector-t5-base | HF | fp32 | 0.1151 | 24.98 | 78.92 | 90.31 | 44.62 | 90.39 | 5956.76 |
| dj-ai-asr-grammar-corrector-t5-small | HF | fp32 | 0.0648 | 6.27 | 76.47 | 89.54 | 39.59 | 88.76 | 1620.15 |
| dj-ai-asr-grammar-corrector-t5-small-streaming | HF | fp32 | 0.0634 | 14.77 | 76.25 | 89.61 | 39.9 | 88.54 | 1620.65 |
You can try the models here: https://huggingface.co/spaces/dayyanj/dj-ai-asr-grammar-corrector-demo
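If you'd rather run a checkpoint locally than use the demo Space, loading it with Transformers is straightforward. The Hub repo id below is an assumption based on the model names in the table; adjust it (and check the model card for any required input prefix) to match what's listed on the Space.

```python
# Loading a released checkpoint locally (repo id assumed from the model names above).
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

repo_id = "dayyanj/dj-ai-asr-grammar-corrector-t5-small"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForSeq2SeqLM.from_pretrained(repo_id)

noisy = "i i went to to the store and and buyed some some milk"
inputs = tokenizer(noisy, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```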
The success of these models has set me on a new path: developing P.A.S.T.A (Perception-Aware Streaming Transcription Architecture), my next-generation speech model.
It aims to combine:
Voice Activity Detection (VAD)
Overlapping speaker diarization
Noise classification (521 classes)
Streaming ASR
Real-time grammar correction
Hierarchical memory for speech recall and reinforcement
P.A.S.T.A is designed to operate as a unified, branch-based neural architecture, enabling modular yet integrated audio perception and understanding. I've already created the dataset for training the VAD and classification heads and have started training CNN, DEEPCNN (custom CNN+TDNN), ResNet18, and EfficientNet backbones to see which gives the best performance. I'll share more as this progresses.
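To make the "shared backbone, multiple branch heads" idea concrete, here is a toy PyTorch sketch. This is not P.A.S.T.A itself: the 521-class noise head follows the list above, but the ResNet18 backbone over log-mel features, the clip-level VAD head, and all dimensions are illustrative.

```python
# Toy sketch of a branch-based audio model: one shared backbone, multiple heads.
import torch
import torch.nn as nn
import torchvision.models as models

class BranchedAudioPerceiver(nn.Module):
    def __init__(self, n_noise_classes: int = 521):
        super().__init__()
        backbone = models.resnet18(weights=None)
        # Accept single-channel log-mel spectrograms instead of RGB images
        backbone.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.backbone = nn.Sequential(*list(backbone.children())[:-1])  # drop the fc layer
        self.vad_head = nn.Linear(512, 1)                  # clip-level speech/non-speech logit
        self.noise_head = nn.Linear(512, n_noise_classes)  # AudioSet-style noise classes

    def forward(self, mel: torch.Tensor):
        # mel: (batch, 1, n_mels, frames)
        feats = self.backbone(mel).flatten(1)              # (batch, 512)
        return {"vad": self.vad_head(feats), "noise": self.noise_head(feats)}

model = BranchedAudioPerceiver()
dummy = torch.randn(2, 1, 64, 200)                         # two clips of log-mel features
out = model(dummy)
print(out["vad"].shape, out["noise"].shape)                # (2, 1) and (2, 521)
```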