Update On The Disentangler Model
After training, I found a few issues with the disentangler model (designed to separate noise and voice for downstream consumption).
The model has multiple output heads, including raw embeddings for voice and noise, PCEN embeddings for voice and noise, and fused (raw + PCEN) embeddings for voice and noise.
There were timing issues with the embeddings from each head. I ran a number of tests on various approaches to eliminate, as far as possible, the latency on each head, and to ensure the final outputs were all aligned so downstream consumers can match embeddings across heads without any further latency.
This led to a few other refactors to improve the overall streaming architecture, aiming for the best possible throughput without sacrificing the quality needed for learning and downstream use. It also meant I needed to bolster my training dataset. Initially I was training on a customised triplet dataset of 1 million noise/voice/mixed 2-second triplet wav files, all in English. To make the model truly language agnostic, I added a further 1 million triplets from a variety of languages, with enough global phoneme variation to ensure the model didn't get too comfortable with English alone.
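As a rough illustration, here is a minimal sketch of how such triplets could be loaded; the file layout, naming scheme, sample rate, and class name are assumptions for illustration rather than my actual pipeline.

```python
from pathlib import Path

import torch
import torchaudio
from torch.utils.data import Dataset


class TripletWavDataset(Dataset):
    """Loads (noise, voice, mixed) 2-second wav triplets.

    Assumes each triplet shares a stem, e.g. 000123_noise.wav,
    000123_voice.wav, 000123_mixed.wav under one root directory.
    """

    def __init__(self, root: str, sample_rate: int = 16000, seconds: float = 2.0):
        self.root = Path(root)
        self.sample_rate = sample_rate
        self.num_samples = int(sample_rate * seconds)
        self.stems = sorted(p.name.rsplit("_", 1)[0] for p in self.root.glob("*_voice.wav"))

    def __len__(self) -> int:
        return len(self.stems)

    def _load(self, path: Path) -> torch.Tensor:
        wav, sr = torchaudio.load(path)
        if sr != self.sample_rate:
            wav = torchaudio.functional.resample(wav, sr, self.sample_rate)
        wav = wav.mean(dim=0)                 # mono
        wav = wav[: self.num_samples]         # clip to 2 s
        if wav.numel() < self.num_samples:    # pad short files
            wav = torch.nn.functional.pad(wav, (0, self.num_samples - wav.numel()))
        return wav

    def __getitem__(self, idx: int):
        stem = self.stems[idx]
        return {
            "noise": self._load(self.root / f"{stem}_noise.wav"),
            "voice": self._load(self.root / f"{stem}_voice.wav"),
            "mixed": self._load(self.root / f"{stem}_mixed.wav"),
        }
```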
Now the model has:
- Precise synchronization: All embeddings are guaranteed to represent the same time points
- 100Hz target resolution: Required for harmonic/prosodic learning
- Biological accuracy: Matches auditory cortex temporal processing
- Streaming compatibility: Seamlessly switches between training/inference modes
- Multi-scale features: Raw (transients) + PCEN (harmonics) + Fused (integrated)
- Time-specific access: Perfect for proto-syllable learning
Key Strengths:
Perfect for downstream proto-learning:
```python
aligned_emb = model.get_aligned_embeddings_at_time(audio, time_ms=150)
# Returns voice_raw, voice_pcen and voice_fused, all taken from exactly 150 ms
```
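For context, here is a hypothetical sketch of what a call like this could do internally, assuming every head emits frames on a shared 100Hz grid; the function and dictionary layout are illustrative, not the model's actual internals.

```python
import torch

FRAME_RATE_HZ = 100  # target temporal resolution: one frame every 10 ms


def aligned_embeddings_at_time(outputs: dict[str, torch.Tensor], time_ms: float) -> dict[str, torch.Tensor]:
    """Slice every head's output at the same 100 Hz frame.

    `outputs` maps head names (voice_raw, voice_pcen, voice_fused, ...)
    to tensors of shape (frames, dim), all aligned to the same grid.
    """
    frame_idx = int(round(time_ms / 1000.0 * FRAME_RATE_HZ))  # 150 ms -> frame 15
    return {name: emb[frame_idx] for name, emb in outputs.items()}
```

At 100Hz each frame covers 10 ms, so time_ms=150 maps cleanly to frame 15 in every head.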
The voice path will learn:
- Formant patterns (from voice-only clips)
- Harmonic structures (from PCEN processing)
- Acoustic boundaries (from mixed vs voice comparison)
Proto-Syllable Discovery:
Temporal alignment enables:
- Onset/offset detection (voice vs mixed timing; see the sketch after this list)
- Coarticulation patterns (voice evolution in mixed)
- Prosodic structure (stress patterns in voice)
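To make the onset/offset idea concrete, here is a toy sketch; the cosine-similarity criterion and the threshold are my illustrative assumptions, not necessarily what the proto-learners will end up using.

```python
import torch
import torch.nn.functional as F


def detect_voice_onsets(voice_emb: torch.Tensor, mixed_emb: torch.Tensor, threshold: float = 0.5) -> list[int]:
    """Return frame indices where voice 'appears' in the mix.

    voice_emb, mixed_emb: (frames, dim) embeddings on the shared 100 Hz grid.
    A frame counts as voiced when the mixed embedding sits close to the
    voice embedding; onsets are the unvoiced -> voiced transitions.
    """
    similarity = F.cosine_similarity(voice_emb, mixed_emb, dim=-1)  # (frames,)
    voiced = similarity > threshold
    onsets = torch.nonzero(voiced[1:] & ~voiced[:-1]).squeeze(-1) + 1
    return onsets.tolist()
```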
Noise Robustness:
Critical for real-world deployment:
- Learns voice features that persist despite noise
- Develops noise-invariant representations
- Handles variable SNR conditions
Current status:
- Model Architecture: Streaming-ready with temporal alignment
- Training Pipeline: Progressive phases with proper state management
- Dataset: Optimal for self-learning voice/noise separation
- Downstream Ready: Aligned embeddings perfect for my proto-syllable learning
- Biological Inspiration: Mirrors auditory cortex development
Expected Training Outcomes (phase scheduling sketched after the list):
- Phase 1: Raw encoder learns transient patterns, onsets, FM sweeps
- Phase 2: PCEN encoder learns harmonic structures, formants, pitch
- Phase 3: Fusion layer integrates multi-scale voice representations
- Phase 4: Joint training optimizes full voice/noise disentanglement
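As a minimal sketch of that phase scheduling, assuming the model exposes raw_encoder, pcen_encoder, and fusion submodules (the names are placeholders for illustration, not the real module names):

```python
import torch.nn as nn


def set_trainable(module: nn.Module, trainable: bool) -> None:
    for p in module.parameters():
        p.requires_grad = trainable


def configure_phase(model: nn.Module, phase: int) -> None:
    """Freeze/unfreeze submodules per training phase (illustrative only)."""
    set_trainable(model.raw_encoder, phase in (1, 4))   # transients, onsets, FM sweeps
    set_trainable(model.pcen_encoder, phase in (2, 4))  # harmonics, formants, pitch
    set_trainable(model.fusion, phase in (3, 4))        # multi-scale integration
```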
Result: A streaming-capable model that separates voice from noise with aligned multi-scale embeddings, ready for downstream
proto-phoneme/syllable learning and other auditory cortex modules.
The one big downside of all these enhancements: my batch size has had to be reduced from 96 right down to 32 (even 48 hit the limits of my RTX 4090). This isn't helpful for the NT-Xent separation, which works better with larger batch sizes, and with the 1 million additional triplets in the dataset, a single epoch now takes close to 6 hours. I expect to need at least 60+ epochs per training phase, so it's going to be a long wait to test the outcome.
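For reference, a generic NT-Xent sketch (the standard formulation, not my exact loss code): with N positive pairs per batch, each anchor only sees 2N - 2 negatives, which is why dropping from 96 to 32 hurts the contrastive separation.

```python
import torch
import torch.nn.functional as F


def nt_xent(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """NT-Xent over a batch of positive pairs (z_a[i], z_b[i]), each of shape (N, dim)."""
    n = z_a.shape[0]
    z = F.normalize(torch.cat([z_a, z_b], dim=0), dim=-1)           # (2N, dim)
    sim = z @ z.t() / temperature                                    # (2N, 2N) similarity logits
    sim.fill_diagonal_(float("-inf"))                                # exclude self-similarity
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)]).to(z.device)  # positive per row
    return F.cross_entropy(sim, targets)
```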
Next steps: build a decoder to convert fused embeddings back into audio (this will be used for testing, and later for the model to learn to speak), build the instinct head, and update the proto learners to consume the new outputs from the disentangler model.
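One possible shape for that decoder is a simple transposed-convolution stack that upsamples 100Hz fused frames back to waveform samples; the embedding dimension, 16kHz target, and layer choices below are all assumptions at this stage.

```python
import torch
import torch.nn as nn


class FusedEmbeddingDecoder(nn.Module):
    """Upsample 100 Hz fused embeddings (B, T, D) back to a 16 kHz waveform."""

    def __init__(self, emb_dim: int = 256, channels: int = 128):
        super().__init__()
        # 100 Hz -> 16 kHz is a 160x upsampling: 4 * 4 * 10 = 160 samples per frame.
        self.net = nn.Sequential(
            nn.Conv1d(emb_dim, channels, kernel_size=3, padding=1),
            nn.GELU(),
            nn.ConvTranspose1d(channels, channels, kernel_size=8, stride=4, padding=2),
            nn.GELU(),
            nn.ConvTranspose1d(channels, channels, kernel_size=8, stride=4, padding=2),
            nn.GELU(),
            nn.ConvTranspose1d(channels, 1, kernel_size=20, stride=10, padding=5),
            nn.Tanh(),
        )

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        x = fused.transpose(1, 2)          # (B, D, T)
        return self.net(x).squeeze(1)      # (B, T * 160) waveform in [-1, 1]
```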