Update On The Disentangler Model
After training, I found a few issues with the disentangler model (designed to separate noise and voice for downstream consumption).
The model has multiple output heads, including raw embeddings for voice and noise, PCEN embeddings for voice and noise, and fused (raw + PCEN) embeddings for voice and noise.
There were timing issues with the embeddings from each head. I ran a number of tests on various approaches to eliminate, as far as possible, the latency on each head, and to ensure the final outputs were all aligned so downstream consumers can match embeddings across heads without any further latency.
This led to a few other refactors to improve the overall streaming architecture, aiming for the best possible throughput without sacrificing the quality needed for learning and downstream use. It also meant I needed to bolster my training dataset. Initially I was training on a customised triplet dataset of 1 million noise/voice/mixed 2-second triplet wav files, all in English. To make the model truly language agnostic, I added a further 1 million triplets from a variety of languages, with enough global phoneme variation to ensure the model didn't get too comfortable with English alone.
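As a rough illustration, here is a minimal sketch of how such triplets could be loaded; the file layout, naming scheme, sample rate, and class name are assumptions for illustration rather than my actual pipeline.

```python
from pathlib import Path

import torch
import torchaudio
from torch.utils.data import Dataset


class TripletWavDataset(Dataset):
    """Loads (noise, voice, mixed) 2-second wav triplets.

    Assumes each triplet shares a stem, e.g. 000123_noise.wav,
    000123_voice.wav, 000123_mixed.wav under one root directory.
    """

    def __init__(self, root: str, sample_rate: int = 16000, seconds: float = 2.0):
        self.root = Path(root)
        self.sample_rate = sample_rate
        self.num_samples = int(sample_rate * seconds)
        self.stems = sorted(p.name.rsplit("_", 1)[0] for p in self.root.glob("*_voice.wav"))

    def __len__(self) -> int:
        return len(self.stems)

    def _load(self, path: Path) -> torch.Tensor:
        wav, sr = torchaudio.load(path)
        if sr != self.sample_rate:
            wav = torchaudio.functional.resample(wav, sr, self.sample_rate)
        wav = wav.mean(dim=0)                 # mono
        wav = wav[: self.num_samples]         # clip to 2 s
        if wav.numel() < self.num_samples:    # pad short files
            wav = torch.nn.functional.pad(wav, (0, self.num_samples - wav.numel()))
        return wav

    def __getitem__(self, idx: int):
        stem = self.stems[idx]
        return {
            "noise": self._load(self.root / f"{stem}_noise.wav"),
            "voice": self._load(self.root / f"{stem}_voice.wav"),
            "mixed": self._load(self.root / f"{stem}_mixed.wav"),
        }
```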
Now the model has:
- Precise synchronization: All embeddings are guaranteed to represent the same time points
- 100Hz target resolution: Required for harmonic/prosodic learning
- Biological accuracy: Matches auditory cortex temporal processing
- Streaming compatibility: Seamlessly switches between training/inference modes
- Multi-scale features: Raw (transients) + PCEN (harmonics) + Fused (integrated)
- Time-specific access: Perfect for proto-syllable learning
Key Strengths:
Perfect for downstream proto-learning:
```python
aligned_emb = model.get_aligned_embeddings_at_time(audio, time_ms=150)
# Returns voice_raw, voice_pcen and voice_fused, all taken from exactly 150 ms
```
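For context, here is a hypothetical sketch of what a call like this could do internally, assuming every head emits frames on a shared 100Hz grid; the function and dictionary layout are illustrative, not the model's actual internals.

```python
import torch

FRAME_RATE_HZ = 100  # target temporal resolution: one frame every 10 ms


def aligned_embeddings_at_time(outputs: dict[str, torch.Tensor], time_ms: float) -> dict[str, torch.Tensor]:
    """Slice every head's output at the same 100 Hz frame.

    `outputs` maps head names (voice_raw, voice_pcen, voice_fused, ...)
    to tensors of shape (frames, dim), all aligned to the same grid.
    """
    frame_idx = int(round(time_ms / 1000.0 * FRAME_RATE_HZ))  # 150 ms -> frame 15
    return {name: emb[frame_idx] for name, emb in outputs.items()}
```

At 100Hz each frame covers 10 ms, so time_ms=150 maps cleanly to frame 15 in every head.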
The voice path will learn:
- Formant patterns (from voice-only clips)
- Harmonic structures (from PCEN processing)
- Acoustic boundaries (from mixed vs voice comparison)
Proto-Syllable Discovery:
Temporal alignment enables:
- Onset/offset detection (voice vs mixed timing; see the sketch after this list)
- Coarticulation patterns (voice evolution in mixed)
- Prosodic structure (stress patterns in voice)
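To make the onset/offset idea concrete, here is a toy sketch; the cosine-similarity criterion and the threshold are my illustrative assumptions, not necessarily what the proto-learners will end up using.

```python
import torch
import torch.nn.functional as F


def detect_voice_onsets(voice_emb: torch.Tensor, mixed_emb: torch.Tensor, threshold: float = 0.5) -> list[int]:
    """Return frame indices where voice 'appears' in the mix.

    voice_emb, mixed_emb: (frames, dim) embeddings on the shared 100 Hz grid.
    A frame counts as voiced when the mixed embedding sits close to the
    voice embedding; onsets are the unvoiced -> voiced transitions.
    """
    similarity = F.cosine_similarity(voice_emb, mixed_emb, dim=-1)  # (frames,)
    voiced = similarity > threshold
    onsets = torch.nonzero(voiced[1:] & ~voiced[:-1]).squeeze(-1) + 1
    return onsets.tolist()
```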
Noise Robustness:
Critical for real-world deployment:
- Learns voice features that persist despite noise
- Develops noise-invariant representations
- Handles variable SNR conditions
Current status:
- Model Architecture: Streaming-ready with temporal alignment
- Training Pipeline: Progressive phases with proper state management
- Dataset: Optimal for self-learning voice/noise separation
- Downstream Ready: Aligned embeddings perfect for my proto-syllable learning
- Biological Inspiration: Mirrors auditory cortex development
Expected Training Outcomes (phase scheduling sketched after the list):
- Phase 1: Raw encoder learns transient patterns, onsets, FM sweeps
- Phase 2: PCEN encoder learns harmonic structures, formants, pitch
- Phase 3: Fusion layer integrates multi-scale voice representations
- Phase 4: Joint training optimizes full voice/noise disentanglement
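As a minimal sketch of that phase scheduling, assuming the model exposes raw_encoder, pcen_encoder, and fusion submodules (the names are placeholders for illustration, not the real module names):

```python
import torch.nn as nn


def set_trainable(module: nn.Module, trainable: bool) -> None:
    for p in module.parameters():
        p.requires_grad = trainable


def configure_phase(model: nn.Module, phase: int) -> None:
    """Freeze/unfreeze submodules per training phase (illustrative only)."""
    set_trainable(model.raw_encoder, phase in (1, 4))   # transients, onsets, FM sweeps
    set_trainable(model.pcen_encoder, phase in (2, 4))  # harmonics, formants, pitch
    set_trainable(model.fusion, phase in (3, 4))        # multi-scale integration
```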
Result: A streaming-capable model that separates voice from noise with aligned multi-scale embeddings, ready for downstream
proto-phoneme/syllable learning and other auditory cortex modules.
The one big downside of all these enhancements: my batch size has had to be reduced from 96 right down to 32 (even 48 hit the limits of my RTX 4090). This isn't helpful for the NT-Xent separation, which works better with larger batch sizes, and with the 1 million additional triplets in the dataset, a single epoch now takes close to 6 hours. I expect to need at least 60+ epochs per training phase, so it's going to be a long wait to test the outcome.
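For reference, a generic NT-Xent sketch (the standard formulation, not my exact loss code): with N positive pairs per batch, each anchor only sees 2N - 2 negatives, which is why dropping from 96 to 32 hurts the contrastive separation.

```python
import torch
import torch.nn.functional as F


def nt_xent(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """NT-Xent over a batch of positive pairs (z_a[i], z_b[i]), each of shape (N, dim)."""
    n = z_a.shape[0]
    z = F.normalize(torch.cat([z_a, z_b], dim=0), dim=-1)           # (2N, dim)
    sim = z @ z.t() / temperature                                    # (2N, 2N) similarity logits
    sim.fill_diagonal_(float("-inf"))                                # exclude self-similarity
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)]).to(z.device)  # positive per row
    return F.cross_entropy(sim, targets)
```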
Next steps: build a decoder to convert fused embeddings back into audio (this will be used for testing, and later for the model to learn to speak), build the instinct head, and update the proto learners to consume the new outputs from the disentangler model.
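One possible shape for that decoder is a simple transposed-convolution stack that upsamples 100Hz fused frames back to waveform samples; the embedding dimension, 16kHz target, and layer choices below are all assumptions at this stage.

```python
import torch
import torch.nn as nn


class FusedEmbeddingDecoder(nn.Module):
    """Upsample 100 Hz fused embeddings (B, T, D) back to a 16 kHz waveform."""

    def __init__(self, emb_dim: int = 256, channels: int = 128):
        super().__init__()
        # 100 Hz -> 16 kHz is a 160x upsampling: 4 * 4 * 10 = 160 samples per frame.
        self.net = nn.Sequential(
            nn.Conv1d(emb_dim, channels, kernel_size=3, padding=1),
            nn.GELU(),
            nn.ConvTranspose1d(channels, channels, kernel_size=8, stride=4, padding=2),
            nn.GELU(),
            nn.ConvTranspose1d(channels, channels, kernel_size=8, stride=4, padding=2),
            nn.GELU(),
            nn.ConvTranspose1d(channels, 1, kernel_size=20, stride=10, padding=5),
            nn.Tanh(),
        )

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        x = fused.transpose(1, 2)          # (B, D, T)
        return self.net(x).squeeze(1)      # (B, T * 160) waveform in [-1, 1]
```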