Rebuilding Sound from its DNA

In my audio processing pipeline, I have two components that work together to transform messy, mixed audio into separate voice and noise streams.

The Disentangler is responsible for pulling the voice and the noise apart and breaking each down into a rich set of learned representations. The Decoder takes these representations and reassembles them into audio, producing a clean voice track and an isolated noise track that my AI can use for its own speech enhancement and for proto-phoneme/word/grammar learning and analysis. The Disentangler listens to input waveforms, separates the voice from the noise, and outputs three kinds of embeddings:

β€’ raw-path embeddings, which capture fine-grained, waveform-level detail
β€’ PCEN-path embeddings (Per-Channel Energy Normalization, sketched below), features that emphasize perceptually important changes
β€’ fused embeddings, learned blends of the raw and PCEN streams, optimized for clarity and separation
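
To make the PCEN path concrete, here's a minimal sketch of a classic PCEN front-end using librosa. This only illustrates what PCEN does, not my actual encoder; the file name, mel settings, and all parameter values below are illustrative assumptions.

```python
import librosa

# Hedged sketch of a classic PCEN front-end, to show what the PCEN path
# emphasizes. NOT the pipeline's actual encoder: file name, mel settings,
# and PCEN parameters are illustrative assumptions.
SR = 16000
HOP = 160  # 10 ms hop at 16 kHz -> ~100 Hz frame rate (the "PCEN grid" in the diagram)

y, _ = librosa.load("mixture.wav", sr=SR)  # hypothetical input file

# Mel magnitude spectrogram as the base time-frequency representation
S = librosa.feature.melspectrogram(y=y, sr=SR, n_fft=512, hop_length=HOP, n_mels=64)

# Per-Channel Energy Normalization: a smoothed per-channel AGC stage followed
# by root compression; it suppresses slow gain drift and highlights onsets/changes.
pcen_feats = librosa.pcen(S, sr=SR, hop_length=HOP,
                          gain=0.98, bias=2.0, power=0.5, time_constant=0.4)

print(pcen_feats.shape)  # (n_mels, n_frames) ~ (64, len(y) // HOP + 1)
```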

Each of these streams is aligned in time, so their information lines up frame-by-frame. For example, the voice_fused output is a gated, learned mixture of the raw and PCEN voice features, producing a balanced representation that's robust to varying recording conditions. Similarly, the noise path benefits from its own fusion of representations.
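
Here's a minimal sketch of such a gated blend in PyTorch. The class and variable names are illustrative, not my actual code; the shapes follow the diagram further down ([B, 128, Tp]).

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Hedged sketch of a learned mixing gate: a 1x1 convolution looks at the
    concatenated raw+PCEN streams and emits a per-channel, per-frame gate
    that mixes the two."""

    def __init__(self, channels: int = 128):
        super().__init__()
        self.gate = nn.Conv1d(2 * channels, channels, kernel_size=1)

    def forward(self, raw: torch.Tensor, pcen: torch.Tensor) -> torch.Tensor:
        # raw, pcen: [B, 128, Tp], already projected and time-aligned
        g = torch.sigmoid(self.gate(torch.cat([raw, pcen], dim=1)))  # [B, 128, Tp]
        return g * raw + (1.0 - g) * pcen  # learned mixture, [B, 128, Tp]

fuse = GatedFusion(128)
voice_fused = fuse(torch.randn(2, 128, 200), torch.randn(2, 128, 200))  # [2, 128, 200]
```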

This multitude of embeddings means my downstream models aren't relying on a single viewpoint: the model captures complementary perspectives on the same signal, making it far more resistant to distortion, background clutter, and domain shifts.

The Decoder works with these embeddings. First, my Multi-Embedding Conditioning projects each embedding type (fused, raw, PCEN, context) into a shared hidden space, learns importance weights for each type, and applies attention across embedding types to let them inform each other before merging. The result is a conditioned embedding stream that's optimized for waveform synthesis (a sketch follows below). This is followed by Causal Reconstruction: the conditioned sequence passes through causal temporal convolution blocks for sequential modelling without peeking into the future, then a learned basis renderer, which maps each frame to a set of basis waveforms, and finally an overlap-add system that stitches those frames back into a continuous waveform with consistent phase alignment and smoothness.
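
Here's a minimal sketch of what such a conditioner could look like. The dimensions, the single attention layer, and all names are assumptions for illustration, not my exact implementation.

```python
import torch
import torch.nn as nn

class MultiEmbeddingConditioner(nn.Module):
    """Hedged sketch of the conditioning step: per-type projections into a
    shared hidden space, learned per-type importance weights, and attention
    across embedding types at each frame."""

    def __init__(self, dims: dict, d_hid: int = 256, n_heads: int = 4):
        super().__init__()
        self.types = list(dims)  # e.g. "fused", "raw", "pcen" (plus "context")
        self.proj = nn.ModuleDict({k: nn.Linear(d, d_hid) for k, d in dims.items()})
        self.importance = nn.Parameter(torch.zeros(len(dims)))  # softmax-ed below
        self.attn = nn.MultiheadAttention(d_hid, n_heads, batch_first=True)

    def forward(self, embs: dict) -> torch.Tensor:
        # embs[k]: [B, T, D_k]; all streams are frame-aligned
        x = torch.stack([self.proj[k](embs[k]) for k in self.types], dim=2)  # [B,T,K,D]
        B, T, K, D = x.shape
        x = x.reshape(B * T, K, D)
        x, _ = self.attn(x, x, x)  # the K embedding types inform each other per frame
        x = x.reshape(B, T, K, D)
        w = torch.softmax(self.importance, dim=0)   # learned trust per type
        return (w.view(1, 1, K, 1) * x).sum(dim=2)  # z: [B, T, D_hid]

cond = MultiEmbeddingConditioner({"fused": 128, "raw": 128, "pcen": 128})
z = cond({k: torch.randn(2, 200, 128) for k in ("fused", "raw", "pcen")})  # [2, 200, 256]
```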

The DualDecoderSystem runs two decoders in parallel: a Voice Decoder, which reconstructs the clean speech waveform and predicts pitch (F0) and loudness per frame, and a Noise Decoder, which reconstructs the isolated noise waveform (without feature heads).
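
Both decoders end in the basis-renderer and overlap-add stage described above. Here's a hedged sketch of that frame-to-waveform step; n_basis, the tanh squashing of the coefficients, and the explicit COLA normalization are assumptions, and the diagram's DC high-pass is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasisRendererOLA(nn.Module):
    """Hedged sketch of the frame-to-waveform stage: each conditioned frame
    selects a mixture of learned 25 ms basis waveforms, and Hann-windowed
    overlap-add stitches the frames into continuous audio."""

    def __init__(self, d_hid: int = 256, n_basis: int = 64,
                 frame: int = 400, hop: int = 160):  # 25 ms frames, 10 ms hop @ 16 kHz
        super().__init__()
        self.frame, self.hop = frame, hop
        self.coeffs = nn.Linear(d_hid, n_basis)  # per-frame basis weights
        self.basis = nn.Parameter(torch.randn(n_basis, frame) * 0.01)
        self.register_buffer("window", torch.hann_window(frame))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: [B, T, D_hid]  ->  windowed frames: [B, T, frame]
        frames = torch.tanh(self.coeffs(z)) @ self.basis * self.window
        B, T, Fr = frames.shape
        out_len = (T - 1) * self.hop + Fr
        # Overlap-add via fold (frames become columns of a 1-wide "image")
        wav = F.fold(frames.transpose(1, 2), output_size=(1, out_len),
                     kernel_size=(1, Fr), stride=(1, self.hop))
        # COLA normalization: divide by the summed window so overlapping
        # frames average smoothly instead of piling up
        norm = F.fold(self.window.expand(1, T, Fr).transpose(1, 2),
                      output_size=(1, out_len), kernel_size=(1, Fr),
                      stride=(1, self.hop))
        return (wav / norm.clamp_min(1e-8)).reshape(B, out_len)  # [B, T_aud]

ola = BasisRendererOLA()
wav = ola(torch.randn(2, 200, 256))  # ~2 s of audio per item at 16 kHz
```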

A single embedding stream can capture only a slice of the signal's complexity. My approach uses multiple synchronized streams, each trained to emphasize a different aspect of the signal: the raw path for detail and microstructure, the PCEN path for perceptual dynamics and stability, and the fused path for an optimal learned combination for the target domain. Because the conditioner learns both how much to trust each embedding and how to combine them, it can adapt to real-world variation: if the raw path is noisy, it can lean more on PCEN; if the PCEN path loses fine texture, it can draw from the raw stream; and the fused path provides a balanced "best guess" when both are reliable. This makes the system resilient to changes in microphone type, environment, and background noise.

Here's my attempt at another ASCII diagram…


                          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ Input: Mixed Audio (16 kHz) ─────────────────────────────────┐
                          β”‚                                                                                             β”‚
                          β”‚                                          DISENTANGLER                                       β”‚
                          β”‚                                                                                             β”‚
                          β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ Raw Encoder (causal) ────────────────┐   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€ PCEN Encoder ────────┐│
                          β”‚   β”‚                                                     β”‚   β”‚ (Per-Channel Energy Norm)    β”‚β”‚
                          β”‚   β”‚  voice_raw_feats  β†’ [B, C, T_raw]                   β”‚   β”‚ voice_pcen_feats β†’ [B, C, Tp]β”‚β”‚
                          β”‚   β”‚  noise_raw_feats  β†’ [B, C, T_raw]                   β”‚   β”‚ noise_pcen_feats β†’ [B, C, Tp]β”‚β”‚
                          β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜β”‚
                          β”‚                                                                                             β”‚
                          β”‚   Temporal Alignment β†’ downsample raw_* to PCEN grid (β‰ˆ100 Hz; 10 ms hop)                   β”‚
                          β”‚                                                                                             β”‚
                          β”‚   Per-branch (voice and noise) 1Γ—1 Projections (to 128 ch)                                  β”‚
                          β”‚    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
                          β”‚    β”‚ voice_raw  [B,128,Tp]β”‚     β”‚ voice_pcen [B,128,Tp]β”‚     β”‚ voice_fused_in [B,256,Tp]  β”‚ β”‚
                          β”‚    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
                          β”‚               β”‚                           β”‚                            cat β”‚                β”‚
                          β”‚               └────────────  gated blend β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                β”‚
                          β”‚                              (learned mixing gate)                                          β”‚
                          β”‚                                      β”‚                                                      β”‚
                          β”‚                         voice_fused [B,128,Tp]       noise_fused [B,128,Tp]                 β”‚
                          β”‚                                      β”‚                         β”‚                            β”‚
                          β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                                                 β”‚                         β”‚
                                                                 β”‚                         β”‚
                                                                 β–Ό                         β–Ό
                                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                                    β”‚                                 DECODER SYSTEM                               β”‚
                                    β”‚                                                                              β”‚
                                    β”‚                         (Two parallel decoders)                              β”‚
                                    β”‚         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
                                    β”‚         β”‚       VOICE DECODER     β”‚                 β”‚     NOISE DECODER    β”‚ β”‚
                                    β”‚         β”‚  (predicts F0 & loud.)  β”‚                 β”‚   (no feature heads) β”‚ β”‚
                                    β”‚         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                 β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
                                    β”‚                     β”‚                                           β”‚            β”‚
                                    β”‚     Multi-Embedding Conditioner                                 β”‚            β”‚
                                    β”‚   (projects + fuses: fused/raw/pcen                             β”‚            β”‚
                                    β”‚    with learned importance and optional attention)              β”‚            β”‚
                                    β”‚                     β”‚                                           β”‚            β”‚
                                    β”‚               z: [B,T,D_hid]                                    β”‚            β”‚
                                    β”‚                     β”‚                                           β”‚            β”‚
                                    β”‚       Causal Adapter (stack of TCN blocks)                      β”‚            β”‚
                                    β”‚                     β”‚                                           β”‚            β”‚
                                    β”‚         Basis Renderer (n_basis Γ— 25 ms)                        β”‚            β”‚
                                    β”‚                     β”‚                                           β”‚            β”‚
                                    β”‚        Overlap-Add (Hann, COLA-norm, DC-high-pass)              β”‚            β”‚
                                    β”‚             β”Œβ”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”                           β”Œβ”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”    β”‚
                                    β”‚             β”‚ Voice waveform β”‚                           β”‚ Noise wave.  β”‚    β”‚
                                    β”‚             β”‚   [B, T_aud]   β”‚                           β”‚  [B, T_aud]  β”‚    β”‚
                                     β”‚             β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜                           └──────┬───────┘    β”‚
                                    β”‚                     β”‚                                           β”‚            β”‚
                                    β”‚     Aux heads:  F0 [B,Tf], Loudness [B,Tf]                      β”‚            β”‚
                                    β”‚                                                                              β”‚
                                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                                            β”‚
                                                            β–Ό
                                   training losses & constraints (for stability & quality):
                                   β€’ Multi-Resolution STFT (voice/noise)          β€’ SI-SDR (voice/noise)
                                   β€’ Mixture consistency:  voice + noise β‰ˆ mixture (STFT + L1 time-domain)
                                   β€’ Speech-leakage penalty on noise (300–3400 Hz band energy ratio)
                                   β€’ Spectral orthogonality (per-frame cosine similarity of mags)
                                   β€’ Basis health (L2 & decorrelation)            β€’ Energy stability vs loudness
                                   β€’ Gentle tanh-safety & frame-energy ramp (flicker control)
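
For flavor, here's a minimal sketch of two of these terms: the multi-resolution STFT loss and the mixture-consistency loss. The FFT sizes, hops, log-floor, and equal weighting are illustrative assumptions, not my exact settings.

```python
import torch

def stft_mag(x: torch.Tensor, n_fft: int, hop: int) -> torch.Tensor:
    # Magnitude STFT of batched waveforms x: [B, T]
    win = torch.hann_window(n_fft, device=x.device)
    return torch.stft(x, n_fft, hop_length=hop, window=win, return_complex=True).abs()

def multires_stft_loss(est: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
    """Spectral convergence + log-magnitude L1, summed over several FFT sizes."""
    loss = est.new_zeros(())
    for n_fft, hop in ((512, 128), (1024, 256), (2048, 512)):  # illustrative resolutions
        E, R = stft_mag(est, n_fft, hop), stft_mag(ref, n_fft, hop)
        loss = loss + (E - R).norm() / R.norm().clamp_min(1e-8)  # spectral convergence
        loss = loss + (E.clamp_min(1e-5).log() - R.clamp_min(1e-5).log()).abs().mean()
    return loss

def mixture_consistency_loss(voice, noise, mixture):
    """voice + noise should re-sum to the input mixture (L1 in time + STFT term)."""
    resum = voice + noise
    return (resum - mixture).abs().mean() + multires_stft_loss(resum, mixture)

mix = torch.randn(2, 16000)
print(mixture_consistency_loss(torch.randn(2, 16000), torch.randn(2, 16000), mix))
```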