Disentangler

Disentangler Model Architecture

The Disentangler Model is a specialized audio source separation system designed for AGI-compatible self-supervised learning with optional supervised guidance. It uses dedicated voice and noise encoders to extract rich, domain-specific features for high-quality audio separation.

🎯 Architecture Overview

Core Philosophy

📊 Model Specifications

Audio Parameters

Embedding Dimensions

Voice Embeddings: 160-dim total
├── Pitch Features:     32-dim (F0, harmonic tracking)
├── Harmonic Features:  32-dim (harmonic template matching)
├── Formant Features:   32-dim (vocal tract resonances)
├── VAD Features:       16-dim (voice activity detection)
└── Spectral Features:  48-dim (multi-scale voice patterns)

Noise Embeddings: 144-dim total
├── Broadband Features:     32-dim (non-harmonic spectral)
├── Transient Features:     24-dim (onset/impact detection)
├── Environmental Features: 28-dim (environmental sounds)
├── Texture Features:       20-dim (noise pattern analysis)
├── Non-harmonic Features:  24-dim (non-periodic content)
└── Statistical Features:   16-dim (temporal statistics)
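The component dimensions above can be sanity-checked with a short sketch. The dictionary keys below are illustrative labels taken from the tables, not the model's actual attribute names; the per-component features are concatenated along the feature axis to form the full embedding:

```python
import numpy as np

# Illustrative per-component dimensions, copied from the tables above
VOICE_COMPONENTS = {"pitch": 32, "harmonic": 32, "formant": 32,
                    "vad": 16, "spectral": 48}
NOISE_COMPONENTS = {"broadband": 32, "transient": 24, "environmental": 28,
                    "texture": 20, "nonharmonic": 24, "statistical": 16}

frames = 100  # e.g. one second of audio at a 10 ms hop
voice_parts = [np.zeros((dim, frames)) for dim in VOICE_COMPONENTS.values()]
voice_embedding = np.concatenate(voice_parts, axis=0)  # stack features per frame

print(voice_embedding.shape)           # (160, 100)
print(sum(NOISE_COMPONENTS.values()))  # 144
```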
 

šŸ—ļø Encoder Architecture

Batch vs Streaming Variants

Batch Encoders (Default - streaming_mode=False)

Streaming Encoders (streaming_mode=True)

Voice Encoder Technologies

1. Pitch Extraction (32-dim)

2. Harmonic Analysis (32-dim)

3. Formant Analysis (32-dim)

4. Voice Activity Detection (16-dim)

5. Multi-scale Spectral Analysis (48-dim)

Noise Encoder Technologies

1. Broadband Analysis (32-dim)

2. Transient Detection (24-dim)

3. Environmental Sound Analysis (28-dim)

4. Texture Analysis (20-dim)

5. Non-harmonic Content (24-dim)

6. Statistical Analysis (16-dim)

🎵 Audio Reconstruction

Reconstruction Networks

Reconstruction Outputs

voice_reconstruction: [B, 1, T] - Reconstructed clean voice
noise_reconstruction: [B, 1, T] - Reconstructed clean noise  
mixed_reconstruction: [B, 1, T] - Sum of voice + noise (should equal input)
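Because the mixed reconstruction is defined as the sum of the two separated streams, mixture consistency can be checked directly against the input. A minimal sketch (the function name is illustrative, not part of the model's API):

```python
import numpy as np

def mixture_consistency_error(voice_rec, noise_rec, mixed_input):
    """Mean squared residual between the input mixture and voice + noise.

    Should be near zero when the separation is energy-complete."""
    residual = mixed_input - (voice_rec + noise_rec)
    return float(np.mean(residual ** 2))
```

For a perfectly consistent separation, `mixture_consistency_error(v, n, v + n)` returns exactly `0.0`.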
 

🔄 Training Approach

Hybrid Loss Function (AGI-Compatible)

Total Loss Weights:
├── NTXENT (Contrastive):     50% - Primary disentanglement signal
├── BYOL (Self-supervised):   30% - Robust representation learning
└── Supervised Guidance:      20% - Audio quality bootstrap
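As a sketch, the weighted combination reads as follows (names are illustrative; the actual implementation lives in enhanced_loss_functions.py):

```python
def hybrid_loss(ntxent, byol, supervised, weights=(0.5, 0.3, 0.2)):
    """Weighted sum of the three training signals; weights sum to 1.0."""
    w_nt, w_byol, w_sup = weights
    return w_nt * ntxent + w_byol * byol + w_sup * supervised
```

With the default weights, 80% of the gradient signal (NTXENT + BYOL) is self-supervised, matching the AGI-compatibility goal stated above.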
 

Enhanced Loss Components

  1. NTXENT Contrastive Loss: TRUE positive embeddings (optimized path)
    • Separate voice/noise NT-Xent instances with memory banks
    • Pre-computed positive embeddings with gradient flow
  2. BYOL Self-supervised: EMA target encoders (prevents collapse)
    • Voice/noise target encoders with τ=0.999 momentum
    • Stop-gradient on target side for proper BYOL learning
  3. Supervised Reconstruction: Multi-component reconstruction loss
    • Time-domain MSE loss for accuracy
    • Multi-resolution STFT loss (5 scales: 256-4096 FFT)
    • Spectral convergence loss for magnitude consistency
    • High-frequency emphasis (4-8kHz weighted 4-6x)
  4. Separation Loss: Barlow Twins style decorrelation
    • Cross-correlation minimization between voice/noise embeddings
    • Bounded variance maintenance (prevents collapse)
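The EMA target update in component 2 can be sketched as a plain momentum average. This is the generic BYOL-style update rule, not the repository's exact code:

```python
def ema_update(target_params, online_params, tau=0.999):
    """BYOL-style momentum update: target <- tau * target + (1 - tau) * online.

    The target side receives no gradients; it only tracks the online encoder,
    which is what prevents representation collapse."""
    return [tau * t + (1.0 - tau) * o
            for t, o in zip(target_params, online_params)]
```

With τ=0.999 the target moves roughly 0.1% of the way toward the online weights per step, giving a slowly evolving, stable regression target.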

šŸ“ File Structure

disentangler/
├── disentangler_model.py           # Main model class
├── encoders/
│   ├── __init__.py                  # Encoder exports
│   ├── voice_encoder.py             # Streaming voice encoder
│   ├── noise_encoder.py             # Streaming noise encoder
│   ├── batch_voice_encoder.py       # Batch voice encoder (default)
│   └── batch_noise_encoder.py       # Batch noise encoder (default)
├── ../../training/disentangler/
│   ├── enhanced_loss_functions.py   # Hybrid loss implementation
│   └── train_disentangler_sequential.py  # Training script
└── configs/
    └── phase_hybrid_supervised.yaml # Training configuration
 

🚀 Usage Examples

Basic Model Initialization

from disentangler_model import DisentanglerModel

# Default: Batch encoders for best quality
model = DisentanglerModel(
    streaming_mode=False,  # Use batch encoders (default)
    voice_embedding_dim=160,
    noise_embedding_dim=144,
    enable_reconstruction=True
)

# Streaming: For real-time processing  
streaming_model = DisentanglerModel(
    streaming_mode=True,   # Use streaming encoders
    voice_embedding_dim=160,
    noise_embedding_dim=144,
    enable_reconstruction=False  # Typically disabled for streaming
)
 

Forward Pass

import torch

# Input: [batch_size, 1, time_samples]
mixed_audio = torch.randn(4, 1, 16000)  # 4 samples, 1 second each

# Forward pass
outputs = model(mixed_audio, return_reconstructions=True)

# Outputs:
# - voice_embeddings: [4, 160, 100]  # 160-dim voice features, 100 frames
# - noise_embeddings: [4, 144, 100]  # 144-dim noise features, 100 frames  
# - voice_reconstruction: [4, 1, 16000]  # Reconstructed clean voice
# - noise_reconstruction: [4, 1, 16000]  # Reconstructed clean noise
# - Individual feature components for analysis
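The 100-frame output for 16,000 input samples implies a 10 ms analysis hop (160 samples at 16 kHz). This hop length is an assumption inferred from the shapes above, not a documented constant:

```python
SAMPLE_RATE = 16_000
HOP_LENGTH = 160  # assumed 10 ms hop, inferred from 16000 samples -> 100 frames

def embedding_frames(num_samples, hop=HOP_LENGTH):
    """Number of embedding frames produced for a given input length."""
    return num_samples // hop
```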
 

Training Configuration

# Training uses batch encoders automatically
trainer = DisentanglerTrainer(
    model=model,
    hybrid_loss_weights={
        'ntxent_weight': 0.5,      # Primary: Contrastive learning
        'byol_weight': 0.3,        # Primary: Self-supervised  
        'supervised_weight': 0.2   # Secondary: Clean target guidance
    }
)
 

📈 Performance Characteristics & Monitoring

Computational Requirements

Key Training Metrics (TensorBoard)

Monitor these for disentanglement quality:

Quality vs Speed Trade-offs

Numerical Stability Features

🎯 Design Goals Achieved

✅ AGI Compatibility: Strong self-supervised learning (80% of loss)
✅ Voice-Specific Processing: Dedicated pitch, harmonic, formant analysis
✅ Noise-Specific Processing: Environmental, transient, texture analysis
✅ Multiple Views: Batch vs streaming for different deployment needs
✅ Rich Features: No dimension compression, preserve all information
✅ Supervised Bootstrap: Clean targets improve quality without compromising self-learning
✅ Temporal Modeling: Enhanced batch encoders with better temporal detail