Decoder

Flexible Multi-Decoder System for Audio Source Separation

The Flexible Multi-Decoder System leverages the rich embeddings produced by the specialized disentangler model to reconstruct high-quality separated audio for any number of signal types. The architecture supports voice/noise separation out of the box and extends readily to other sources, such as musical instruments (trumpet, violin, and so on).

🎯 Architecture Overview

Core Philosophy

📊 Input Features

Voice Features (from Enhanced Disentangler)

voice_embeddings: 160-dim (main aggregated features)
├── voice_pitch_features: 32-dim (F0 and harmonic tracking)
├── voice_harmonic_features: 32-dim (harmonic template matching)
├── voice_formant_features: 32-dim (vocal tract resonances)
├── voice_vad_features: 16-dim (voice activity detection)
└── voice_spectral_features: 48-dim (multi-scale voice patterns)
 

Noise Features (from Enhanced Disentangler)

noise_embeddings: 144-dim (main aggregated features)
├── noise_broadband_features: 32-dim (non-harmonic spectral)
├── noise_transient_features: 24-dim (onset/impact detection)
├── noise_environmental_features: 28-dim (environmental sounds)
├── noise_texture_features: 20-dim (noise pattern analysis)
├── noise_nonharmonic_features: 24-dim (non-periodic content)
└── noise_statistical_features: 16-dim (temporal statistics)
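
The sub-feature widths sum exactly to the main embedding sizes (32+32+32+16+48 = 160 for voice, 32+24+28+20+24+16 = 144 for noise). A quick sanity check, assuming the aggregation is plain concatenation (the disentangler may instead use a learned projection):

import torch

T = 200  # number of frames (illustrative)

# Sub-feature widths in the order listed above
voice_parts = [torch.randn(T, d) for d in (32, 32, 32, 16, 48)]
noise_parts = [torch.randn(T, d) for d in (32, 24, 28, 20, 24, 16)]

voice_embeddings = torch.cat(voice_parts, dim=-1)
noise_embeddings = torch.cat(noise_parts, dim=-1)

assert voice_embeddings.shape == (T, 160)  # 32+32+32+16+48
assert noise_embeddings.shape == (T, 144)  # 32+24+28+20+24+16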
 

šŸ—ļø Decoder Architecture

VoiceFeatureConditioner

NoiseFeatureConditioner

CausalAdapter

EnhancedBasisRenderer

🎵 Audio Processing

Frame Rendering

Overlap-Add (OLA)
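
For reference, a minimal overlap-add sketch. The 80-sample hop matches the chunk_size * 80 output size in the streaming example below; the frame length and the Hann synthesis window here are assumptions, not the renderer's actual settings.

import torch

def overlap_add(frames: torch.Tensor, hop: int = 80) -> torch.Tensor:
    # frames: [num_frames, frame_len]; a hop of 80 samples is 5 ms at 16 kHz
    num_frames, frame_len = frames.shape
    window = torch.hann_window(frame_len)
    out = torch.zeros(hop * (num_frames - 1) + frame_len)
    for i in range(num_frames):
        start = i * hop
        # Windowed frames are summed where they overlap
        out[start:start + frame_len] += frames[i] * window
    return out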

šŸ“ File Structure

models/decoder/
├── enhanced_decoder.py           # Main decoder architecture
├── streaming_decoder.py          # Streaming interfaces
├── README.md                     # This file
└── conditioners/                 # Existing conditioners (legacy)
    ├── hf_aware_conditioner.py
    ├── noise_conditioner.py
    └── ...

dataset_prep/decoder/
└── create_enhanced_decoder_dataset.py  # Enhanced dataset creation
 

🚀 Usage Examples

Creating the System

from enhanced_decoder import create_enhanced_decoder_system
from streaming_decoder import create_streaming_decoder

# Create lightweight system (~6.7M parameters)
system = create_enhanced_decoder_system(lightweight=True)

# Create unified interface (batch + streaming)
decoder = create_streaming_decoder(system, mode='unified')
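
As a quick check of the advertised footprint, assuming system is a standard torch.nn.Module (an assumption based on the PyTorch usage elsewhere in this README):

# Sanity-check the parameter count of the lightweight system
num_params = sum(p.numel() for p in system.parameters())
print(f'{num_params / 1e6:.1f}M parameters')  # expect roughly 6.7M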
 

Batch Processing (Training)

# Set to batch mode for training
decoder.set_mode('batch')

# Process full sequences
outputs = decoder.process(voice_embeddings, noise_embeddings)
# Returns: voice_waveform, noise_waveform, voice_f0, voice_loudness, etc.
 

Streaming Processing (Real-time)

# Set to streaming mode
decoder.set_mode('streaming')

# Process frame-by-frame
for chunk in embedding_stream:
    voice_chunk = chunk['voice_embeddings']  # [1, chunk_size, dims]
    noise_chunk = chunk['noise_embeddings']  # [1, chunk_size, dims]
    
    outputs = decoder.process(voice_chunk, noise_chunk, return_auxiliary=True)
    # Returns: voice_audio, noise_audio, mixture_audio + auxiliary features
    
    # Real-time audio output
    play_audio(outputs['voice_audio'])  # [chunk_size * 80] samples
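
For real-time operation, each chunk must be processed faster than the audio it spans. With 80 samples per frame (from the output size above) and the 16 kHz rate from the pipeline diagram, the per-chunk budget is:

SAMPLE_RATE = 16_000  # Hz, from the pipeline diagram
HOP = 80              # samples per frame (chunk output is chunk_size * 80)

def chunk_budget_ms(chunk_size: int) -> float:
    # Audio duration covered by one chunk; processing must stay under this
    return chunk_size * HOP / SAMPLE_RATE * 1000.0

print(chunk_budget_ms(16))  # 80.0 ms for a 16-frame chunk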
 

Dataset Creation

# Create enhanced dataset with specialized embeddings
python dataset_prep/decoder/create_enhanced_decoder_dataset.py \
    --data_root /path/to/triplet/data \
    --checkpoint_path /path/to/disentangler/checkpoint.pt \
    --output_dir /path/to/enhanced_decoder_dataset \
    --batch_size 144 \
    --max_samples 100000
 

📈 Performance Characteristics

Model Size

Latency (Streaming Mode)

Quality Metrics (Expected)

🔧 Training Configuration

Enhanced Dataset Features

# Each sample contains rich embeddings
sample = {
    # Main embeddings
    'voice_embeddings': torch.Tensor,      # [T, 160] 
    'noise_embeddings': torch.Tensor,      # [T, 144]
    
    # Individual voice features  
    'voice_pitch_features': torch.Tensor,  # [T, 32]
    'voice_harmonic_features': torch.Tensor, # [T, 32]
    'voice_formant_features': torch.Tensor, # [T, 32]
    'voice_vad_features': torch.Tensor,     # [T, 16]
    'voice_spectral_features': torch.Tensor, # [T, 48]
    
    # Individual noise features
    'noise_broadband_features': torch.Tensor,    # [T, 32]
    'noise_transient_features': torch.Tensor,    # [T, 24]
    'noise_environmental_features': torch.Tensor, # [T, 28]
    'noise_texture_features': torch.Tensor,      # [T, 20]
    'noise_nonharmonic_features': torch.Tensor,  # [T, 24]
    'noise_statistical_features': torch.Tensor,  # [T, 16]
    
    # Target audio
    'voice': torch.Tensor,     # [32000] samples (2 seconds)
    'noise': torch.Tensor,     # [32000] samples  
    'mixture': torch.Tensor,   # [32000] samples
    
    # Auxiliary features
    'f0': torch.Tensor,        # [T] F0 values
    'loudness': torch.Tensor,  # [T] RMS loudness
}
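
A minimal loading sketch, assuming create_enhanced_decoder_dataset.py writes one such dictionary per sample as a .pt file (the actual on-disk layout may differ):

from pathlib import Path

import torch
from torch.utils.data import Dataset

class EnhancedDecoderDataset(Dataset):
    # Loads the per-sample dictionaries described above
    def __init__(self, root):
        self.paths = sorted(Path(root).glob('*.pt'))

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        return torch.load(self.paths[idx])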
 

Training Recommendations

# Recommended training config
config = {
    'batch_size': 16,           # Fits in RTX 4090 24GB
    'learning_rate': 1e-4,      # Conservative for stability
    'scheduler': 'cosine',      # Smooth LR decay
    'max_epochs': 100,
    'patience': 15,
    
    # Loss weights
    'reconstruction_weight': 1.0,     # Main reconstruction loss
    'spectral_weight': 0.5,           # Multi-scale STFT loss
    'auxiliary_weight': 0.1,          # F0/loudness prediction
    'consistency_weight': 0.2,        # voice + noise = mixture
}
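
The four weights map onto four loss terms. A sketch of how they might combine, assuming L1 distances throughout (an assumption) and the output/batch keys shown earlier in this README:

import torch
import torch.nn.functional as F

def multi_scale_stft_loss(pred, target, fft_sizes=(512, 1024, 2048)):
    # L1 distance between magnitude spectrograms at several resolutions;
    # the FFT sizes are assumptions, not the project's actual settings
    loss = 0.0
    for n_fft in fft_sizes:
        window = torch.hann_window(n_fft, device=pred.device)
        mag_p = torch.stft(pred, n_fft, hop_length=n_fft // 4,
                           window=window, return_complex=True).abs()
        mag_t = torch.stft(target, n_fft, hop_length=n_fft // 4,
                           window=window, return_complex=True).abs()
        loss = loss + F.l1_loss(mag_p, mag_t)
    return loss / len(fft_sizes)

def total_loss(outputs, batch, config):
    # Weighted sum mirroring the loss weights in the config above
    rec = (F.l1_loss(outputs['voice_waveform'], batch['voice'])
           + F.l1_loss(outputs['noise_waveform'], batch['noise']))
    spec = (multi_scale_stft_loss(outputs['voice_waveform'], batch['voice'])
            + multi_scale_stft_loss(outputs['noise_waveform'], batch['noise']))
    aux = (F.l1_loss(outputs['voice_f0'], batch['f0'])
           + F.l1_loss(outputs['voice_loudness'], batch['loudness']))
    cons = F.l1_loss(outputs['voice_waveform'] + outputs['noise_waveform'],
                     batch['mixture'])
    return (config['reconstruction_weight'] * rec
            + config['spectral_weight'] * spec
            + config['auxiliary_weight'] * aux
            + config['consistency_weight'] * cons)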
 

šŸŽ›ļø Advanced Features

Modular Head Support

The architecture is designed so that new heads for additional sound sources can be added with minimal changes:

# Example: add an instrument decoder head
class InstrumentDecoder(EnhancedVoiceDecoder):
    def __init__(self, instrument_embeddings_config):
        # Reuse the voice decoder backbone, configured for
        # instrument-specific embedding dimensions
        super().__init__(**instrument_embeddings_config)

# Register the new head on the system
system.instrument_decoder = InstrumentDecoder(config)
 

Streaming Optimizations

Quality Enhancements

🔄 Integration with Disentangler

Pipeline Flow

Raw Audio [16kHz, mono]
         ↓
Enhanced Disentangler (specialized encoders)
         ↓
Rich Embeddings (voice: 160-dim, noise: 144-dim + features)
         ↓
Enhanced Decoder System
         ↓
Reconstructed Audio [voice, noise, mixture]
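
In code, the pipeline reduces to a few calls (a hypothetical sketch; disentangler stands in for the enhanced disentangler's inference interface, which this README does not specify):

# mixture_16k: [1, num_samples] mono waveform at 16 kHz
embeddings = disentangler(mixture_16k)  # rich embeddings + sub-features

outputs = decoder.process(
    embeddings['voice_embeddings'],     # [1, T, 160]
    embeddings['noise_embeddings'],     # [1, T, 144]
)
voice = outputs['voice_waveform']
noise = outputs['noise_waveform']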
 

Compatible Checkpoints

šŸŽÆ Next Steps

  1. Dataset Creation: Run create_enhanced_decoder_dataset.py with a trained disentangler checkpoint
  2. Training Setup: Implement training loop with multi-component losses
  3. Evaluation: Compare against previous decoder on separation metrics
  4. Optimization: Fine-tune basis initialization and conditioning strategies
  5. Deployment: Test streaming performance on target hardware