Visual Cortex
The Visual Cortex is an experimental, biologically inspired computer vision system designed for the TABULA project. It implements a fovea-like attention mechanism that mimics human visual processing, automatically focusing computational resources on areas of interest while maintaining peripheral awareness of motion and change.
Status: Work in Progress - This system is under active development and subject to architectural changes.
Key features:
- Dynamic Focus: Automatically concentrates processing power on regions of interest with higher detail resolution
- Peripheral Awareness: Maintains low-resolution monitoring of the entire visual field for motion and new objects
- Adaptive Detail Levels: Adjusts processing detail from 0.0 (coarse) to 1.0 (fine) based on attention requirements (see the sketch after this list)
- Real-time Focus Switching: Automatically shifts attention to areas requiring immediate processing
- Compute Efficiency: Designed for tight, fast computation with selective high-detail processing
- Streaming Architecture: Processes video in real-time with temporal buffering
- Target Performance: 30+ FPS with <200ms latency budget
- Memory Management: Optimized for <7GB GPU memory usage
- Multi-scale Feature Extraction: Fine, medium, and coarse feature levels
- Temporal Consistency: Maintains object tracking across frames
- Motion Detection: Binary motion classification for attention triggering
- Instinctive Detectors: Pre-trained detectors for faces, hands, and potential threats
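To make the adaptive detail-level idea concrete, here is a minimal sketch of a foveal detail map. The function name `detail_map` and the Gaussian falloff are illustrative assumptions, not the module's actual API:

```python
import torch

def detail_map(h, w, focus_yx, sigma=0.25):
    """Toy foveal detail map: ~1.0 at the focus point, decaying toward
    0.0 in the periphery. Coordinates are normalized to [0, 1]; a
    Gaussian falloff is one plausible profile among many."""
    ys = torch.linspace(0, 1, h).unsqueeze(1).expand(h, w)
    xs = torch.linspace(0, 1, w).unsqueeze(0).expand(h, w)
    fy, fx = focus_yx
    dist_sq = (ys - fy) ** 2 + (xs - fx) ** 2
    return torch.exp(-dist_sq / (2 * sigma ** 2))  # [h, w], values in (0, 1]

# Detail is highest at the focus point and lowest in the far periphery
dmap = detail_map(224, 224, focus_yx=(0.5, 0.5))
print(dmap.max().item(), dmap[0, 0].item())
```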
Streaming Encoder:
- Processes video frames with temporal context
- Maintains 5-frame temporal buffer
- Extracts multi-scale features for downstream processing
- Implements causal convolutions for real-time processing
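A minimal sketch of what causal temporal convolution over a short frame buffer can look like; the `CausalTemporalEncoder` name, layer shapes, and buffer handling are assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from collections import deque

class CausalTemporalEncoder(nn.Module):
    """Toy streaming encoder: keeps a short frame buffer and left-pads the
    time axis so no output ever depends on future frames."""

    def __init__(self, in_ch=3, out_ch=32, kernel_t=3, buffer_size=5):
        super().__init__()
        self.buffer = deque(maxlen=buffer_size)  # 5-frame temporal buffer
        self.kernel_t = kernel_t
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=(kernel_t, 3, 3),
                              padding=(0, 1, 1))  # spatial padding only

    def forward(self, frame):                            # frame: [B, C, H, W]
        self.buffer.append(frame)
        clip = torch.stack(list(self.buffer), dim=2)     # [B, C, T, H, W]
        clip = F.pad(clip, (0, 0, 0, 0, self.kernel_t - 1, 0))  # pad past only
        return self.conv(clip)                           # [B, out_ch, T, H, W]

enc = CausalTemporalEncoder()
for _ in range(3):                 # feed frames one at a time, as in streaming
    feats = enc(torch.randn(1, 3, 64, 64))
print(feats.shape)                 # torch.Size([1, 32, 3, 64, 64])
```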
Attention System:
- Motion Detector: Identifies movement in peripheral vision
- Instinctive Detector: Recognizes faces, hands, and salient objects
- Attention Controller: Manages focus point and attention allocation
- Track Manager: Maintains temporal consistency of attended objects
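One plausible shape for the focus decision, combining motion and instinctive-detector responses into a single priority map; the 0.4/0.6 weighting and map names are assumptions:

```python
import torch

def choose_focus(motion_map, instinct_map, motion_weight=0.4):
    """Pick the next focus point from a weighted saliency map.

    motion_map, instinct_map: [H, W] scores in [0, 1]. Returns (row, col)
    of the highest-priority location; the weighting is illustrative."""
    saliency = motion_weight * motion_map + (1 - motion_weight) * instinct_map
    flat_idx = torch.argmax(saliency).item()
    return divmod(flat_idx, saliency.shape[1])  # flat index -> (row, col)

motion = torch.rand(56, 56)    # e.g. peripheral motion detector output
instinct = torch.rand(56, 56)  # e.g. face/hand detector responses
print(choose_focus(motion, instinct))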
Component Extraction:
- Extracts and segments objects of interest
- Generates embeddings for symbolic representation
- Performs panoptic segmentation when needed
- Encodes spatial relationships between components
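A rough sketch of the extraction step: crop a window around the focus point and pool it into a fixed-size embedding. The random projection is only a stand-in for the module's learned encoder, and real extraction would use segmentation masks:

```python
import torch
import torch.nn.functional as F

def extract_component(frame, focus_yx, crop=64, embed_dim=128):
    """Crop a square window around the focus point and pool it into a
    fixed-size embedding for downstream symbolic processing."""
    _, _, H, W = frame.shape
    y = min(max(focus_yx[0] - crop // 2, 0), H - crop)   # clamp to the frame
    x = min(max(focus_yx[1] - crop // 2, 0), W - crop)
    patch = frame[:, :, y:y + crop, x:x + crop]          # [B, C, crop, crop]
    pooled = F.adaptive_avg_pool2d(patch, 8).flatten(1)  # [B, C*8*8]
    proj = torch.randn(pooled.shape[1], embed_dim)       # stand-in projection
    return pooled @ proj                                 # [B, embed_dim]

frame = torch.randn(1, 3, 224, 224)
emb = extract_component(frame, focus_yx=(120, 96))
print(emb.shape)  # torch.Size([1, 128])
```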
Performance Monitor:
- Tracks processing latency and FPS
- Monitors memory usage
- Manages quality degradation under load
- Ensures latency budget compliance
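A toy version of such a monitor; the 30-frame rolling window and the degrade-when-over-budget rule are illustrative assumptions:

```python
import time

class LatencyMonitor:
    """Toy latency/FPS tracker that signals when to degrade quality.

    A real monitor would likely also track GPU memory and smoothed stats."""

    def __init__(self, latency_budget_ms=200.0):
        self.budget = latency_budget_ms
        self.samples = []                      # recent per-frame latencies (ms)

    def record(self, start_time):
        ms = (time.perf_counter() - start_time) * 1000.0
        self.samples = (self.samples + [ms])[-30:]  # keep the last 30 frames
        return ms

    @property
    def fps(self):
        if not self.samples:
            return float("inf")
        return 1000.0 / (sum(self.samples) / len(self.samples))

    def should_degrade(self):
        # Signal quality degradation when the latest frame blew the budget
        return bool(self.samples) and self.samples[-1] > self.budget

monitor = LatencyMonitor()
t0 = time.perf_counter()
time.sleep(0.01)                               # stand-in for frame processing
print(f"{monitor.record(t0):.1f} ms, {monitor.fps:.0f} FPS, "
      f"degrade={monitor.should_degrade()}")
```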
```
Video Input → Streaming Encoder → Attention System → Component Extraction → Symbolic Output
     ↓               ↓                    ↓                     ↓
[RGB Frames]   [Multi-scale       [Focus Decision]     [Object Segments]
                Features]         [Motion Detection]   [Embeddings]
                                  [Attention Map]      [Relationships]
```
The system employs a progressive multi-phase training approach:
Phase 1: Instinctive detector pretraining
- Train basic object detectors on synthetic data
- Initialize face, hand, and threat detection capabilities
- Duration: 10-15 epochs
Phase 2: Temporal feature learning
- Train temporal feature extraction with video data
- Build temporal consistency and motion understanding
- Duration: 20-25 epochs
Phase 3: Attention training
- Train attention mechanisms with frozen backbone (see the freezing sketch after this list)
- Develop focus switching and priority assessment
- Duration: 15-20 epochs
Phase 4: Component extraction
- Train segmentation and embedding generation
- Optimize for symbolic representation quality
- Duration: 15-20 epochs
Phase 5: Joint fine-tuning
- Optimize complete system for real-time performance
- Fine-tune all components jointly
- Duration: 5-10 epochs
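Phase 3 trains attention on top of a frozen backbone; in PyTorch that freezing step might look like the sketch below. The submodule name `streaming_encoder` is a hypothetical attribute, not a confirmed part of the API:

```python
import torch.nn as nn

def freeze_backbone(model: nn.Module) -> None:
    """Freeze encoder weights so only attention heads receive gradients.

    `streaming_encoder` is a hypothetical submodule name; substitute the
    model's actual backbone attribute."""
    for p in model.streaming_encoder.parameters():
        p.requires_grad = False
    model.streaming_encoder.eval()  # also freeze batch-norm statistics

# The optimizer should then see only the still-trainable parameters:
# optimizer = torch.optim.Adam(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-4)
```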
```bash
# Required packages
pip install torch torchvision
pip install numpy opencv-python
pip install wandb tensorboard  # Optional, for training visualization
```
```bash
# Clone the repository
git clone [repository_url]
cd TABULA2/visual_cortex

# Install the visual cortex module
pip install -e .
```
Basic usage:

```python
from visual_cortex.core import VisualCortex
import torch

# Initialize the visual cortex
cortex = VisualCortex(
    frame_height=224,
    frame_width=224,
    temporal_buffer_size=5,
    target_fps=30,
    latency_budget_ms=200.0,
)

# Process a video frame
frame = torch.randn(1, 3, 224, 224)  # [B, C, H, W]
output = cortex(frame, streaming=True)

if output is not None:
    print(f"Focused object: {output.focused_object_id}")
    print(f"Processing time: {output.processing_time_ms:.2f}ms")
    print(f"Attention candidates: {output.investigation_candidates}")
```
Training:

```python
from visual_cortex.training import MultiStageTrainer
from visual_cortex.training.training_config import TrainingConfig

# Configure training
config = TrainingConfig(
    phase="all",
    num_epochs=50,
    batch_size=4,
    learning_rate=1e-4,
)

# Initialize trainer
trainer = MultiStageTrainer(config)

# Train all phases
results = trainer.train_full_pipeline(model)
```
Key `VisualCortex` parameters:
- `frame_height`: Input frame height (default: 224)
- `frame_width`: Input frame width (default: 224)
- `temporal_buffer_size`: Number of frames in the temporal buffer (default: 5)
- `target_fps`: Target frames per second (default: 30)
- `latency_budget_ms`: Maximum processing time per frame (default: 200 ms)
See `visual_cortex/configs/training/` for detailed configuration options.
- Modular Structure: Detailed architecture and module organization
- Training Strategy: Comprehensive training guide
| Metric | Target | Critical Threshold |
|---|---|---|
| FPS | >30 | >24 |
| Latency | <200ms | <250ms |
| Detection Accuracy | >85% | >80% |
| Memory Usage | <7GB | <8GB |
| Error Rate | <1% | <3% |
This is an experimental component of the TABULA project. The architecture is subject to significant changes as the system evolves. Contributions should focus on:
- Performance optimization
- Attention mechanism improvements
- Training stability enhancements
- Real-world testing and evaluation
The system implements aggressive memory management strategies:
- Gradient checkpointing for training (sketched after this list)
- Dynamic batch size adjustment
- Temporal buffer recycling
- Feature map pooling for memory efficiency
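As one example, gradient checkpointing trades compute for memory by recomputing activations during the backward pass. A minimal sketch using `torch.utils.checkpoint` (the block structure is illustrative; `use_reentrant=False` assumes a recent PyTorch version):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedBlock(nn.Module):
    """Trades compute for memory: activations inside `body` are discarded
    during the forward pass and recomputed during backward."""

    def __init__(self, ch=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
        )

    def forward(self, x):
        # use_reentrant=False is the recommended mode in recent PyTorch
        return checkpoint(self.body, x, use_reentrant=False)

block = CheckpointedBlock()
x = torch.randn(1, 32, 224, 224, requires_grad=True)
block(x).sum().backward()  # activations are recomputed here, not stored
```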
To maintain real-time performance:
- Causal convolutions prevent future frame dependencies
- Streaming mode processes frames incrementally
- Controlled quality degradation under high load (see the sketch after this list)
- Asynchronous component extraction when possible
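A sketch of what a quality-degradation policy could look like; the thresholds and the resolution ladder are illustrative assumptions, and a real policy might also skip frames or disable optional detectors:

```python
def pick_resolution(latency_ms, budget_ms=200.0, levels=(224, 160, 112)):
    """Step input resolution down as latency approaches the budget.

    The 75%/100% thresholds and resolution ladder are illustrative only."""
    if latency_ms < 0.75 * budget_ms:
        return levels[0]  # comfortably under budget: full resolution
    if latency_ms < budget_ms:
        return levels[1]  # nearing the budget: medium resolution
    return levels[2]      # over budget: aggressive downscaling

print(pick_resolution(120.0))  # 224
print(pick_resolution(180.0))  # 160
print(pick_resolution(230.0))  # 112
```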
Part of the TABULA project - see main project license.
The foveal attention mechanism is inspired by:
- Human visual system architecture
- Primate visual cortex organization
- Active vision and saccadic eye movement research
Note: This module is part of the experimental TABULA cognitive brain project and is designed to work in conjunction with the auditory cortex and symbolic memory systems.