Visual Cortex
The Visual Cortex is an experimental, biologically inspired computer vision system designed for the TABULA project. It implements a fovea-like attention mechanism that mimics human visual processing, automatically focusing computational resources on areas of interest while maintaining peripheral awareness of motion and change.
Status: Work in Progress - This system is under active development and subject to architectural changes.
Key features:
- Dynamic Focus: Automatically concentrates processing power on regions of interest with higher detail resolution
- Peripheral Awareness: Maintains low-resolution monitoring of the entire visual field for motion and new objects
- Adaptive Detail Levels: Adjusts processing detail from 0.0 (coarse) to 1.0 (fine) based on attention requirements (see the sketch after this list)
- Real-time Focus Switching: Automatically shifts attention to areas requiring immediate processing
- Compute Efficiency: Designed for tight, fast computation with selective high-detail processing
- Streaming Architecture: Processes video in real-time with temporal buffering
- Target Performance: 30+ FPS with <200ms latency budget
- Memory Management: Optimized for <7GB GPU memory usage
- Multi-scale Feature Extraction: Fine, medium, and coarse feature levels
- Temporal Consistency: Maintains object tracking across frames
- Motion Detection: Binary motion classification for attention triggering
- Instinctive Detectors: Pre-trained detectors for faces, hands, and potential threats
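To make the adaptive detail-level idea concrete, here is a minimal sketch of a foveal detail map. The function name `detail_map` and the Gaussian falloff are illustrative assumptions, not the module's actual API:

```python
import torch

def detail_map(h, w, focus_yx, sigma=0.25):
    """Toy foveal detail map: ~1.0 at the focus point, decaying toward
    0.0 in the periphery. Coordinates are normalized to [0, 1]; a
    Gaussian falloff is one plausible profile among many."""
    ys = torch.linspace(0, 1, h).unsqueeze(1).expand(h, w)
    xs = torch.linspace(0, 1, w).unsqueeze(0).expand(h, w)
    fy, fx = focus_yx
    dist_sq = (ys - fy) ** 2 + (xs - fx) ** 2
    return torch.exp(-dist_sq / (2 * sigma ** 2))  # [h, w], values in (0, 1]

# Detail is highest at the focus point and lowest in the far periphery
dmap = detail_map(224, 224, focus_yx=(0.5, 0.5))
print(dmap.max().item(), dmap[0, 0].item())
```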
Streaming Encoder:
- Processes video frames with temporal context
- Maintains 5-frame temporal buffer
- Extracts multi-scale features for downstream processing
- Implements causal convolutions for real-time processing
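A minimal sketch of what causal temporal convolution over a short frame buffer can look like; the `CausalTemporalEncoder` name, layer shapes, and buffer handling are assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from collections import deque

class CausalTemporalEncoder(nn.Module):
    """Toy streaming encoder: keeps a short frame buffer and left-pads the
    time axis so no output ever depends on future frames."""

    def __init__(self, in_ch=3, out_ch=32, kernel_t=3, buffer_size=5):
        super().__init__()
        self.buffer = deque(maxlen=buffer_size)  # 5-frame temporal buffer
        self.kernel_t = kernel_t
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=(kernel_t, 3, 3),
                              padding=(0, 1, 1))  # spatial padding only

    def forward(self, frame):                            # frame: [B, C, H, W]
        self.buffer.append(frame)
        clip = torch.stack(list(self.buffer), dim=2)     # [B, C, T, H, W]
        clip = F.pad(clip, (0, 0, 0, 0, self.kernel_t - 1, 0))  # pad past only
        return self.conv(clip)                           # [B, out_ch, T, H, W]

enc = CausalTemporalEncoder()
for _ in range(3):                 # feed frames one at a time, as in streaming
    feats = enc(torch.randn(1, 3, 64, 64))
print(feats.shape)                 # torch.Size([1, 32, 3, 64, 64])
```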
Attention System:
- Motion Detector: Identifies movement in peripheral vision
- Instinctive Detector: Recognizes faces, hands, and salient objects
- Attention Controller: Manages focus point and attention allocation
- Track Manager: Maintains temporal consistency of attended objects
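One plausible shape for the focus decision, combining motion and instinctive-detector responses into a single priority map; the 0.4/0.6 weighting and map names are assumptions:

```python
import torch

def choose_focus(motion_map, instinct_map, motion_weight=0.4):
    """Pick the next focus point from a weighted saliency map.

    motion_map, instinct_map: [H, W] scores in [0, 1]. Returns (row, col)
    of the highest-priority location; the weighting is illustrative."""
    saliency = motion_weight * motion_map + (1 - motion_weight) * instinct_map
    flat_idx = torch.argmax(saliency).item()
    return divmod(flat_idx, saliency.shape[1])  # flat index -> (row, col)

motion = torch.rand(56, 56)    # e.g. peripheral motion detector output
instinct = torch.rand(56, 56)  # e.g. face/hand detector responses
print(choose_focus(motion, instinct))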
Component Extraction:
- Extracts and segments objects of interest
- Generates embeddings for symbolic representation
- Performs panoptic segmentation when needed
- Encodes spatial relationships between components
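A rough sketch of the extraction step: crop a window around the focus point and pool it into a fixed-size embedding. The random projection is only a stand-in for the module's learned encoder, and real extraction would use segmentation masks:

```python
import torch
import torch.nn.functional as F

def extract_component(frame, focus_yx, crop=64, embed_dim=128):
    """Crop a square window around the focus point and pool it into a
    fixed-size embedding for downstream symbolic processing."""
    _, _, H, W = frame.shape
    y = min(max(focus_yx[0] - crop // 2, 0), H - crop)   # clamp to the frame
    x = min(max(focus_yx[1] - crop // 2, 0), W - crop)
    patch = frame[:, :, y:y + crop, x:x + crop]          # [B, C, crop, crop]
    pooled = F.adaptive_avg_pool2d(patch, 8).flatten(1)  # [B, C*8*8]
    proj = torch.randn(pooled.shape[1], embed_dim)       # stand-in projection
    return pooled @ proj                                 # [B, embed_dim]

frame = torch.randn(1, 3, 224, 224)
emb = extract_component(frame, focus_yx=(120, 96))
print(emb.shape)  # torch.Size([1, 128])
```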
Performance Monitor:
- Tracks processing latency and FPS
- Monitors memory usage
- Manages quality degradation under load
- Ensures latency budget compliance
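A toy version of such a monitor; the 30-frame rolling window and the degrade-when-over-budget rule are illustrative assumptions:

```python
import time

class LatencyMonitor:
    """Toy latency/FPS tracker that signals when to degrade quality.

    A real monitor would likely also track GPU memory and smoothed stats."""

    def __init__(self, latency_budget_ms=200.0):
        self.budget = latency_budget_ms
        self.samples = []                      # recent per-frame latencies (ms)

    def record(self, start_time):
        ms = (time.perf_counter() - start_time) * 1000.0
        self.samples = (self.samples + [ms])[-30:]  # keep the last 30 frames
        return ms

    @property
    def fps(self):
        if not self.samples:
            return float("inf")
        return 1000.0 / (sum(self.samples) / len(self.samples))

    def should_degrade(self):
        # Signal quality degradation when the latest frame blew the budget
        return bool(self.samples) and self.samples[-1] > self.budget

monitor = LatencyMonitor()
t0 = time.perf_counter()
time.sleep(0.01)                               # stand-in for frame processing
print(f"{monitor.record(t0):.1f} ms, {monitor.fps:.0f} FPS, "
      f"degrade={monitor.should_degrade()}")
```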
```
Video Input → Streaming Encoder → Attention System → Component Extraction → Symbolic Output
     ↓               ↓                    ↓                     ↓
[RGB Frames]   [Multi-scale       [Focus Decision]     [Object Segments]
                Features]         [Motion Detection]   [Embeddings]
                                  [Attention Map]      [Relationships]
```
The system employs a progressive multi-phase training approach:
Phase 1: Instinctive detector pretraining
- Train basic object detectors on synthetic data
- Initialize face, hand, and threat detection capabilities
- Duration: 10-15 epochs
Phase 2: Temporal feature learning
- Train temporal feature extraction with video data
- Build temporal consistency and motion understanding
- Duration: 20-25 epochs
Phase 3: Attention training
- Train attention mechanisms with frozen backbone (see the freezing sketch after this list)
- Develop focus switching and priority assessment
- Duration: 15-20 epochs
Phase 4: Component extraction
- Train segmentation and embedding generation
- Optimize for symbolic representation quality
- Duration: 15-20 epochs
Phase 5: Joint fine-tuning
- Optimize complete system for real-time performance
- Fine-tune all components jointly
- Duration: 5-10 epochs
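Phase 3 trains attention on top of a frozen backbone; in PyTorch that freezing step might look like the sketch below. The submodule name `streaming_encoder` is a hypothetical attribute, not a confirmed part of the API:

```python
import torch.nn as nn

def freeze_backbone(model: nn.Module) -> None:
    """Freeze encoder weights so only attention heads receive gradients.

    `streaming_encoder` is a hypothetical submodule name; substitute the
    model's actual backbone attribute."""
    for p in model.streaming_encoder.parameters():
        p.requires_grad = False
    model.streaming_encoder.eval()  # also freeze batch-norm statistics

# The optimizer should then see only the still-trainable parameters:
# optimizer = torch.optim.Adam(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-4)
```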
```bash
# Required packages
pip install torch torchvision
pip install numpy opencv-python
pip install wandb tensorboard  # Optional, for training visualization
```
```bash
# Clone the repository
git clone [repository_url]
cd TABULA2/visual_cortex

# Install the visual cortex module
pip install -e .
```
Basic usage:

```python
from visual_cortex.core import VisualCortex
import torch

# Initialize the visual cortex
cortex = VisualCortex(
    frame_height=224,
    frame_width=224,
    temporal_buffer_size=5,
    target_fps=30,
    latency_budget_ms=200.0,
)

# Process a video frame
frame = torch.randn(1, 3, 224, 224)  # [B, C, H, W]
output = cortex(frame, streaming=True)

if output is not None:
    print(f"Focused object: {output.focused_object_id}")
    print(f"Processing time: {output.processing_time_ms:.2f}ms")
    print(f"Attention candidates: {output.investigation_candidates}")
```
Training:

```python
from visual_cortex.training import MultiStageTrainer
from visual_cortex.training.training_config import TrainingConfig

# Configure training
config = TrainingConfig(
    phase="all",
    num_epochs=50,
    batch_size=4,
    learning_rate=1e-4,
)

# Initialize trainer
trainer = MultiStageTrainer(config)

# Train all phases
results = trainer.train_full_pipeline(model)
```
Key `VisualCortex` parameters:
- `frame_height`: Input frame height (default: 224)
- `frame_width`: Input frame width (default: 224)
- `temporal_buffer_size`: Number of frames in the temporal buffer (default: 5)
- `target_fps`: Target frames per second (default: 30)
- `latency_budget_ms`: Maximum processing time per frame (default: 200 ms)
See `visual_cortex/configs/training/` for detailed configuration options.
- Modular Structure: Detailed architecture and module organization
- Training Strategy: Comprehensive training guide
| Metric | Target | Critical Threshold |
|---|---|---|
| FPS | >30 | >24 |
| Latency | <200ms | <250ms |
| Detection Accuracy | >85% | >80% |
| Memory Usage | <7GB | <8GB |
| Error Rate | <1% | <3% |
This is an experimental component of the TABULA project. The architecture is subject to significant changes as the system evolves. Contributions should focus on:
- Performance optimization
- Attention mechanism improvements
- Training stability enhancements
- Real-world testing and evaluation
The system implements aggressive memory management strategies:
- Gradient checkpointing for training (sketched after this list)
- Dynamic batch size adjustment
- Temporal buffer recycling
- Feature map pooling for memory efficiency
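As one example, gradient checkpointing trades compute for memory by recomputing activations during the backward pass. A minimal sketch using `torch.utils.checkpoint` (the block structure is illustrative; `use_reentrant=False` assumes a recent PyTorch version):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedBlock(nn.Module):
    """Trades compute for memory: activations inside `body` are discarded
    during the forward pass and recomputed during backward."""

    def __init__(self, ch=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
        )

    def forward(self, x):
        # use_reentrant=False is the recommended mode in recent PyTorch
        return checkpoint(self.body, x, use_reentrant=False)

block = CheckpointedBlock()
x = torch.randn(1, 32, 224, 224, requires_grad=True)
block(x).sum().backward()  # activations are recomputed here, not stored
```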
To maintain real-time performance:
- Causal convolutions prevent future frame dependencies
- Streaming mode processes frames incrementally
- Controlled quality degradation under high load (see the sketch after this list)
- Asynchronous component extraction when possible
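A sketch of what a quality-degradation policy could look like; the thresholds and the resolution ladder are illustrative assumptions, and a real policy might also skip frames or disable optional detectors:

```python
def pick_resolution(latency_ms, budget_ms=200.0, levels=(224, 160, 112)):
    """Step input resolution down as latency approaches the budget.

    The 75%/100% thresholds and resolution ladder are illustrative only."""
    if latency_ms < 0.75 * budget_ms:
        return levels[0]  # comfortably under budget: full resolution
    if latency_ms < budget_ms:
        return levels[1]  # nearing the budget: medium resolution
    return levels[2]      # over budget: aggressive downscaling

print(pick_resolution(120.0))  # 224
print(pick_resolution(180.0))  # 160
print(pick_resolution(230.0))  # 112
```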
Part of the TABULA project - see main project license.
The foveal attention mechanism is inspired by:
- Human visual system architecture
- Primate visual cortex organization
- Active vision and saccadic eye movement research
Note: This module is part of the experimental TABULA cognitive brain project and is designed to work in conjunction with the auditory cortex and symbolic memory systems.