Fovea Vision for Computer Vision
The other morning I was sleeping in a little while the sky was still dark. My wife had woken up before me to tend to our boy, and at some point she needed to turn on the lights in our bedroom.
Something very brief but very cool happened when she turned the lights on. While my eyes were still closed and I was half asleep, the sudden change in the room's brightness created a very interesting pattern in my vision for a moment. I saw a bright solid dot right at the center of my vision, surrounded by rings of pixel-like dots, mostly yellow with some green, blue and red. The rings alternated between thick rings with fewer "pixels" and thinner, denser rings, repeating outwards from the small dot at the center.
This gave me the idea of using a similar approach for my visual cortex module, where the center (the fovea) has a high pixel resolution for extreme detail and the outer, overlapping rings carry less and less detail. This ensures compute is utilised where it's really needed, while still capturing enough detail in the surrounding space to generalise and switch focus.
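To make the idea concrete, here's a rough sketch (separate from the prototype below, and with made-up parameter values) of how a foveated sampler might crop concentric regions around a fixation point and store the outer ones at progressively lower resolution:

import cv2
import numpy as np

def foveated_pyramid(frame, cx, cy, base_radius=96, rings=3, falloff=2.0):
    # Illustrative only: each ring covers a wider field of view but is stored
    # at a lower resolution, so most pixels (and compute) go to the fovea.
    h, w = frame.shape[:2]
    levels = []
    for i in range(rings):
        r = int(base_radius * (falloff ** i))                 # wider field per ring
        x0, x1 = max(0, cx - r), min(w, cx + r)
        y0, y1 = max(0, cy - r), min(h, cy + r)
        side = max(8, int(2 * base_radius / (falloff ** i)))  # fewer pixels per ring
        levels.append(cv2.resize(frame[y0:y1, x0:x1], (side, side),
                                 interpolation=cv2.INTER_AREA))
    return levels  # levels[0] = sharp fovea, levels[-1] = coarse periphery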
To test this theory, I built a lightweight Python prototype to start testing how the model could shift its focus between points of interest, and how effective different approaches could be. This is still very much a prototype which I'll keep working on and push to the GPU once I've found the winning formula, but here's an early view of how the module looks for points of interest and shifts its attention to them.
As for the human eye, it moves approximately two to three times per second. These rapid movements, called saccades, shift our gaze to new points of interest. Even when focused on a single object, the eyes are constantly making tiny, vibrating movements up to around 100 times a second, so if the motion tracking in the video, where my AI shifts its attention around, looks jittery, it's probably still less jittery than your eyes reading this post right now!
If you're interested and would like to play around with the code yourself, here is the main module "fovea_controller.py":
# MIT License - Please include reference to DJ-AI.AI - Dayyan James
import argparse
from collections import deque
import cv2
import numpy as np
import time
from face_prior import FacePrior
def clamp(v, lo, hi): return max(lo, min(hi, v))
def gaussian_prior(shape, cx, cy, sigma):
h, w = shape
if w == 0 or h == 0: # guard
return np.zeros((h, w), dtype=np.float32)
ys = np.arange(h, dtype=np.float32)
xs = np.arange(w, dtype=np.float32)
xx, yy = np.meshgrid(xs, ys)
g = np.exp(-((xx - cx)**2 + (yy - cy)**2) / (2.0 * sigma * sigma))
m = g.max()
if m > 0: g /= m
return g.astype(np.float32)
class Kalman2D:
def __init__(self, x0, y0):
self.kf = cv2.KalmanFilter(4, 2, type=cv2.CV_32F)
dt = 1.0
self.kf.transitionMatrix = np.array([[1,0,dt,0],[0,1,0,dt],[0,0,1,0],[0,0,0,1]], np.float32)
self.kf.measurementMatrix = np.array([[1,0,0,0],[0,1,0,0]], np.float32)
self.kf.processNoiseCov = np.diag([1e-2,1e-2,1e-1,1e-1]).astype(np.float32)
self.kf.measurementNoiseCov = np.diag([5e-2,5e-2]).astype(np.float32)
self.kf.statePost = np.array([[x0],[y0],[0],[0]], np.float32)
self.kf.errorCovPost = np.eye(4, dtype=np.float32)
def predict(self):
p = self.kf.predict()
return float(p[0,0]), float(p[1,0])
def correct(self, x, y):
m = np.array([[np.float32(x)],[np.float32(y)]], np.float32)
e = self.kf.correct(m)
return float(e[0,0]), float(e[1,0])
class StableFoveaController:
def __init__(self, src_w, src_h,
proc_width=512, flow_stride=2,
min_radius=96, max_radius=256,
dwell_frames=4, max_speed_px=40,
motion_w=0.8, edge_w=0.2,
sal_ema=0.8, zoom_k=0.6, zoom_ema=0.9,
stickiness=0.25, stick_sigma_frac=0.08,
switch_margin=0.10, roi_frac=0.35):
self.reset_dims(src_w, src_h, proc_width)
self.flow_stride = max(1, int(flow_stride))
self.cx, self.cy = src_w // 2, src_h // 2
self.min_r, self.max_r = int(min_radius), int(max_radius)
self.r = int(0.5 * (min_radius + max_radius))
self.dwell_frames = int(dwell_frames)
self.max_speed = float(max_speed_px)
self.motion_w, self.edge_w = float(motion_w), float(edge_w)
self.sal_ema, self.zoom_k, self.zoom_ema = float(sal_ema), float(zoom_k), float(zoom_ema)
self.prev_small = None
self.prev_small_f = None
self.sal_ema_map = None
self.dwell_counter = 0
self.trail = deque(maxlen=24)
self.zoom_state = 0.5
self.frame_idx = 0
self.last_flow_n = None
self.dis = cv2.DISOpticalFlow_create(cv2.DISOPTICAL_FLOW_PRESET_MEDIUM)
self.stickiness = float(stickiness)
self.stick_sigma_frac = float(stick_sigma_frac)
self.switch_margin = float(switch_margin)
self.roi_frac = float(roi_frac)
self.kalman = Kalman2D(self.cx, self.cy)
self.face_prior = FacePrior(method="haar", mouth_boost=0.9)
        self.face_weight = 0.7  # 0.2–0.5 typical
self.face_blend = "add" # "add" or "max"
self.debug_faces = [] # for optional overlay
def reset_dims(self, src_w, src_h, proc_width):
# Ensure valid, positive processing size
self.SW, self.SH = int(max(1, src_w)), int(max(1, src_h))
self.proc_width = int(max(32, proc_width))
self.scale = self.proc_width / float(max(1, self.SW))
self.proc_height = int(max(32, round(self.SH * self.scale)))
def _to_small(self, gray_src):
if gray_src is None or gray_src.size == 0:
return None, None
# Resize guards
small = cv2.resize(gray_src, (self.proc_width, self.proc_height), interpolation=cv2.INTER_AREA)
small_f = (small.astype(np.float32) / 255.0) if small.size else None
return small, small_f
def _global_motion_shift(self, prev_f, curr_f):
try:
if prev_f is None or curr_f is None: return 0.0, 0.0
if prev_f.shape != curr_f.shape or prev_f.size == 0 or curr_f.size == 0:
return 0.0, 0.0
hann = cv2.createHanningWindow((prev_f.shape[1], prev_f.shape[0]), cv2.CV_32F)
(dx, dy), _ = cv2.phaseCorrelate(prev_f, curr_f, hann)
return float(dx), float(dy)
except Exception:
return 0.0, 0.0
def _saliency(self, small_gray, small_float):
if small_gray is None or small_float is None:
return None
# Reset prev if dims changed
if self.prev_small is not None and self.prev_small.shape != small_gray.shape:
self.prev_small = None
self.prev_small_f = None
self.sal_ema_map = None
self.last_flow_n = None
if self.prev_small is None:
flow_mag_n = np.zeros_like(small_gray, np.float32)
else:
gdx, gdy = self._global_motion_shift(self.prev_small_f, small_float)
if (self.frame_idx % self.flow_stride) == 0:
flow = self.dis.calc(self.prev_small, small_gray, None)
fx, fy = flow[..., 0] - gdx, flow[..., 1] - gdy
flow_mag = np.sqrt(fx*fx + fy*fy).astype(np.float32)
p1, p99 = np.percentile(flow_mag, 1.0), np.percentile(flow_mag, 99.0)
if p99 <= p1 + 1e-6:
flow_mag_n = np.zeros_like(flow_mag, np.float32)
else:
flow_mag_n = np.clip((flow_mag - p1) / (p99 - p1), 0, 1)
self.last_flow_n = flow_mag_n
else:
flow_mag_n = self.last_flow_n if self.last_flow_n is not None else np.zeros_like(small_gray, np.float32)
gx = cv2.Sobel(small_gray, cv2.CV_32F, 1, 0, ksize=3)
gy = cv2.Sobel(small_gray, cv2.CV_32F, 0, 1, ksize=3)
grad_mag = cv2.magnitude(gx, gy)
p1, p99 = np.percentile(grad_mag, 1.0), np.percentile(grad_mag, 99.0)
grad_mag_n = np.zeros_like(grad_mag, np.float32) if p99 <= p1 + 1e-6 else np.clip((grad_mag - p1) / (p99 - p1), 0, 1)
S = self.motion_w * flow_mag_n + self.edge_w * grad_mag_n
face_heat, faces_src = self.face_prior.compute_on_small(
small_gray,
scale_up=1.0 / self.scale,
src_shape=(self.SH, self.SW)
)
self.debug_faces = faces_src # keep for overlay/debug
if face_heat is not None and face_heat.size:
if self.face_blend == "max":
S = np.maximum(S, face_heat) # faces dominate
else:
S = (1.0 - self.face_weight) * S + self.face_weight * face_heat # additive bias
# Stickiness prior
scx, scy = self.cx * self.scale, self.cy * self.scale
sigma = max(4.0, self.stick_sigma_frac * min(self.proc_width, self.proc_height))
G = gaussian_prior(S.shape, scx, scy, sigma)
S = (1.0 - self.stickiness) * S + self.stickiness * G
# Spatial & temporal smoothing
S = cv2.GaussianBlur(S, (0, 0), sigmaX=1.0, sigmaY=1.0)
self.sal_ema_map = S if self.sal_ema_map is None else self.sal_ema * self.sal_ema_map + (1.0 - self.sal_ema) * S
self.prev_small = small_gray
self.prev_small_f = small_float
return self.sal_ema_map
def _centroid(self, S):
if S is None or S.size == 0 or S.ndim != 2:
return self.cx, self.cy, 0.0
scx, scy = self.cx * self.scale, self.cy * self.scale
roi_r = max(12.0, self.roi_frac * min(self.SW, self.SH) * self.scale)
x0 = int(clamp(scx - roi_r, 0, S.shape[1]-1)); x1 = int(clamp(scx + roi_r, 0, S.shape[1]-1))
y0 = int(clamp(scy - roi_r, 0, S.shape[0]-1)); y1 = int(clamp(scy + roi_r, 0, S.shape[0]-1))
roi = S[y0:y1+1, x0:x1+1]
roi_mean = float(roi.mean()) if roi.size else 0.0
thresh = np.percentile(S, 90.0)
mask = (S >= thresh)
if not mask.any():
return self.cx, self.cy, float(S.max() if S.size else 0.0)
ys, xs = np.nonzero(mask)
w = S[ys, xs]
wsum = w.sum()
x_small = float((xs * w).sum() / max(wsum, 1e-6))
y_small = float((ys * w).sum() / max(wsum, 1e-6))
peak = float(S.max())
# Hysteresis: stay local unless global is clearly better
if roi.size and peak < (roi_mean * (1.0 + self.switch_margin)):
ysr, xsr = np.nonzero(roi >= np.percentile(roi, 85.0))
if ysr.size:
wr = roi[ysr, xsr]; wrsum = wr.sum()
xr_small = x0 + float((xsr * wr).sum() / max(wrsum, 1e-6))
yr_small = y0 + float((ysr * wr).sum() / max(wrsum, 1e-6))
x_small, y_small = xr_small, yr_small
return x_small / self.scale, y_small / self.scale, peak
def _move(self, tx, ty):
px, py = self.kalman.predict()
txb, tyb = 0.6 * tx + 0.4 * px, 0.6 * ty + 0.4 * py
dx, dy = txb - self.cx, tyb - self.cy
dist = float(np.hypot(dx, dy))
if dist <= self.max_speed or dist == 0:
self.cx, self.cy = int(txb), int(tyb)
else:
s = self.max_speed / dist
self.cx += int(dx * s); self.cy += int(dy * s)
self.cx = int(clamp(self.cx, 0, self.SW - 1))
self.cy = int(clamp(self.cy, 0, self.SH - 1))
self.kalman.correct(self.cx, self.cy)
def _update_zoom(self, peak_sal):
self.zoom_state = self.zoom_ema * self.zoom_state + (1.0 - self.zoom_ema) * float(peak_sal)
t = max(0.0, min(1.0, self.zoom_state ** self.zoom_k))
r_target = self.max_r - t * (self.max_r - self.min_r)
self.r = int(0.85 * self.r + 0.15 * r_target)
self.r = int(clamp(self.r, self.min_r, self.max_r))
def step(self, frame_bgr):
if frame_bgr is None or frame_bgr.size == 0:
return {"cx": self.cx, "cy": self.cy, "r": self.r, "zoom_state": float(self.zoom_state),
"peak_sal": 0.0, "moved": False, "trail": list(self.trail),
"sal_small": np.zeros((80, 160), np.uint8)}
gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
small, small_f = self._to_small(gray)
S = self._saliency(small, small_f)
tx, ty, peak = self._centroid(S)
moved = False
if self.dwell_counter > 0:
self.dwell_counter -= 1
self._move(tx, ty)
else:
dist = float(np.hypot(tx - self.cx, ty - self.cy))
if dist > 0.40 * self.r:
self._move(tx, ty); self.dwell_counter = self.dwell_frames; moved = True
else:
self._move(tx, ty)
self._update_zoom(peak)
self.trail.append((self.cx, self.cy))
self.frame_idx += 1
# HUD: robust resize
if S is None or S.size == 0 or S.ndim != 2 or S.shape[1] == 0:
sal_small = np.zeros((80, 160), np.uint8)
else:
sh, sw = int(S.shape[0]), int(S.shape[1])
target_w = 160
target_h = max(1, int(round(sh * (target_w / float(max(1, sw))))))
sal_small = cv2.resize((S * 255).astype(np.uint8), (target_w, target_h), interpolation=cv2.INTER_AREA)
return {"cx": self.cx, "cy": self.cy, "r": self.r, "zoom_state": float(self.zoom_state),
"peak_sal": float(peak), "moved": moved, "trail": list(self.trail), "sal_small": sal_small}
def draw_overlay(frame, state, rings=2, ring_spacing=0.75):
cx, cy, r = state["cx"], state["cy"], state["r"]
cv2.circle(frame, (cx, cy), r, (0, 255, 255), 2, lineType=cv2.LINE_AA)
cv2.circle(frame, (cx, cy), max(2, r // 12), (0, 255, 255), -1, lineType=cv2.LINE_AA)
for i in range(1, int(rings)):
rr = int(r * (1 + i * ring_spacing))
cv2.circle(frame, (cx, cy), rr, (220, 220, 220), 1, lineType=cv2.LINE_AA)
for i in range(1, len(state["trail"])):
cv2.line(frame, state["trail"][i-1], state["trail"][i], (128, 200, 255), 2, lineType=cv2.LINE_AA)
hud = state["sal_small"]
if hud is not None and hud.size:
if hud.ndim == 2:
hud = cv2.applyColorMap(cv2.cvtColor(hud, cv2.COLOR_GRAY2BGR), cv2.COLORMAP_INFERNO)
h, w = frame.shape[:2]; sh, sw = hud.shape[:2]
y0, x0 = max(0, h - sh - 8), 8
if y0+sh <= h and x0+sw <= w:
frame[y0:y0+sh, x0:x0+sw] = hud
txt = f"({cx:4d},{cy:4d}) r={r:3d} zoom={state['zoom_state']:.2f} Smax={state['peak_sal']:.2f}"
cv2.putText(frame, txt, (10, 26), cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0,0,0), 3, cv2.LINE_AA)
cv2.putText(frame, txt, (10, 26), cv2.FONT_HERSHEY_SIMPLEX, 0.6, (255,255,255), 1, cv2.LINE_AA)
return frame
def process_video(input_path, output_path,
proc_width=512, flow_stride=2,
min_radius=96, max_radius=256,
dwell=4, max_speed=40,
motion_w=0.8, edge_w=0.2,
ring_count=2, ring_spacing=0.75,
stickiness=0.25, stick_sigma_frac=0.08,
switch_margin=0.10, roi_frac=0.35):
cap = cv2.VideoCapture(input_path)
if not cap.isOpened():
raise RuntimeError(f"Could not open input video: {input_path}")
W = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)) or 1
H = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)) or 1
fps = cap.get(cv2.CAP_PROP_FPS)
fps = float(fps) if fps and fps > 0 else 30.0
out = cv2.VideoWriter(output_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (W, H))
ctrl = StableFoveaController(W, H, proc_width, flow_stride, min_radius, max_radius,
dwell, max_speed, motion_w, edge_w,
sal_ema=0.8, zoom_k=0.6, zoom_ema=0.9,
stickiness=stickiness, stick_sigma_frac=stick_sigma_frac,
switch_margin=switch_margin, roi_frac=roi_frac)
t0 = time.time()
frames = 0
while True:
ret, frame = cap.read()
if not ret or frame is None or frame.size == 0:
break
# If stream dimensions change mid-video, adapt
h, w = frame.shape[:2]
if w != ctrl.SW or h != ctrl.SH:
ctrl.reset_dims(w, h, proc_width)
state = ctrl.step(frame)
draw_overlay(frame, state, rings=ring_count, ring_spacing=ring_spacing)
        out.write(frame)
        frames += 1  # count processed frames for the summary printed below
cap.release()
out.release()
dur = time.time() - t0
print(f"Processed {frames} frames in {dur:.2f}s ({frames / max(dur,1e-6):.1f} FPS). Output: {output_path}")
if __name__ == "__main__":
ap = argparse.ArgumentParser(description="Fast+Stable fovea control with robust guards.")
ap.add_argument("--input", "-i", required=True)
ap.add_argument("--output", "-o", required=True)
ap.add_argument("--proc-width", type=int, default=512)
ap.add_argument("--flow-stride", type=int, default=2)
ap.add_argument("--min-radius", type=int, default=96)
ap.add_argument("--max-radius", type=int, default=256)
ap.add_argument("--dwell", type=int, default=4)
ap.add_argument("--max-speed", type=int, default=40)
ap.add_argument("--motion-w", type=float, default=0.8)
ap.add_argument("--edge-w", type=float, default=0.2)
ap.add_argument("--ring-count", type=int, default=2)
ap.add_argument("--ring-spacing", type=float, default=0.75)
ap.add_argument("--stickiness", type=float, default=0.25)
ap.add_argument("--stick-sigma-frac", type=float, default=0.08)
ap.add_argument("--switch-margin", type=float, default=0.10)
ap.add_argument("--roi-frac", type=float, default=0.35)
args = ap.parse_args()
process_video(args.input, args.output,
proc_width=args.proc_width, flow_stride=args.flow_stride,
min_radius=args.min_radius, max_radius=args.max_radius,
dwell=args.dwell, max_speed=args.max_speed,
motion_w=args.motion_w, edge_w=args.edge_w,
ring_count=args.ring_count, ring_spacing=args.ring_spacing,
stickiness=args.stickiness, stick_sigma_frac=args.stick_sigma_frac,
                  switch_margin=args.switch_margin, roi_frac=args.roi_frac)

And here is an add-on I created to make faces get more focus (you'll need both files for the code above to work):
face_prior.py
# MIT License - please include reference to DJ-AI.AI - Dayyan James
import cv2
import numpy as np
from typing import List, Tuple, Optional
def _gaussian_2d(h: int, w: int, cx: float, cy: float, sigma: float) -> np.ndarray:
"""Return a normalized 2D Gaussian heatmap centered at (cx, cy)."""
if h <= 0 or w <= 0:
return np.zeros((max(0, h), max(0, w)), dtype=np.float32)
ys = np.arange(h, dtype=np.float32)
xs = np.arange(w, dtype=np.float32)
xx, yy = np.meshgrid(xs, ys)
g = np.exp(-(((xx - cx) ** 2) + ((yy - cy) ** 2)) / (2.0 * sigma * sigma))
m = g.max()
if m > 0:
g /= m
return g.astype(np.float32)
class FacePrior:
"""
Face-prior generator to bias fovea toward faces (and optionally the mouth region).
Works on the *downscaled* gray frame to be cheap, and returns a heatmap
that aligns with your saliency map `S` (same HxW).
Usage:
prior = FacePrior(method="haar") # or method="dnn", with model paths
heat, faces_src = prior.compute_on_small(small_gray, scale_up=1.0/scale, src_shape=(SH, SW))
S = (1 - w_face)*S + w_face*heat
"""
def __init__(self,
method: str = "haar",
haar_face_cascade: Optional[str] = None,
haar_mouth_cascade: Optional[str] = None,
dnn_proto: Optional[str] = None,
dnn_model: Optional[str] = None,
dnn_conf_thresh: float = 0.6,
min_size_frac: float = 0.06,
max_size_frac: float = 0.60,
mouth_boost: float = 0.25):
"""
Args:
method: "haar" or "dnn"
haar_face_cascade: path to Haar cascade xml for frontal face (defaults to OpenCV bundled path)
haar_mouth_cascade: optional mouth cascade (if available)
dnn_proto, dnn_model: paths for OpenCV DNN face detector (Res10 SSD deploy.prototxt + .caffemodel)
dnn_conf_thresh: confidence threshold for DNN detections
min_size_frac / max_size_frac: clamp face sizes (fraction of downscaled shorter side)
mouth_boost: extra heat around lower half (mouth) region when face is found
"""
self.method = method.lower().strip()
self.dnn_conf = float(dnn_conf_thresh)
self.min_size_frac = float(min_size_frac)
self.max_size_frac = float(max_size_frac)
self.mouth_boost = float(mouth_boost)
self.face_cascade = None
self.mouth_cascade = None
self.net = None
if self.method == "haar":
# Use bundled cascade if not provided
if haar_face_cascade is None:
haar_face_cascade = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
self.face_cascade = cv2.CascadeClassifier(haar_face_cascade)
if haar_mouth_cascade is not None:
self.mouth_cascade = cv2.CascadeClassifier(haar_mouth_cascade)
elif self.method == "dnn":
if not (dnn_proto and dnn_model):
raise ValueError("DNN method requires dnn_proto (deploy.prototxt) and dnn_model (.caffemodel)")
self.net = cv2.dnn.readNetFromCaffe(dnn_proto, dnn_model)
else:
raise ValueError(f"Unknown method: {method}")
def _detect_haar(self, small_gray: np.ndarray) -> List[Tuple[int, int, int, int]]:
h, w = small_gray.shape[:2]
if h == 0 or w == 0:
return []
min_side = min(h, w)
min_size = max(1, int(self.min_size_frac * min_side))
max_size = max(min_size, int(self.max_size_frac * min_side))
faces = self.face_cascade.detectMultiScale(
small_gray,
scaleFactor=1.1,
minNeighbors=4,
flags=cv2.CASCADE_SCALE_IMAGE,
minSize=(min_size, min_size),
maxSize=(max_size, max_size)
)
return [(int(x), int(y), int(w_), int(h_)) for (x, y, w_, h_) in faces]
def _detect_dnn(self, small_gray: np.ndarray) -> List[Tuple[int, int, int, int]]:
# Run SSD on a BGR image; convert gray->BGR
h, w = small_gray.shape[:2]
if h == 0 or w == 0:
return []
bgr = cv2.cvtColor(small_gray, cv2.COLOR_GRAY2BGR)
blob = cv2.dnn.blobFromImage(bgr, 1.0, (300, 300), (104.0, 177.0, 123.0), swapRB=False, crop=False)
self.net.setInput(blob)
det = self.net.forward()
boxes = []
for i in range(det.shape[2]):
conf = float(det[0, 0, i, 2])
if conf < self.dnn_conf:
continue
x1 = int(det[0, 0, i, 3] * w)
y1 = int(det[0, 0, i, 4] * h)
x2 = int(det[0, 0, i, 5] * w)
y2 = int(det[0, 0, i, 6] * h)
x1, y1 = max(0, x1), max(0, y1)
x2, y2 = min(w - 1, x2), min(h - 1, y2)
boxes.append((x1, y1, max(1, x2 - x1), max(1, y2 - y1)))
return boxes
def compute_on_small(self,
small_gray: np.ndarray,
scale_up: float,
src_shape: Tuple[int, int],
return_boxes_src: bool = True
) -> Tuple[np.ndarray, Optional[List[Tuple[int, int, int, int]]]]:
"""
Args:
small_gray: downscaled grayscale frame (HxW), same space as your saliency map
scale_up: factor to map small coords -> source coords (i.e., 1/scale)
src_shape: (H_src, W_src) of original frame
Returns:
heatmap (HxW float32 in [0,1]), faces_src (list of boxes in source coords) if requested
"""
if small_gray is None or small_gray.ndim != 2 or small_gray.size == 0:
return np.zeros((0, 0), np.float32), [] if return_boxes_src else None
if self.method == "haar":
boxes = self._detect_haar(small_gray)
else:
boxes = self._detect_dnn(small_gray)
h, w = small_gray.shape[:2]
heat = np.zeros((h, w), dtype=np.float32)
faces_src = []
for (x, y, bw, bh) in boxes:
cx = x + 0.5 * bw
cy = y + 0.5 * bh
sigma = 0.35 * max(bw, bh) # broad to cover the full face
heat = np.maximum(heat, _gaussian_2d(h, w, cx, cy, sigma))
# Optional "mouth emphasis": boost lower half
if self.mouth_cascade is not None or self.mouth_boost > 0:
mouth_y = y + int(0.65 * bh) # approximate mouth center
mouth_sigma = 0.22 * max(bw, bh)
heat = np.maximum(heat, self.mouth_boost * _gaussian_2d(h, w, cx, mouth_y, mouth_sigma))
if return_boxes_src:
# Map to source coords
sx = int(round(x * scale_up)); sy = int(round(y * scale_up))
sw = int(round(bw * scale_up)); sh = int(round(bh * scale_up))
Hs, Ws = src_shape
sx = max(0, min(Ws - 1, sx)); sy = max(0, min(Hs - 1, sy))
sw = max(1, min(Ws - sx, sw)); sh = max(1, min(Hs - sy, sh))
faces_src.append((sx, sy, sw, sh))
if heat.size:
m = heat.max()
if m > 0:
heat /= m
        return heat.astype(np.float32), faces_src if return_boxes_src else None

Of course, the idea is to have the AI model adjust these settings on the fly: say, when it wants to read fine text, it will need to reduce the minimum radius for better localised detail while increasing the edge weight and lowering the motion weight. These I will leave the AI to learn to control itself, with some gates for the early days when it's still learning the world and everything is novel.
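As a rough sketch of what that on-the-fly adjustment could look like (the preset names, values, and the apply_preset helper are illustrative, not part of the module above):

# Hypothetical presets the model could switch between; the values are only illustrative.
PRESETS = {
    "reading":  dict(min_r=72, max_r=192, edge_w=0.45, motion_w=0.55, stickiness=0.35),
    "tracking": dict(min_r=96, max_r=256, edge_w=0.15, motion_w=0.85, stickiness=0.20),
}

def apply_preset(ctrl, name):
    # Mutates an existing StableFoveaController in place using its real attributes.
    p = PRESETS[name]
    ctrl.min_r, ctrl.max_r = p["min_r"], p["max_r"]
    ctrl.edge_w, ctrl.motion_w = p["edge_w"], p["motion_w"]
    ctrl.stickiness = p["stickiness"]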
Have fun!
UPDATES:
I've had a couple of questions about the various args, so here's a quick guide to get you going:
If the fovea is jittery / ping-ponging
Increase --stickiness to 0.30–0.45
Biases toward staying near the current point.
Increase --switch-margin to 0.15–0.30
Requires the new target to be clearly better.
Increase --dwell to 6–10
Holds focus for more frames before big moves.
Lower --max-speed to 20–30
Caps per-frame motion so the center doesn't lurch.
If tracking feels sluggish / won't follow genuine motion
Decrease --stickiness to 0.10–0.20
Decrease --switch-margin to 0.05–0.10
Decrease --dwell to 3–4
Increase --max-speed to 40–60
Increase motion signal: --motion-w up to 0.85–0.95, --edge-w down.
If it misses faces / lips
Add the face prior and blend:
Weight: start with self.face_weight = 0.30–0.45
Blend: "max" if you want faces to always win, or "add" for a softer bias
For mouth emphasis: mouth_boost=0.30–0.50
(If still weak) increase --proc-width to 640–768 so faces survive the downscale.
If it chases background motion (flags, trees, traffic)
Increase --stickiness (0.35–0.50)
Increase --switch-margin (0.20–0.30)
Decrease --motion-w a bit (e.g., 0.65–0.75) and increase --edge-w (0.25–0.35)
Keep face prior on if people are present.
If camera shake confuses it (hand-held footage)
Increase --dwell (6–10) and stickiness (0.35–0.50)
Lower --max-speed (20–30)
Raise --switch-margin (0.20–0.30)
If you want tight reading / fine detail focus
Decrease --min-radius (e.g., 72–96) and --max-radius (to ~192–224)
Increase --edge-w (0.35–0.50) and lower --motion-w
Increase --proc-width (640+)
Slightly increase --stickiness (0.30–0.40) to avoid wandering
Multiple people in frame (pick one and stick)
Use face prior with "max" blend and moderate weight (0.30–0.40)
Increase --stickiness (0.35–0.45) and --switch-margin (0.15–0.25)
ROI: if you want center bias, raise --roi-frac (0.40–0.50)
Fast action / sports
Increase --max-speed (50–80)
Lower --dwell (2–4) and stickiness (0.15–0.25)
Increase --motion-w (0.85–0.95)
Consider --proc-width 640 for better motion detail
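For example, a hand-held reading scenario might combine several of the suggestions above into one run (the file paths are placeholders; note that the face-prior weight and mouth_boost live in the controller's constructor rather than on the command line):

python fovea_controller.py -i clip.mp4 -o clip_fovea.mp4 \
    --proc-width 640 --min-radius 80 --max-radius 208 \
    --edge-w 0.4 --motion-w 0.6 --stickiness 0.35 \
    --dwell 8 --max-speed 25 --switch-margin 0.2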