Failure Mode Analysis (FMA)

Version: 4.1
Context: Known failure states and recovery strategies.

1. Infrastructure Failures

| Component | Failure Mode | Detection | Recovery Strategy |
| --- | --- | --- | --- |
| Database | Connection Lost | InfraError::DbConnection | Retry (Exponential Backoff). If retries are exhausted, return 503 Service Unavailable. |
| Coordination | Valkey Unreachable | InfraError::Io | Degrade. Real-time stats freeze. Write-shield fails open (assume new profile). |
| FileSystem | Disk Full | InfraError::Io | Crash. Alert Admin. No automatic recovery possible. |
| Network | Upstream Timeout | InfraError::Network | Retry (3 attempts). If all attempts fail, serve stale cache if available; otherwise return an error. |
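
The Database row's strategy (retry with exponential backoff, then 503) could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the `InfraError` variants are reduced to those named in the table, and `retry_with_backoff`, `status_code`, the doubling factor, and the 500 fallback are all assumptions made for the example.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Stand-in for the project's error type; only the variants named in the
/// table are modelled here (assumption).
#[derive(Debug, PartialEq)]
pub enum InfraError {
    DbConnection,
    Io,
    Network,
}

/// Calls `op` until it succeeds or `max_attempts` is reached, doubling the
/// delay after each lost connection. `op` always runs at least once.
/// Hypothetical helper, not part of the documented API.
pub fn retry_with_backoff<T, F>(
    max_attempts: u32,
    base_delay: Duration,
    mut op: F,
) -> Result<T, InfraError>
where
    F: FnMut() -> Result<T, InfraError>,
{
    let mut delay = base_delay;
    let mut attempt = 1;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Only a lost DB connection is treated as transient in this sketch.
            Err(InfraError::DbConnection) if attempt < max_attempts => {
                sleep(delay);
                delay = delay.saturating_mul(2); // exponential growth, no overflow
                attempt += 1;
            }
            // Retries exhausted, or a non-retryable error.
            Err(err) => return Err(err),
        }
    }
}

/// Maps the final result to an HTTP status. The 503 case follows the table;
/// the 200 and 500 cases are assumptions for illustration.
pub fn status_code<T>(result: &Result<T, InfraError>) -> u16 {
    match result {
        Ok(_) => 200,
        Err(InfraError::DbConnection) => 503,
        Err(_) => 500,
    }
}
```

A caller would wrap its query in a closure, for example `retry_with_backoff(5, Duration::from_millis(100), || run_query())`, where `run_query` is a placeholder. A production version would likely add jitter and an upper bound on the delay.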

2. Application Failures

| Component | Failure Mode | Detection | Recovery Strategy |
| --- | --- | --- | --- |
| Worker | Node Crash / Timeout | Heartbeat Missed | Re-queue. The Job is returned to the queue for another worker to pick up. |
| Physics | Panic (e.g., Div by 0) | std::panic::catch_unwind | Isolate. The specific optimization run fails, but the Worker process survives. Report error to Hive. |
| Job | Poison Pill (Invalid Config) | Validation Error | Reject. Mark Job as Failed in DB. Do not re-queue. |
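
The Physics row relies on `std::panic::catch_unwind` to keep a panicking run from taking down the Worker. A minimal sketch of that isolation pattern is shown here; `RunOutcome`, `run_isolated`, and `average_step` are hypothetical names introduced for the example, and how the failure is actually reported to the Hive is not covered.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Hypothetical outcome of a single optimization run.
#[derive(Debug, PartialEq)]
pub enum RunOutcome {
    Completed(f64),
    /// Carries the panic message so it can be reported upstream.
    Failed(String),
}

/// Runs `physics` and converts a panic into `RunOutcome::Failed` instead of
/// letting it unwind through the worker loop.
pub fn run_isolated<F>(physics: F) -> RunOutcome
where
    F: FnOnce() -> f64,
{
    // AssertUnwindSafe: this sketch assumes the closure does not leave shared
    // state half-updated when it panics.
    match panic::catch_unwind(AssertUnwindSafe(physics)) {
        Ok(value) => RunOutcome::Completed(value),
        Err(payload) => {
            // Panic payloads are usually a &str or a String.
            let message = if let Some(text) = payload.downcast_ref::<&str>() {
                text.to_string()
            } else if let Some(text) = payload.downcast_ref::<String>() {
                text.clone()
            } else {
                "unknown panic".to_string()
            };
            RunOutcome::Failed(message)
        }
    }
}

/// Toy physics step: integer division panics when `samples` is zero.
pub fn average_step(total: i64, samples: i64) -> f64 {
    (total / samples) as f64
}
```

For example, `run_isolated(|| average_step(10, 0))` yields a `RunOutcome::Failed` value while the calling thread keeps running. Note that `catch_unwind` only catches unwinding panics; it has no effect when the binary is built with `panic = "abort"`.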

3. Logic Failures

| Component | Failure Mode | Detection | Recovery Strategy |
| --- | --- | --- | --- |
| Scoring | Score Overflow | Score Type Bounds | Saturate. Cap at Score::MAX. Log warning. |
| Evolution | Stagnation (Local Minima) | patience counter | Reheat. Increase temperature to escape. If that fails, terminate early. |
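
Both logic-level strategies can be illustrated with a short sketch. Everything below is assumed for the example: the source only names `Score::MAX` and a patience counter, so the `u32` representation, the `StagnationTracker` type, the doubling of the temperature, and the reheat budget are hypothetical choices.

```rust
/// Hypothetical score newtype; the source names only `Score::MAX`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Score(pub u32);

impl Score {
    pub const MAX: Score = Score(u32::MAX);

    /// Adds `delta`, capping at `Score::MAX`. The flag reports whether the
    /// cap was hit.
    pub fn saturating_add(self, delta: u32) -> (Score, bool) {
        match self.0.checked_add(delta) {
            Some(value) => (Score(value), false),
            None => {
                // Stand-in for the project's logger (assumption).
                eprintln!("warning: score overflow, capped at Score::MAX");
                (Score::MAX, true)
            }
        }
    }
}

/// What the evolution loop should do after an iteration.
#[derive(Debug, PartialEq)]
pub enum Action {
    Continue,
    Reheat,
    Terminate,
}

/// Counts iterations without improvement and decides when to reheat or stop.
pub struct StagnationTracker {
    pub temperature: f64,
    patience: u32,
    max_reheats: u32,
    stale_rounds: u32,
    reheats_used: u32,
}

impl StagnationTracker {
    pub fn new(temperature: f64, patience: u32, max_reheats: u32) -> Self {
        Self { temperature, patience, max_reheats, stale_rounds: 0, reheats_used: 0 }
    }

    /// Records one iteration; `improved` says whether the best solution got better.
    pub fn observe(&mut self, improved: bool) -> Action {
        if improved {
            // Progress resets both counters (a design choice of this sketch).
            self.stale_rounds = 0;
            self.reheats_used = 0;
            return Action::Continue;
        }
        self.stale_rounds += 1;
        if self.stale_rounds < self.patience {
            return Action::Continue;
        }
        // Patience exhausted: reheat if budget remains, otherwise stop early.
        self.stale_rounds = 0;
        if self.reheats_used < self.max_reheats {
            self.reheats_used += 1;
            self.temperature *= 2.0; // doubling is an arbitrary choice
            Action::Reheat
        } else {
            Action::Terminate
        }
    }
}
```

With `StagnationTracker::new(1.0, 2, 1)`, two stale iterations trigger one reheat, and two further stale iterations end the run. Passing `improved` as a flag keeps the sketch independent of whether the real optimizer minimizes or maximizes its score.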