Observability & Telemetry
Version: 4.1 Context: Logging, Metrics, and Tracing standards.
1. Structured Logging (Tracing)
We use the tracing crate to emit structured events.
Log Levels
| Level | Usage | Example |
|---|---|---|
| ERROR | Operator intervention required. Data loss or Service outage. | Database connection failed, Worker panic. |
| WARN | Recoverable issues or degraded performance. | Rate limit exceeded, Job retrying. |
| INFO | Lifecycle events. | Server started, Job #123 completed. |
| DEBUG | State transitions and payload details. | User authenticated, Config loaded. |
| TRACE | Hot-path debugging. DISABLE IN PROD. | Score calculated: 10.5, Mutex acquired. |
Spans
Every request and background job must be wrapped in a Span.
http_request:{ method, path, request_id, user_id }job_execution:{ job_id, worker_id, attempt }annealing_loop:{ steps, start_temp }
2. Metrics (Prometheus)
We expose a /metrics endpoint for scraping.
Key Indicators (SLIs)
| Metric Name | Type | Description |
|---|---|---|
keyforge_jobs_active |
Gauge | Number of jobs currently processing. |
keyforge_jobs_total |
Counter | Total jobs submitted (labeled by status: success, failed). |
keyforge_score_latency_ns |
Histogram | Time taken to calculate a single layout score. |
keyforge_worker_heartbeats |
Counter | Liveness check for connected agents. |
3. Health Checks
- Liveness:
/health/live- Returns 200 if the HTTP server is up. - Readiness:
/health/ready- Returns 200 if DB is connected and migrations are applied.
4. Admin TUI (The Control Plane)
For real-time cluster management without a web browser, we provide a Terminal User Interface.
- Tool:
ratatui(Rust). - Binary:
keyforge-admin. - Connectivity: Connects via SSH tunnel or local socket to the Hive API.
Views
- Cluster Map: Visualizes active Worker Nodes, their CPU load, and current Job ID.
- Job Queue: Scrollable list of Pending/Running/Failed jobs with "Kill" and "Retry" actions.
- Log Stream: Tailing
tracinglogs with regex filtering.