
Observability & Telemetry

Version: 4.1
Context: Logging, Metrics, and Tracing standards.

1. Structured Logging (Tracing)

We use the tracing crate to emit structured events.

Log Levels

| Level | Usage | Example |
| --- | --- | --- |
| ERROR | Operator intervention required; data loss or service outage. | Database connection failed, Worker panic. |
| WARN | Recoverable issues or degraded performance. | Rate limit exceeded, Job retrying. |
| INFO | Lifecycle events. | Server started, Job #123 completed. |
| DEBUG | State transitions and payload details. | User authenticated, Config loaded. |
| TRACE | Hot-path debugging. DISABLE IN PROD. | Score calculated: 10.5, Mutex acquired. |
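
A minimal sketch of how these levels map onto the tracing macros with structured fields; the field names and values below are illustrative, not prescribed:

```rust
use tracing::{debug, error, info, trace, warn};

fn level_examples() {
    // ERROR: operator intervention required.
    error!(db_host = "primary", "database connection failed");
    // WARN: recoverable or degraded.
    warn!(job_id = 123, attempt = 2, "rate limit exceeded, retrying job");
    // INFO: lifecycle events.
    info!(port = 8080, "server started");
    // DEBUG: state transitions and payload details.
    debug!(user_id = 42, "user authenticated");
    // TRACE: hot-path debugging only; keep disabled in prod.
    trace!(score = 10.5, "score calculated");
}
```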

Spans

Every request and background job must be wrapped in a Span.

  • http_request: { method, path, request_id, user_id }
  • job_execution: { job_id, worker_id, attempt }
  • annealing_loop: { steps, start_temp }
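
A minimal sketch of wrapping a job in the job_execution span; handle_job and its parameters are illustrative names rather than part of this standard:

```rust
use tracing::{info, info_span};

fn handle_job(job_id: u64, worker_id: &str, attempt: u32) {
    // Shorthand field syntax captures the local variables by name.
    let span = info_span!("job_execution", job_id, worker_id, attempt);
    let _guard = span.enter();

    // Every event emitted while the guard is live carries the span's fields.
    info!("job started");
    // ... run the job ...
    info!("job completed");
}
```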

2. Metrics (Prometheus)

We expose a /metrics endpoint for scraping.

Key Indicators (SLIs)

| Metric Name | Type | Description |
| --- | --- | --- |
| keyforge_jobs_active | Gauge | Number of jobs currently processing. |
| keyforge_jobs_total | Counter | Total jobs submitted, labeled by status (success, failed). |
| keyforge_score_latency_ns | Histogram | Time taken to calculate a single layout score, in nanoseconds. |
| keyforge_worker_heartbeats | Counter | Liveness check for connected agents. |
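
A sketch of registering these SLIs with the prometheus crate and rendering them for the /metrics endpoint; the crate choice, the LazyLock statics, and the record_* helpers are assumptions, not mandated here:

```rust
use std::sync::LazyLock;

use prometheus::{
    register_histogram, register_int_counter, register_int_counter_vec, register_int_gauge,
    Encoder, Histogram, IntCounter, IntCounterVec, IntGauge, TextEncoder,
};

// Lazily registered against the default registry (LazyLock needs Rust 1.80+).
static JOBS_ACTIVE: LazyLock<IntGauge> = LazyLock::new(|| {
    register_int_gauge!("keyforge_jobs_active", "Number of jobs currently processing").unwrap()
});
static JOBS_TOTAL: LazyLock<IntCounterVec> = LazyLock::new(|| {
    register_int_counter_vec!("keyforge_jobs_total", "Total jobs submitted", &["status"]).unwrap()
});
static SCORE_LATENCY_NS: LazyLock<Histogram> = LazyLock::new(|| {
    // Note: default buckets are tuned for seconds; adjust them for nanosecond magnitudes.
    register_histogram!(
        "keyforge_score_latency_ns",
        "Time taken to calculate a single layout score (ns)"
    )
    .unwrap()
});
static WORKER_HEARTBEATS: LazyLock<IntCounter> = LazyLock::new(|| {
    register_int_counter!("keyforge_worker_heartbeats", "Liveness check for connected agents")
        .unwrap()
});

// Hypothetical helpers showing where the metrics would be updated.
fn record_job_outcome(success: bool) {
    JOBS_ACTIVE.dec();
    let status = if success { "success" } else { "failed" };
    JOBS_TOTAL.with_label_values(&[status]).inc();
}

fn record_score_latency(ns: f64) {
    SCORE_LATENCY_NS.observe(ns);
}

fn record_heartbeat() {
    WORKER_HEARTBEATS.inc();
}

// Render the default registry in the Prometheus text exposition format,
// i.e. the body served from /metrics.
fn render_metrics() -> String {
    let mut buf = Vec::new();
    TextEncoder::new()
        .encode(&prometheus::gather(), &mut buf)
        .unwrap();
    String::from_utf8(buf).unwrap()
}
```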

3. Health Checks

  • Liveness: /health/live - Returns 200 if the HTTP server is up.
  • Readiness: /health/ready - Returns 200 if DB is connected and migrations are applied.
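
A minimal sketch of the two endpoints, assuming an axum HTTP server (this document does not mandate the web framework); db_is_ready is a hypothetical helper:

```rust
use axum::{http::StatusCode, routing::get, Router};

// Hypothetical readiness probe: e.g. run `SELECT 1` and verify migrations are applied.
async fn db_is_ready() -> bool {
    true
}

// Liveness: 200 as long as the HTTP server can answer.
async fn live() -> StatusCode {
    StatusCode::OK
}

// Readiness: 200 only if the database is reachable and migrated.
async fn ready() -> StatusCode {
    if db_is_ready().await {
        StatusCode::OK
    } else {
        StatusCode::SERVICE_UNAVAILABLE
    }
}

fn health_routes() -> Router {
    Router::new()
        .route("/health/live", get(live))
        .route("/health/ready", get(ready))
}
```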

4. Admin TUI (The Control Plane)

For real-time cluster management without a web browser, we provide a Terminal User Interface.

  • Tool: ratatui (Rust).
  • Binary: keyforge-admin.
  • Connectivity: Connects via SSH tunnel or local socket to the Hive API.

Views

  1. Cluster Map: Visualizes active Worker Nodes, their CPU load, and current Job ID.
  2. Job Queue: Scrollable list of Pending/Running/Failed jobs with "Kill" and "Retry" actions.
  3. Log Stream: Tailing tracing logs with regex filtering.
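
A minimal ratatui layout sketch of the three views, assuming ratatui 0.26 with the crossterm backend; raw-mode setup, event handling, and live data are omitted and the widget contents are placeholders:

```rust
use std::io;

use ratatui::{
    backend::CrosstermBackend,
    layout::{Constraint, Direction, Layout},
    widgets::{Block, Borders, Paragraph},
    Terminal,
};

fn main() -> io::Result<()> {
    let mut terminal = Terminal::new(CrosstermBackend::new(io::stdout()))?;
    terminal.clear()?;

    terminal.draw(|frame| {
        // Split the screen into the three views: Cluster Map, Job Queue, Log Stream.
        let chunks = Layout::default()
            .direction(Direction::Vertical)
            .constraints([
                Constraint::Percentage(30),
                Constraint::Percentage(40),
                Constraint::Percentage(30),
            ])
            .split(frame.size());

        frame.render_widget(
            Paragraph::new("worker-01  cpu=42%  job=#123")
                .block(Block::default().title("Cluster Map").borders(Borders::ALL)),
            chunks[0],
        );
        frame.render_widget(
            Paragraph::new("#123 RUNNING   #124 PENDING   #120 FAILED")
                .block(Block::default().title("Job Queue").borders(Borders::ALL)),
            chunks[1],
        );
        frame.render_widget(
            Paragraph::new("INFO server started")
                .block(Block::default().title("Log Stream").borders(Borders::ALL)),
            chunks[2],
        );
    })?;

    Ok(())
}
```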