KeyForge Engineering Doctrine: LLM-Assisted Reliability

Version: 1.1
Date: 2025-12-27
Context: Mitigating the "High-Level Genius / Low-Level Hallucination" gap in AI-generated code.

1. The Problem Statement

Large Language Models (LLMs) excel at architectural synthesis but suffer from Happy Path Bias and Semantic Blindness at the implementation level.

The Gap: Code often compiles and passes basic tests but fails under boundary conditions or violates domain invariants (e.g., negative distances, infinite loops).
The Risk: "Silent Failure." The system produces plausible but mathematically incorrect optimization results, degrading the core value proposition without crashing.

2. Core Philosophy

"You cannot test quality into a product." — W. Edwards Deming

Testing is a measurement activity, not a construction activity. To ensure reliability in an LLM-generated codebase, we must move from Detection (catching bugs after they exist) to Prevention (making bugs unrepresentable).

3. Strategy: Coverage-Driven Verification (CDV)

We use code coverage tools (tarpaulin) not to gamify metrics, but to identify Logic Dark Matter—code branches generated by the LLM that have never been executed or verified.

Criticality Tiering

Tier 1 (The Nucleus): keyforge-physics, keyforge-evolution.
  Risk: Mathematical hallucination.
  Requirement: 95%+ Branch Coverage + Property Testing.
Tier 2 (The Contract): keyforge-protocol, keyforge-model.
  Risk: Data corruption.
  Requirement: 100% Validation Coverage.
Tier 3 (The Shell): keyforge-infra, keyforge-hive.
  Risk: IO/State failure.
  Requirement: Error Path Verification.

4. Architectural Guardrails

To constrain the LLM, we define a strict "grammar" of Types and Traits. The LLM acts as a translator; if the translation violates the grammar, the compiler rejects it.

A. Type-Driven Development (The Compiler Guardrail)

Replace primitives (usize, f32) with Semantic Newtypes to prevent logic errors (e.g., swapping rows/cols).

Before: fn dist(x1: f32, y1: f32, x2: f32, y2: f32) -> f32
After: fn dist(a: Point, b: Point) -> Distance
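A minimal sketch of the "After" signature, assuming a 2-D Point with f32 coordinates and a Distance wrapper around f32. The field layout and the value() accessor are illustrative assumptions, not the actual keyforge-physics types.

```rust
/// A 2-D position. Hypothetical layout for illustration.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Point {
    pub x: f32,
    pub y: f32,
}

/// A Euclidean distance. The inner field is private, so once this type lives
/// in its own module, callers outside it can only obtain one through `dist`.
#[derive(Debug, Clone, Copy, PartialEq, PartialOrd)]
pub struct Distance(f32);

impl Distance {
    /// Read-only access to the underlying number.
    pub fn value(self) -> f32 {
        self.0
    }
}

/// Euclidean distance between two points. There are no loose coordinates
/// left to pass in the wrong order.
pub fn dist(a: Point, b: Point) -> Distance {
    Distance((a.x - b.x).hypot(a.y - b.y))
}
```

With the four-float signature, dist(x1, x2, y1, y2) compiles and silently computes the wrong value; with Point arguments that mistake has no way to be written.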

B. The Validator Trait (The Runtime Guardrail)

Enforce validity at the constructor level. An object cannot exist in an invalid state.

Pattern: impl TryFrom<RawData> for ValidatedData
Effect: The LLM cannot write tests using broken mock data because the constructor rejects it: try_from returns an Err, so no ValidatedData value is ever produced from invalid input.
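A sketch of the pattern, assuming RawData carries a score and a key count. Both fields, the ValidationError enum, and the specific checks are hypothetical; only the TryFrom shape comes from the doctrine.

```rust
use std::convert::TryFrom;

/// Unchecked input as it arrives from outside (hypothetical fields).
#[derive(Debug)]
pub struct RawData {
    pub score: f32,
    pub key_count: usize,
}

/// Fields are private: once this type lives in its own module, the
/// `TryFrom` impl below is the only way to build one.
#[derive(Debug, PartialEq)]
pub struct ValidatedData {
    score: f32,
    key_count: usize,
}

/// Why a `RawData` was rejected (hypothetical variants).
#[derive(Debug, PartialEq)]
pub enum ValidationError {
    InvalidScore(f32),
    NoKeys,
}

impl TryFrom<RawData> for ValidatedData {
    type Error = ValidationError;

    fn try_from(raw: RawData) -> Result<Self, Self::Error> {
        // Reject NaN, infinities, and negative scores.
        if !raw.score.is_finite() || raw.score < 0.0 {
            return Err(ValidationError::InvalidScore(raw.score));
        }
        if raw.key_count == 0 {
            return Err(ValidationError::NoKeys);
        }
        Ok(ValidatedData { score: raw.score, key_count: raw.key_count })
    }
}

impl ValidatedData {
    pub fn score(&self) -> f32 {
        self.score
    }

    pub fn key_count(&self) -> usize {
        self.key_count
    }
}
```

Downstream functions that accept ValidatedData need no defensive checks, since holding a value is proof that the checks already ran.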

C. Property-Based Testing (The Logic Guardrail)

Test Invariants (Rules) instead of Instances (Values).

Instead of: assert_eq!(add(2,2), 4)
Use: proptest! { #[test] fn addition_is_commutative(a in any::<i32>(), b in any::<i32>()) { prop_assert_eq!(add(a, b), add(b, a)); } }
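The idea behind a property test can be sketched without any crates. The version below is an assumption-laden stand-in, not proptest: it swaps in a hand-rolled xorshift64 generator, does no shrinking of counterexamples, and uses a hypothetical add function. It only illustrates checking a rule over many generated inputs instead of one hand-picked value.

```rust
/// Small deterministic pseudo-random generator (xorshift64), used here only
/// so the sketch has no dependencies.
pub struct XorShift64(u64);

impl XorShift64 {
    pub fn new(seed: u64) -> Self {
        // The xorshift state must never be zero.
        XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Takes the high 32 bits and reinterprets them as a signed integer.
    pub fn next_i32(&mut self) -> i32 {
        (self.next_u64() >> 32) as i32
    }
}

/// Hypothetical function under test. Wrapping addition keeps it defined for
/// every pair of i32 inputs.
pub fn add(a: i32, b: i32) -> i32 {
    a.wrapping_add(b)
}

/// Checks `f(a, b) == f(b, a)` for `cases` generated pairs and returns the
/// first counterexample found, if any.
pub fn check_commutative<F>(f: F, cases: u32, seed: u64) -> Result<(), (i32, i32)>
where
    F: Fn(i32, i32) -> i32,
{
    let mut rng = XorShift64::new(seed);
    for _ in 0..cases {
        let (a, b) = (rng.next_i32(), rng.next_i32());
        if f(a, b) != f(b, a) {
            return Err((a, b));
        }
    }
    Ok(())
}
```

Passing a subtraction closure to check_commutative yields a counterexample pair, which is the kind of input a single assert_eq!(add(2, 2), 4) would never exercise.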

5. Research-Backed Protocols (v1.1)

Incorporating findings from Shinn et al. (2023) and Dhuliawala et al. (2023).

A. The Reflexion Loop

When a property test fails, we do not simply ask for a fix. We enforce a reflection step:

Input: Failing test case + Code.
Task: "Explain the logic error in natural language."
Action: "Generate the fix based on that explanation."

Goal: Break the loop of blind retries.
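One way to make the reflection step hard to skip is to encode it in the prompt-building code. This is a hypothetical sketch: Failure, Explanation, and both prompt functions are invented for illustration, and the prompt wording beyond the two quoted tasks is an assumption.

```rust
/// Input to the loop: the failing test case plus the code under test.
pub struct Failure<'a> {
    pub failing_case: &'a str,
    pub code: &'a str,
}

/// A non-empty natural-language explanation of the logic error.
pub struct Explanation(String);

impl Explanation {
    /// Rejects blank explanations so the reflection step cannot be a no-op.
    pub fn parse(text: &str) -> Option<Self> {
        let trimmed = text.trim();
        if trimmed.is_empty() {
            None
        } else {
            Some(Explanation(trimmed.to_string()))
        }
    }
}

/// Step 1 (Task): ask only for an explanation, not for a fix.
pub fn reflection_prompt(failure: &Failure) -> String {
    format!(
        "Failing test case:\n{}\n\nCode:\n{}\n\nExplain the logic error in natural language.",
        failure.failing_case, failure.code
    )
}

/// Step 2 (Action): requires an `Explanation`, so a fix cannot be requested
/// before the reflection step has produced one.
pub fn fix_prompt(failure: &Failure, why: &Explanation) -> String {
    format!(
        "Code:\n{}\n\nExplanation of the error:\n{}\n\nGenerate the fix based on that explanation.",
        failure.code, why.0
    )
}
```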

B. Chain-of-Verification (CoVe)

For critical Tier 1 logic, the generation process is split:

Draft: Generate the function.
Verify: Generate 3 edge cases and work out the expected result of each by hand, independently of the draft.
Final: Refine the code to satisfy the verification cases.
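The three steps can be kept together as a table of hand-calculated cases next to the draft. The function below (total_travel, a 1-D path length) is a hypothetical stand-in for Tier 1 logic, chosen because its expected values are easy to compute by hand.

```rust
/// Draft: total distance travelled along a 1-D path of positions.
pub fn total_travel(positions: &[f32]) -> f32 {
    positions.windows(2).map(|w| (w[1] - w[0]).abs()).sum()
}

/// Verify: three edge cases whose expected values were worked out by hand,
/// not obtained by running the draft.
pub fn verification_cases() -> Vec<(&'static str, Vec<f32>, f32)> {
    vec![
        ("empty path", vec![], 0.0),
        ("single position", vec![2.0], 0.0),
        // |3 - 0| + |0 - 3| = 6
        ("out and back", vec![0.0, 3.0, 0.0], 6.0),
    ]
}

/// Final: names of the cases the draft does not satisfy. An empty result
/// means the draft agrees with every hand calculation.
pub fn failed_cases() -> Vec<&'static str> {
    verification_cases()
        .into_iter()
        .filter(|case| (total_travel(&case.1) - case.2).abs() > 1e-6)
        .map(|case| case.0)
        .collect()
}
```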

C. Invariant Documentation (Formal Verification Lite)

We explicitly document invariants using Kani-style assertions in comments, forcing the LLM to "think" in formal proofs.

Example: // INVARIANT: kani::assert(score >= 0.0, "score must be non-negative");
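Until a real Kani harness exists, the documented invariants can be mirrored as debug assertions so they are at least checked in debug builds. This is a sketch under assumptions: the score function is hypothetical, and debug_assert! is a runtime check, not a proof.

```rust
/// Hypothetical scoring function: the sum of squared penalties.
pub fn score(penalties: &[f32]) -> f32 {
    // INVARIANT (pre): every penalty is finite.
    debug_assert!(
        penalties.iter().all(|p| p.is_finite()),
        "penalties must be finite"
    );

    let total: f32 = penalties.iter().map(|p| p * p).sum();

    // INVARIANT (post): total >= 0.0, since it is a sum of squares.
    debug_assert!(total >= 0.0, "score must be non-negative");
    total
}
```

The comments state what must hold and the assertions check it on every debug-build call. Release builds compile the checks out.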

6. Structural Patterns

A. Functional Core, Imperative Shell

Isolate pure logic from the messy real world.

The Core (physics): Pure functions only. Deterministic. No IO.
The Shell (infra): Handles Files, Network, Time.
Benefit: If the Core fails, it is a logic bug and reproducible from its inputs alone. If the Shell fails, the cause is most likely environmental.
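A small sketch of the split, assuming a hypothetical text metric (count_repeats) as the Core and a file-reading wrapper as the Shell. Neither function is taken from the KeyForge crates.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Core: pure and deterministic. Counts positions where a character is
/// immediately repeated (for example "ll" in "hello").
pub fn count_repeats(text: &str) -> usize {
    let chars: Vec<char> = text.chars().collect();
    chars.windows(2).filter(|w| w[0] == w[1]).count()
}

/// Shell: performs the IO, then delegates to the Core. The only failures
/// that can originate here are IO errors.
pub fn count_repeats_in_file(path: &Path) -> io::Result<usize> {
    let text = fs::read_to_string(path)?;
    Ok(count_repeats(&text))
}
```

Tests and property checks target count_repeats directly with in-memory strings. The Shell stays thin enough that its failure modes are limited to the io::Error cases.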

B. "Parse, Don't Validate"

Do not check for validity deep in the call stack. Parse data into a valid type at the boundary.

Weak: Checking if list.is_empty() inside a math loop.
Strong: Accepting NonEmptyVec<T> as the argument.
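A minimal NonEmptyVec sketch, assuming a head-plus-tail representation. The parse constructor and the mean function are illustrative, and the actual type used in KeyForge may differ.

```rust
/// A vector guaranteed to hold at least one element.
#[derive(Debug, Clone, PartialEq)]
pub struct NonEmptyVec<T> {
    head: T,
    tail: Vec<T>,
}

impl<T> NonEmptyVec<T> {
    /// Parse at the boundary: the only place where emptiness is checked.
    pub fn parse(items: Vec<T>) -> Option<Self> {
        let mut iter = items.into_iter();
        let head = iter.next()?;
        Some(NonEmptyVec { head, tail: iter.collect() })
    }

    /// Always at least 1.
    pub fn len(&self) -> usize {
        1 + self.tail.len()
    }

    pub fn iter(&self) -> impl Iterator<Item = &T> {
        std::iter::once(&self.head).chain(self.tail.iter())
    }
}

/// No is_empty() check inside the math: dividing by a zero length cannot
/// happen because the argument type rules it out.
pub fn mean(values: &NonEmptyVec<f32>) -> f32 {
    values.iter().sum::<f32>() / values.len() as f32
}
```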

C. Shadow Execution (The Oracle Pattern)

Maintain a "Simple, Slow, Correct" reference implementation (DeterministicScorer) alongside the "Fast, Complex" optimized version (ScoringEngine).

Verification: Run both in tests. If Fast != Slow, the optimization is flawed.
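The pattern in miniature, using a hypothetical cost function (the sum of pairwise absolute differences) in place of the real scorers. The free functions below are stand-ins; the actual DeterministicScorer and ScoringEngine interfaces are not shown in this document.

```rust
/// "Simple, Slow, Correct": compare every pair directly, O(n^2).
pub fn pairwise_cost_slow(xs: &[i64]) -> i64 {
    let mut total = 0;
    for i in 0..xs.len() {
        for j in (i + 1)..xs.len() {
            total += (xs[i] - xs[j]).abs();
        }
    }
    total
}

/// "Fast, Complex": after sorting, the element at index i is added i times
/// and subtracted (n - 1 - i) times, giving O(n log n) overall.
pub fn pairwise_cost_fast(xs: &[i64]) -> i64 {
    let mut sorted = xs.to_vec();
    sorted.sort_unstable();
    let n = sorted.len() as i64;
    sorted
        .iter()
        .enumerate()
        .map(|(i, &x)| x * (2 * i as i64 - n + 1))
        .sum()
}

/// Shadow check: index of the first input on which the two versions
/// disagree, or None if they agree everywhere.
pub fn first_divergence(inputs: &[Vec<i64>]) -> Option<usize> {
    inputs
        .iter()
        .position(|xs| pairwise_cost_slow(xs) != pairwise_cost_fast(xs))
}
```

Integer arithmetic keeps the comparison exact here. For floating-point scores the equality would likely need a tolerance.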

7. Execution Plan

Baseline: Generate Tarpaulin coverage report to map "Ignorance."
Target: keyforge-physics (Highest Risk).
Action: Apply Type Guardrails, Property Tests, and Reflexion Loops to uncovered regions.

1. The Typestate Pattern (The "Compiler State Machine")

The Problem: LLMs struggle with temporal coupling (e.g., "You must call init() before process()"). They frequently write code that accesses data before it is ready.
The Fix: Encode the state in the Type System, not in boolean flags.
The Work Product: Instead of a single Job struct with an Option, define distinct types for each state.

```rust
// Bad (LLM Trap):
struct Job {
    id: String,
    status: String,      // "pending", "running"
    result: Option<f32>, // LLM forgets to check if this is None
}

// Good (LLM Railing):
struct PendingJob { id: String, config: Config }
struct RunningJob { id: String, start_time: Instant }
struct CompletedJob { id: String, score: f32 }

// The Transition (The only way to move forward)
impl PendingJob { fn start(self) -> RunningJob { ... } }
impl RunningJob { fn finish(self, score: f32) -> CompletedJob { ... } }
```

Why it works: The LLM cannot write code that accesses the score of a pending job. The compiler prevents the hallucination. It forces the LLM to follow the linear flow of the system.

2. The "Parameter Object" Pattern (Context Structs)

The Problem: "Argument Swapping." In keyforge-physics, you have functions taking multiple u16 or f32 arguments. Even with Newtypes, LLMs struggle with functions that take 5+ arguments.
The Fix: Group cohesive arguments into a Context Struct.
The Work Product:

```rust
// libs/keyforge-physics/src/kernel/types.rs

// Instead of passing (layout, corpus, rubric, weights) everywhere:
pub struct ScoringContext<'a> {
    pub layout: &'a Layout,
    pub corpus: &'a Corpus,
    pub rubric: &'a Rubric,
}

// Function signature becomes simple:
fn calculate_score(ctx: ScoringContext) -> Score { ... }
```

Why it works:
Token Efficiency: Reduces repetition in function signatures.
Safety: You construct the Context once (validated). The LLM just passes it around. It cannot accidentally pass corpus where rubric is expected because they are fields, not positional arguments.

3. The Command Pattern (Reified Actions)

The Problem: In keyforge-hive, business logic often gets mixed into HTTP handlers (axum). This makes it hard to test and hard for the LLM to reason about "what happens" vs "how it is served."
The Fix: Decouple the Intent from the Execution.
The Work Product: Define actions as data structures (Commands), not functions.

```rust
// libs/keyforge-core/src/commands.rs

pub enum HiveCommand {
    RegisterJob(JobRequest),
    CancelJob(String),
    SubmitResult(ResultSubmission),
}

// The Handler (Pure Logic)
pub fn handle_command(cmd: HiveCommand, state: &mut AppState) -> Result {
    match cmd {
        HiveCommand::RegisterJob(req) => {
            // Pure logic: Update state, return event
        }
        // ...
    }
}
```

Why it works:
Isolation: You can ask the LLM to "Implement the handler for RegisterJob" without pasting any Axum/HTTP code.
Testability: You can test the logic by constructing a HiveCommand value, without spinning up a server.

4. The Centralized Error Registry (The "Failure Catalog")

The Problem: LLMs are lazy with errors. They default to anyhow::anyhow!("error") or unwrap(). This makes production debugging a nightmare.
The Fix: A strict, enumerated catalog of every possible failure state.
The Work Product: A dedicated errors.rs in keyforge-model or keyforge-protocol.

```rust
// libs/keyforge-protocol/src/errors.rs
use thiserror::Error;

#[derive(Error, Debug)]
pub enum ForgeError {
    #[error("Physics Violation: Score {0} is negative")]
    PhysicsViolation(f32),

    #[error("Invalid Layout: Key count {0} != Geometry {1}")]
    LayoutMismatch(usize, usize),

    #[error("Stale Job: Job {0} was cancelled")]
    StaleJob(String),
}
```

The Protocol:
Constraint: "You are forbidden from using anyhow! or unwrap(). You must return a Result. If a specific error variant does not exist, ask me to add it to the Registry."
Why it works: It forces the LLM to categorize the error. It cannot just "bail out"; it has to think about what went wrong, which often helps it realize the logic is flawed.

Summary of the "Robust" Architecture

Typestate: Make invalid sequences impossible (Pending -> Running).
Parameter Objects: Make argument swapping impossible (ScoringContext).
Command Pattern: Make logic independent of the web server (HiveCommand).
Error Registry: Make lazy error handling impossible (ForgeError).

These elements reduce the Cognitive Load on the LLM. It doesn't have to remember "is this job running?" or "which argument is the corpus?" because the Types tell it.
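The Typestate transitions can be exercised end to end. The sketch below fills in the transition bodies under assumptions: Config is a unit placeholder, and PendingJob::new and run_to_completion are hypothetical helpers added for the demonstration.

```rust
use std::time::Instant;

/// Placeholder configuration; the real Config is not shown in this document.
#[derive(Debug, Default)]
pub struct Config;

pub struct PendingJob {
    pub id: String,
    pub config: Config,
}

pub struct RunningJob {
    pub id: String,
    pub start_time: Instant,
}

pub struct CompletedJob {
    pub id: String,
    pub score: f32,
}

impl PendingJob {
    /// Hypothetical constructor for the demonstration.
    pub fn new(id: &str) -> Self {
        PendingJob { id: id.to_string(), config: Config }
    }

    /// Consumes the pending job, so the old value cannot be reused.
    pub fn start(self) -> RunningJob {
        RunningJob { id: self.id, start_time: Instant::now() }
    }
}

impl RunningJob {
    /// Consumes the running job and records its score.
    pub fn finish(self, score: f32) -> CompletedJob {
        CompletedJob { id: self.id, score }
    }
}

/// Drives one job through its whole lifecycle.
pub fn run_to_completion(job: PendingJob, score: f32) -> CompletedJob {
    let running = job.start();
    // Calling `job.start()` again here would not compile: `job` was moved.
    // Reading `running.score` would not compile: RunningJob has no such field.
    running.finish(score)
}
```

Because start and finish take self by value, each state can be left exactly once. The out-of-order access described in the problem statement becomes a compile error instead of a runtime None.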