KeyForge Engineering Doctrine: LLM-Assisted Reliability
Version: 1.1
Date: 2025-12-27
Context: Mitigating the "High-Level Genius / Low-Level Hallucination" gap in AI-generated code.
1. The Problem Statement
Large Language Models (LLMs) excel at architectural synthesis but suffer from Happy Path Bias and Semantic Blindness at the implementation level.
- The Gap: Code often compiles and passes basic tests but fails under boundary conditions or violates domain invariants (e.g., negative distances, infinite loops).
- The Risk: "Silent Failure." The system produces plausible but mathematically incorrect optimization results, degrading the core value proposition without crashing.
2. Core Philosophy
"You cannot test quality into a product." — W. Edwards Deming
Testing is a measurement activity, not a construction activity. To ensure reliability in an LLM-generated codebase, we must move from Detection (catching bugs after they exist) to Prevention (making bugs unrepresentable).
3. Strategy: Coverage-Driven Verification (CDV)
We use code coverage tools (tarpaulin) not to gamify metrics, but to identify Logic Dark Matter—code branches generated by the LLM that have never been executed or verified.
Criticality Tiering
- Tier 1 (The Nucleus): keyforge-physics, keyforge-evolution.
  - Risk: Mathematical hallucination.
  - Requirement: 95%+ Branch Coverage + Property Testing.
- Tier 2 (The Contract): keyforge-protocol, keyforge-model.
  - Risk: Data corruption.
  - Requirement: 100% Validation Coverage.
- Tier 3 (The Shell): keyforge-infra, keyforge-hive.
  - Risk: IO/State failure.
  - Requirement: Error Path Verification.
4. Architectural Guardrails
To constrain the LLM, we define a strict "grammar" of Types and Traits. The LLM acts as a translator; if the translation violates the grammar, the compiler rejects it.
A. Type-Driven Development (The Compiler Guardrail)
Replace primitives (usize, f32) with Semantic Newtypes to prevent logic errors (e.g., swapping rows/cols).
- Before: fn dist(x1: f32, y1: f32, x2: f32, y2: f32)
- After: fn dist(a: Point, b: Point) -> Distance
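A minimal sketch of the "After" signature, assuming Point is a plain x/y pair and Distance wraps a private f32. The field names and the value() accessor are illustrative assumptions, not the actual keyforge-physics definitions.

```rust
/// A 2D position. Field names are assumptions made for this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Point {
    pub x: f32,
    pub y: f32,
}

/// A length. The inner field is private, so code outside this module
/// cannot fabricate a Distance from an arbitrary f32.
#[derive(Debug, Clone, Copy, PartialEq, PartialOrd)]
pub struct Distance(f32);

impl Distance {
    /// Read-only access to the underlying value.
    pub fn value(self) -> f32 {
        self.0
    }
}

/// Euclidean distance. The result is never negative
/// (NaN coordinates would still propagate as NaN).
pub fn dist(a: Point, b: Point) -> Distance {
    let dx = a.x - b.x;
    let dy = a.y - b.y;
    Distance((dx * dx + dy * dy).sqrt())
}
```

With four bare f32 arguments, swapping y1 and x2 still compiles. With two Point arguments the only possible mix-up is the order of a and b, which does not change a distance.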
B. The Validator Trait (The Runtime Guardrail)
Enforce validity at the constructor level. An object cannot exist in an invalid state.
- Pattern: impl TryFrom<RawData> for ValidatedData
- Effect: The LLM cannot write tests using broken mock data, because the constructor returns an Err instead of producing an invalid value.
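A hedged sketch of the pattern. The score field and the ValidationError variants are assumptions chosen for illustration. Note that a TryFrom constructor reports bad input as an Err; a test that calls unwrap() on that Err is what would then panic.

```rust
use std::convert::TryFrom;

/// Unvalidated input, e.g. straight from deserialization (hypothetical shape).
pub struct RawData {
    pub score: f32,
}

/// The field is private, so outside this module the only way to obtain a
/// ValidatedData is through TryFrom.
#[derive(Debug, PartialEq)]
pub struct ValidatedData {
    score: f32,
}

/// Reasons a RawData can be rejected (assumed for this sketch).
#[derive(Debug, PartialEq)]
pub enum ValidationError {
    NotFinite,
    Negative(f32),
}

impl TryFrom<RawData> for ValidatedData {
    type Error = ValidationError;

    fn try_from(raw: RawData) -> Result<Self, Self::Error> {
        if !raw.score.is_finite() {
            return Err(ValidationError::NotFinite);
        }
        if raw.score < 0.0 {
            return Err(ValidationError::Negative(raw.score));
        }
        Ok(ValidatedData { score: raw.score })
    }
}

impl ValidatedData {
    /// Always finite and non-negative, by construction.
    pub fn score(&self) -> f32 {
        self.score
    }
}
```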
C. Property-Based Testing (The Logic Guardrail)
Test Invariants (Rules) instead of Instances (Values).
- Instead of: assert_eq!(add(2,2), 4)
- Use: proptest! { #[test] fn addition_is_commutative(a: i32, b: i32) { prop_assert_eq!(add(a, b), add(b, a)); } }
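To show what an invariant check does without pulling in a crate, here is a dependency-free sketch: a small xorshift generator feeds many input pairs into the property. This is a hand-rolled stand-in, not proptest itself (it has no shrinking and no configurable strategies), and the saturating add is a hypothetical function under test.

```rust
/// Function under test (hypothetical): addition that saturates instead of overflowing.
fn add(a: i32, b: i32) -> i32 {
    a.saturating_add(b)
}

/// Tiny xorshift64 generator so the sketch needs no external crates.
struct XorShift(u64);

impl XorShift {
    fn next_i32(&mut self) -> i32 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        // Take the high 32 bits and reinterpret them as a signed value.
        (self.0 >> 32) as i32
    }
}

/// Checks commutativity on `cases` generated pairs.
/// Returns the first counterexample, or None if the property held.
fn check_commutative(cases: usize) -> Option<(i32, i32)> {
    let mut rng = XorShift(0x9E37_79B9_7F4A_7C15);
    for _ in 0..cases {
        let (a, b) = (rng.next_i32(), rng.next_i32());
        if add(a, b) != add(b, a) {
            return Some((a, b));
        }
    }
    None
}
```

The rule under test is "order of arguments does not matter", which covers inputs no hand-written instance would list.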
5. Research-Backed Protocols (v1.1)
Incorporating findings from Shinn et al. (2023) and Dhuliawala et al. (2023).
A. The Reflexion Loop
When a property test fails, we do not simply ask for a fix. We enforce a reflection step:
- Input: Failing test case + Code.
- Task: "Explain the logic error in natural language."
- Action: "Generate the fix based on that explanation."
Goal: Break the loop of blind retries.
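The loop above can be sketched as a bounded retry in which every retry must pass through an explanation first. The Model trait, its method names, and StubModel are all hypothetical; no real LLM client API is implied.

```rust
/// Outcome of one verification run: Ok, or the failing case rendered as text.
type TestResult = Result<(), String>;

/// Hypothetical interface to the code-generating model.
pub trait Model {
    /// Task: explain the logic error in natural language.
    fn explain(&mut self, code: &str, failing_case: &str) -> String;
    /// Action: generate the fix based on that explanation.
    fn fix(&mut self, code: &str, explanation: &str) -> String;
}

/// Runs at most `max_rounds` explain-then-fix rounds.
/// Returns the passing code, or the last failing case.
pub fn reflexion_loop<M: Model>(
    model: &mut M,
    mut code: String,
    run_tests: impl Fn(&str) -> TestResult,
    max_rounds: usize,
) -> Result<String, String> {
    for _ in 0..max_rounds {
        match run_tests(&code) {
            Ok(()) => return Ok(code),
            Err(failing_case) => {
                // No fix is requested without an explanation to base it on.
                let explanation = model.explain(&code, &failing_case);
                code = model.fix(&code, &explanation);
            }
        }
    }
    // Check the code produced by the final round.
    run_tests(&code).map(|()| code)
}

/// Stand-in model used only to exercise the loop.
pub struct StubModel {
    pub explanations: usize,
}

impl Model for StubModel {
    fn explain(&mut self, _code: &str, failing_case: &str) -> String {
        self.explanations += 1;
        format!("The operator is wrong: {}", failing_case)
    }

    fn fix(&mut self, _code: &str, _explanation: &str) -> String {
        "a + b".to_string()
    }
}
```

The fix step only ever receives the explanation, never the raw failure, which is what prevents a blind retry.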
B. Chain-of-Verification (CoVe)
For critical Tier 1 logic, the generation process is split:
- Draft: Generate the function.
- Verify: Generate 3 edge cases and manually calculate the expected result.
- Final: Refine the code to satisfy the verification cases.
C. Invariant Documentation (Formal Verification Lite)
We explicitly document invariants using Kani-style assertions in comments, forcing the LLM to "think" in formal proofs.
- Example:
// INVARIANT: kani::assume(score >= 0.0);
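One way to keep such comments honest is to mirror each one with a debug_assert!, so the documented invariant is also checked in debug builds. The sketch below assumes a hypothetical normalize_score function and a score range of [0, 1]; the Kani calls appear only in comments, as in the convention above.

```rust
/// Clamps a raw score into the unit interval.
///
// INVARIANT (precondition):  kani::assume(!raw.is_nan());
// INVARIANT (postcondition): result >= 0.0 && result <= 1.0
pub fn normalize_score(raw: f32) -> f32 {
    // Mirrors the documented precondition.
    debug_assert!(!raw.is_nan(), "precondition violated: raw score is NaN");

    let result = raw.clamp(0.0, 1.0);

    // Mirrors the documented postcondition.
    debug_assert!((0.0..=1.0).contains(&result));
    result
}
```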
6. Structural Patterns
A. Functional Core, Imperative Shell
Isolate pure logic from the messy real world.
- The Core: (physics) Pure functions only. Deterministic. No IO.
- The Shell: (infra) Handles Files, Network, Time.
- Benefit: If the Core fails, it is a logic bug (reproducible). If the Shell fails, it is an environment issue.
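A small sketch of the split, assuming a hypothetical task of averaging scores read one per line. The function names are illustrative; the point is that the core never touches a reader and the shell never does arithmetic.

```rust
use std::io::{self, BufRead};

/// Core: pure and deterministic. No IO, no clock, no globals.
pub fn mean_score(values: &[f32]) -> Option<f32> {
    if values.is_empty() {
        return None;
    }
    Some(values.iter().sum::<f32>() / values.len() as f32)
}

/// Shell: owns reading and parsing, then delegates to the core.
/// Every failure surfaced here is an environment or input problem.
pub fn mean_score_from_reader<R: BufRead>(reader: R) -> io::Result<Option<f32>> {
    let mut values = Vec::new();
    for line in reader.lines() {
        let line = line?;
        let trimmed = line.trim();
        if trimmed.is_empty() {
            continue;
        }
        let value: f32 = trimmed
            .parse()
            .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;
        values.push(value);
    }
    Ok(mean_score(&values))
}
```

Tests for the core need only slices; tests for the shell can pass a byte slice as the reader instead of a real file.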
B. "Parse, Don't Validate"
Do not check for validity deep in the call stack. Parse data into a valid type at the boundary.
- Weak: Checking if list.is_empty() inside a math loop.
- Strong: Accepting NonEmptyVec<T> as the argument.
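A minimal, hypothetical NonEmptyVec<T> to make the "Strong" variant concrete. The parse constructor and the average function are assumptions for this sketch; a real codebase might use an existing crate instead.

```rust
/// A vector guaranteed to hold at least one element.
pub struct NonEmptyVec<T> {
    head: T,
    tail: Vec<T>,
}

impl<T> NonEmptyVec<T> {
    /// The boundary: the only place where emptiness is checked.
    pub fn parse(items: Vec<T>) -> Option<Self> {
        let mut iter = items.into_iter();
        let head = iter.next()?;
        Some(NonEmptyVec { head, tail: iter.collect() })
    }

    /// Always at least 1.
    pub fn len(&self) -> usize {
        1 + self.tail.len()
    }

    pub fn iter(&self) -> impl Iterator<Item = &T> {
        std::iter::once(&self.head).chain(self.tail.iter())
    }
}

/// No emptiness check in the math: a zero-length input cannot be constructed.
pub fn average(values: &NonEmptyVec<f32>) -> f32 {
    values.iter().sum::<f32>() / values.len() as f32
}
```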
C. Shadow Execution (The Oracle Pattern)
Maintain a "Simple, Slow, Correct" reference implementation (DeterministicScorer) alongside the "Fast, Complex" optimized version (ScoringEngine).
- Verification: Run both in tests. If Fast != Slow, the optimization is flawed.
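A sketch of the oracle pattern on a toy metric: the sum of absolute differences over all pairs. The metric is an assumption chosen because it has an obvious quadratic form and a less obvious sorted form; the real DeterministicScorer and ScoringEngine compute something else.

```rust
/// Reference implementation: simple, slow, obviously correct. O(n^2).
pub struct DeterministicScorer;

impl DeterministicScorer {
    pub fn score(xs: &[i64]) -> i64 {
        let mut total: i64 = 0;
        for i in 0..xs.len() {
            for j in (i + 1)..xs.len() {
                total += (xs[i] - xs[j]).abs();
            }
        }
        total
    }
}

/// Optimized implementation: sort once, then weight each element. O(n log n).
pub struct ScoringEngine;

impl ScoringEngine {
    pub fn score(xs: &[i64]) -> i64 {
        let mut sorted = xs.to_vec();
        sorted.sort_unstable();
        let n = sorted.len() as i64;
        // In sorted order, element i is added i times and subtracted (n - 1 - i) times.
        sorted
            .iter()
            .enumerate()
            .map(|(i, x)| x * (2 * i as i64 - n + 1))
            .sum()
    }
}
```

Integer arithmetic keeps the comparison exact. With floating-point scores, Fast != Slow would need a tolerance rather than strict equality.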
7. Execution Plan
- Baseline: Generate Tarpaulin coverage report to map "Ignorance."
- Target: keyforge-physics (Highest Risk).
- Action: Apply Type Guardrails, Property Tests, and Reflexion Loops to uncovered regions.
1. The Typestate Pattern (The "Compiler State Machine")
The Problem: LLMs struggle with temporal coupling (e.g., "You must call init() before process()"). They frequently write code that accesses data before it is ready.
The Fix: Encode the state in the Type System, not in boolean flags.
The Work Product:
Instead of a single Job struct with an Option
// Good (LLM Railing):
struct PendingJob { id: String, config: Config }
struct RunningJob { id: String, start_time: Instant }
struct CompletedJob { id: String, score: f32 }
// The Transition (The only way to move forward)
impl PendingJob { fn start(self) -> RunningJob { ... } }
impl RunningJob { fn finish(self, score: f32) -> CompletedJob { ... } }
Why it works: The LLM cannot write code that accesses the score of a pending job. The compiler prevents the hallucination. It forces the LLM to follow the linear flow of the system.
2. The "Parameter Object" Pattern (Context Structs)
The Problem: "Argument Swapping." In physics, you have functions taking multiple u16 or f32 arguments. Even with Newtypes, LLMs struggle with functions that take 5+ arguments.
The Fix: Group cohesive arguments into a Context Struct.
The Work Product:
// libs/keyforge-physics/src/kernel/types.rs
// Instead of passing (layout, corpus, rubric, weights) everywhere:
pub struct ScoringContext<'a> {
    pub layout: &'a Layout,
    pub corpus: &'a Corpus,
    pub rubric: &'a Rubric,
}
// Function signature becomes simple:
fn calculate_score(ctx: ScoringContext) -> Score { ... }
Why it works:
- Token Efficiency: Reduces repetition in function signatures.
- Safety: You construct the Context once (validated). The LLM just passes it around. It cannot accidentally pass corpus where rubric is expected because they are fields, not positional arguments.
3. The Command Pattern (Reified Actions)
The Problem: In keyforge-hive, business logic often gets mixed into HTTP handlers (axum). This makes it hard to test and hard for the LLM to reason about "what happens" vs "how it is served."
The Fix: Decouple the Intent from the Execution.
The Work Product: Define actions as data structures (Commands), not functions.
// libs/keyforge-core/src/commands.rs
pub enum HiveCommand {
    RegisterJob(JobRequest),
    CancelJob(String),
    SubmitResult(ResultSubmission),
}
// The Handler (Pure Logic)
pub fn handle_command(cmd: HiveCommand, state: &mut AppState) -> Result
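The handler signature above is cut off in this copy, so the following is only a sketch of what a pure handler could look like. Everything beyond HiveCommand and the handle_command name is assumed: the fields of JobRequest and ResultSubmission, the AppState layout, JobStatus, CommandError, and the Result<(), CommandError> return type.

```rust
use std::collections::HashMap;

// Hypothetical payload shapes.
pub struct JobRequest {
    pub id: String,
}

pub struct ResultSubmission {
    pub job_id: String,
    pub score: f32,
}

pub enum HiveCommand {
    RegisterJob(JobRequest),
    CancelJob(String),
    SubmitResult(ResultSubmission),
}

#[derive(Debug, PartialEq)]
pub enum JobStatus {
    Pending,
    Cancelled,
    Done(f32),
}

#[derive(Default)]
pub struct AppState {
    pub jobs: HashMap<String, JobStatus>,
}

#[derive(Debug, PartialEq)]
pub enum CommandError {
    UnknownJob(String),
    DuplicateJob(String),
}

/// Pure logic: no HTTP types, no IO. It only maps (command, state) to a new state.
pub fn handle_command(cmd: HiveCommand, state: &mut AppState) -> Result<(), CommandError> {
    match cmd {
        HiveCommand::RegisterJob(req) => {
            if state.jobs.contains_key(&req.id) {
                return Err(CommandError::DuplicateJob(req.id));
            }
            state.jobs.insert(req.id, JobStatus::Pending);
            Ok(())
        }
        HiveCommand::CancelJob(id) => match state.jobs.get_mut(&id) {
            Some(status) => {
                *status = JobStatus::Cancelled;
                Ok(())
            }
            None => Err(CommandError::UnknownJob(id)),
        },
        HiveCommand::SubmitResult(sub) => match state.jobs.get_mut(&sub.job_id) {
            Some(status) => {
                *status = JobStatus::Done(sub.score);
                Ok(())
            }
            None => Err(CommandError::UnknownJob(sub.job_id)),
        },
    }
}
```

An HTTP handler would then only deserialize a request into a HiveCommand and call handle_command, so the business rules can be tested without starting a server.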
use thiserror::Error;
#[derive(Error, Debug)]
pub enum ForgeError {
    #[error("Physics Violation: Score {0} is negative")]
    PhysicsViolation(f32),
    #[error("Invalid Layout: Key count {0} != Geometry {1}")]
    LayoutMismatch(usize, usize),
    #[error("Stale Job: Job {0} was cancelled")]
    StaleJob(String),
}
The Protocol:
Constraint: "You are forbidden from using anyhow! or unwrap(). You must return a Result
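A sketch of code written under that constraint. To stay dependency-free it spells out Display by hand rather than deriving it, and it covers only two of the ForgeError variants. The helper names and the choice of f32 as the success type are assumptions.

```rust
use std::fmt;

#[derive(Debug, PartialEq)]
pub enum ForgeError {
    PhysicsViolation(f32),
    LayoutMismatch(usize, usize),
}

// Hand-written equivalent of the derived error messages.
impl fmt::Display for ForgeError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ForgeError::PhysicsViolation(score) => {
                write!(f, "Physics Violation: Score {} is negative", score)
            }
            ForgeError::LayoutMismatch(keys, geometry) => {
                write!(f, "Invalid Layout: Key count {} != Geometry {}", keys, geometry)
            }
        }
    }
}

impl std::error::Error for ForgeError {}

fn check_layout(key_count: usize, geometry_slots: usize) -> Result<(), ForgeError> {
    if key_count != geometry_slots {
        return Err(ForgeError::LayoutMismatch(key_count, geometry_slots));
    }
    Ok(())
}

fn check_score(score: f32) -> Result<f32, ForgeError> {
    if score < 0.0 {
        return Err(ForgeError::PhysicsViolation(score));
    }
    Ok(score)
}

/// No unwrap(): every failure is propagated to the caller as a typed error via `?`.
pub fn validated_score(
    key_count: usize,
    geometry_slots: usize,
    raw_score: f32,
) -> Result<f32, ForgeError> {
    check_layout(key_count, geometry_slots)?;
    check_score(raw_score)
}
```

Because each variant names a domain rule, a failing call tells the reviewer (or the next Reflexion step) which invariant was broken, not just that something went wrong.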