We ran 145 optimization trials in 297 seconds and got zero promotions. Then we changed one variable the optimizer wasn't allowed to touch and got a 531% altitude gain. That contrast is the entire lesson.
We took OpenRocket — an open-source model rocket simulator — stripped out the GUI, and wired it into an EGRI (Evaluator-Governed Recursive Improvement) loop. The pipeline: a headless CLI that outputs structured JSON, a Python harness that runs grid sweeps with constraint checking, a JSONL ledger that records every trial, and a formal problem spec that defines the optimization contract. We then expanded the mutation surface across 5 motor configurations, 15 rocket designs, and extended parameter ranges to measure exactly where the altitude sensitivity lives.
The Stack
| Layer | Tool | Role |
|---|---|---|
| Evaluator | OpenRocket core (Java 17) | 6DOF physics simulation, 2.06s avg per trial |
| CLI | rocket-sim | Headless interface: run, info, sweep, events |
| Harness | EGRI loop (Python) | Grid sweep, constraint checking, JSONL ledger, auto-promotion |
| Problem spec | problem-spec.yaml | Formal EGRI contract with objective, constraints, budget |
| Skill | openrocket-sim | Reusable agent skill for future sessions |
Why Simulation Is the Ideal EGRI Evaluator
EGRI's core law: never grant more mutation freedom than your evaluator can reliably judge.
A physics simulation satisfies this perfectly:
- Deterministic — same inputs always produce the same outputs (we verified: baseline altitude = 50.538m across runs)
- Fast — 2.06s average per trial, enabling 145 trials in under 5 minutes
- Trusted — the evaluator is 6DOF physics, not a heuristic or LLM judge
- Structured — outputs are typed scalars: altitude, velocity, Mach, acceleration, flight time, ground hit velocity
- Constraint-checkable — hard limits enforced per trial, violations logged to ledger
This means we can safely run in auto-promote mode — the strongest autonomy level in EGRI. No human gate needed. The evaluator is the gate.
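The determinism claim is cheap to re-check before trusting auto-promote mode. The following is a minimal sketch, not the harness's actual code: `is_repeatable` and its zero-argument `evaluate` callable (which would run one simulation and return `max_altitude_m`) are hypothetical names introduced for illustration.

```python
from typing import Callable


def is_repeatable(evaluate: Callable[[], float], runs: int = 3, tol: float = 0.0) -> bool:
    """Call the evaluator `runs` times and report whether the spread stays within `tol`.

    `evaluate` is a hypothetical callable that runs one simulation with fixed
    inputs and returns the objective (for example max_altitude_m).
    """
    if runs < 2:
        raise ValueError("need at least two runs to compare")
    values = [evaluate() for _ in range(runs)]
    return max(values) - min(values) <= tol


# A stand-in evaluator that always returns the baseline altitude from the text.
print(is_repeatable(lambda: 50.538))  # True
```

With a noisy evaluator, a non-zero `tol` makes the acceptable spread explicit, and the promotion threshold should sit above that spread.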
The Rocket Under Test
The subject: "A simple model rocket" — a single-stage design with an Estes A8-3 motor.
Component Tree:
├── Nose cone — Ogive, 100mm, 13g
├── Body tube — 300mm × 26mm OD, 15g
│ ├── Parachute — 300mm diameter
│ ├── Shock cord — 1g
│ ├── Wadding — 2g
│ ├── Launch lug
│ ├── Trapezoidal fins (×3) — 30mm span, 6g
│ ├── Centering rings (×2)
│ └── Inner Tube (motor mount) — 75mm, A8-3 motor
└── Total dry mass: ~46g
Flight timeline (A8-3 motor, sea level, no wind):
| Event | Time |
|---|---|
| Launch / Motor ignition | 0.000s |
| Lift-off | 0.105s |
| Launch rod clearance | 0.247s |
| Motor burnout | 0.730s |
| Apogee (50.5m) | 3.467s |
| Ejection charge | 3.730s |
| Parachute deployment | 3.731s |
| Ground hit (4.7 m/s) | 15.870s |
The motor burns for just 0.73 seconds. Everything after that is ballistic coast + parachute descent. This is a 16-second flight where the first 730 milliseconds determine everything.
What We Built
rocket-sim CLI
Four commands, all outputting structured JSON:
# Inspect rocket component tree + motor configurations
rocket-sim info "A simple model rocket.ork"
# → 5 flight configs: [A8-3], [B4-4], [C6-3], [C6-5], [C6-7]
# Run simulation with specific motor config
rocket-sim run "A simple model rocket.ork" 0 # A8-3 motor
rocket-sim run "A simple model rocket.ork" 1 # B4-4 motor
# Parameter sweep (single variable, N data points)
rocket-sim sweep "A simple model rocket.ork" wind_speed 0,2,5,8,10
# Flight event timeline
rocket-sim events "A simple model rocket.ork"
The CLI wraps OpenRocket's headless core module — a pure Java 17 simulation engine with zero GUI dependency. OpenRocketCore.initialize() bootstraps in one call; each Simulation.simulate() runs the full 6DOF solver.
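A Python harness could drive the `run` command roughly as follows. This is a sketch under assumptions: the helper `run_trial`, the injectable `runner` argument, and the idea that `rocket-sim run` prints exactly one JSON object on stdout are illustrative and are not confirmed by the tool's documentation. The 30-second timeout mirrors the per-trial limit in the problem spec.

```python
import json
import subprocess
from typing import Callable, Sequence


def _subprocess_runner(argv: Sequence[str]) -> str:
    """Run a command and return its stdout; raises on non-zero exit or timeout."""
    result = subprocess.run(
        list(argv), capture_output=True, text=True, check=True, timeout=30
    )
    return result.stdout


def run_trial(
    ork_path: str,
    config_index: int,
    runner: Callable[[Sequence[str]], str] = _subprocess_runner,
) -> dict:
    """Invoke `rocket-sim run <file> <config>` and parse the JSON it prints.

    Assumption: stdout holds a single JSON object of flight metrics.
    """
    argv = ["rocket-sim", "run", ork_path, str(config_index)]
    return json.loads(runner(argv))
```

Passing the runner in as a parameter keeps the wrapper testable without a Java runtime: a test can supply a function that returns canned JSON.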
EGRI Problem Spec
The optimization was formalized as a complete EGRI contract:
name: "rocket-optimization"
objective:
  metric: max_altitude_m
  direction: maximize
  secondary_metrics: [ground_hit_velocity_ms, max_velocity_ms, flight_time_s, max_mach]
constraints:
  - "ground_hit_velocity_ms <= 10.0"  # Safe recovery
  - "max_mach < 1.0"                  # Subsonic only
  - "flight_time_s > 5.0"             # Minimum flight time
  - "runtime_s <= 30"                 # Per-trial timeout
artifacts:
  mutable: artifacts/current_params.json   # Launch parameters
  immutable: rocket-tools.jar, .ork files  # Evaluator + designs
promotion:
  policy: keep_if_improves
  threshold: 0.5  # Must improve altitude by >0.5m
autonomy:
  mode: auto-promote
  escalation_triggers:
    - constraint_violation_detected
    - 10_consecutive_trials_no_improvement
    - budget_90_percent_exhausted
budget:
  max_trials: 100
  time_per_trial_s: 30
  total_time_s: 3600
Key design decisions:
- Auto-promote mode because the evaluator is deterministic physics — no risk of gaming
- 0.5m promotion threshold to filter noise: the physics is deterministic, but the stochastic wind model adds ~0.1m of run-to-run variance
- Escalation triggers so the loop knows when to ask for help expanding the mutation surface
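The constraint and promotion rules in the spec reduce to a few comparisons. The sketch below is one way to express them; the function names and the dictionary-of-metrics shape are assumptions, while the thresholds come from the spec. In the example, only the two altitudes (baseline 50.538m, best grid result 50.853m) are measured values; the other metrics are placeholders.

```python
from typing import Mapping

PROMOTION_THRESHOLD_M = 0.5  # from the spec: must improve altitude by > 0.5 m


def violated_constraints(m: Mapping[str, float]) -> list[str]:
    """Return the spec constraints that a trial's metrics violate."""
    checks = {
        "ground_hit_velocity_ms <= 10.0": m["ground_hit_velocity_ms"] <= 10.0,
        "max_mach < 1.0": m["max_mach"] < 1.0,
        "flight_time_s > 5.0": m["flight_time_s"] > 5.0,
        "runtime_s <= 30": m["runtime_s"] <= 30,
    }
    return [name for name, ok in checks.items() if not ok]


def should_promote(
    candidate: Mapping[str, float],
    incumbent_altitude_m: float,
    threshold: float = PROMOTION_THRESHOLD_M,
) -> bool:
    """keep_if_improves: feasible and better than the incumbent by more than threshold."""
    if violated_constraints(candidate):
        return False
    return candidate["max_altitude_m"] - incumbent_altitude_m > threshold


# Best grid altitude vs. baseline; non-altitude metrics are placeholder values.
best_grid = {
    "max_altitude_m": 50.853,
    "ground_hit_velocity_ms": 4.6,
    "max_mach": 0.087,
    "flight_time_s": 15.9,
    "runtime_s": 2.1,
}
print(should_promote(best_grid, incumbent_altitude_m=50.538))  # False: gain is 0.315 m
```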
Phase 1: The Grid Sweep (145 Trials)
We swept 4 launch parameters across 144 combinations plus 1 baseline:
| Parameter | Values | Count |
|---|---|---|
| Rod length | 0.5, 1.0, 1.5, 2.0 m | 4 |
| Rod angle | 0, 5, 10 deg | 3 |
| Launch altitude | 0, 500, 1000, 2000 m ASL | 4 |
| Wind speed | 0, 2, 5 m/s | 3 |
| Total | 4 × 3 × 4 × 3 | 144 |
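Enumerating the grid in the table above is a Cartesian product. A minimal sketch follows; the parameter key names are assumptions, since the layout of `current_params.json` is not shown here.

```python
from itertools import product

# Values from the sweep table; key names are hypothetical.
GRID = {
    "rod_length_m": [0.5, 1.0, 1.5, 2.0],
    "rod_angle_deg": [0, 5, 10],
    "launch_altitude_m": [0, 500, 1000, 2000],
    "wind_speed_ms": [0, 2, 5],
}


def grid_candidates(grid: dict) -> list[dict]:
    """Expand a {name: [values]} grid into one dict per combination."""
    names = list(grid)
    return [dict(zip(names, combo)) for combo in product(*(grid[n] for n in names))]


candidates = grid_candidates(GRID)
print(len(candidates))  # 144
```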
Execution Metrics
| Metric | Value |
|---|---|
| Wall clock time | 297.2s (4.95 min) |
| Total simulation time | 299.2s |
| Average per trial | 2.063s |
| Throughput | 0.49 trials/sec |
| Promotions | 0 |
| Constraint violations | 0 |
| Ledger format | JSONL, 145 entries |
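A JSONL ledger is one JSON object per line, appended as trials finish. The helpers below are a sketch of that pattern, not the harness's actual code; the function names and record fields are assumptions.

```python
import json
from pathlib import Path


def append_trial(ledger: Path, record: dict) -> None:
    """Append one trial as a single JSON line; earlier lines are never rewritten."""
    with ledger.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")


def read_ledger(ledger: Path) -> list[dict]:
    """Load every trial back, skipping blank lines."""
    with ledger.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Append-only writes mean a crash mid-sweep loses at most the trial in flight, and the file can be inspected with ordinary line-oriented tools.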
Statistical Summary (All 145 Trials)
| Metric | Min | Max | Mean | Std Dev | Range |
|---|---|---|---|---|---|
| max_altitude_m | 50.408 | 50.853 | 50.664 | 0.086 | 0.445 |
| max_velocity_ms | 29.181 | 29.293 | 29.244 | 0.022 | 0.112 |
| max_acceleration_ms2 | 143.617 | 143.698 | 143.667 | 0.015 | 0.081 |
| max_mach | 0.086 | 0.087 | 0.086 | 0.0004 | 0.001 |
| flight_time_s | 15.856 | 15.964 | 15.918 | 0.021 | 0.108 |
| ground_hit_velocity_ms | 4.414 | 4.841 | 4.632 | 0.081 | 0.427 |
| runtime_s | 1.915 | 2.473 | 2.063 | 0.090 | 0.558 |
The altitude range across all 144 candidates: 0.445 meters. Standard deviation: 8.6 centimeters. The promotion threshold was 0.5m. No candidate ever exceeded it.

Parameter Sensitivity (Altitude Delta per Variable)
| Parameter | Best Value | Worst Value | Delta |
|---|---|---|---|
| Launch altitude | 1000m ASL | 2000m ASL | 0.068m |
| Rod length | 2.0m | 0.5m | 0.060m |
| Wind speed | 0 m/s | 5 m/s | 0.027m |
| Rod angle | 10° | 5° | 0.010m |
Every parameter's sensitivity is measured in tens of millimeters. The dominant variable, launch altitude at 68mm, is barely distinguishable from simulation noise, and rod length follows at 60mm. Wind speed at 27mm is within the stochastic variance band.
The grid did its job. It exhaustively proved that launch parameters are a dead mutation surface for this rocket. That is a measurement, not a failure.
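One plausible way to derive per-parameter deltas like those in the sensitivity table is to average the objective over all trials sharing a parameter value, then take the gap between the best and worst averages. The text does not state how the table was computed, so treat this as an assumption; the record layout (`params` plus a metric key) is also hypothetical, and the example rows are toy data.

```python
from collections import defaultdict
from statistics import mean


def sensitivity(trials: list[dict], param: str, metric: str = "max_altitude_m"):
    """Return (best_value, worst_value, delta) of the mean metric grouped by `param`."""
    groups = defaultdict(list)
    for trial in trials:
        groups[trial["params"][param]].append(trial[metric])
    means = {value: mean(xs) for value, xs in groups.items()}
    best = max(means, key=means.get)
    worst = min(means, key=means.get)
    return best, worst, means[best] - means[worst]


# Toy ledger rows, not measured data.
toy = [
    {"params": {"rod_length_m": 0.5}, "max_altitude_m": 50.0},
    {"params": {"rod_length_m": 0.5}, "max_altitude_m": 50.2},
    {"params": {"rod_length_m": 2.0}, "max_altitude_m": 50.4},
    {"params": {"rod_length_m": 2.0}, "max_altitude_m": 50.6},
]
best, worst, delta = sensitivity(toy, "rod_length_m")
```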
Phase 2: Expanded Sweeps (What the EGRI Loop Told Us to Do)
The null result triggered the escalation logic: 10+ consecutive trials without improvement. Following EGRI protocol, the next move is clear — expand the mutation surface. We did this in three steps: push existing parameters to wider ranges, test motor configurations, and benchmark across rocket designs.
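The `10_consecutive_trials_no_improvement` trigger can be sketched as a streak counter. The function name and signature below are assumptions; the patience of 10 and the 0.5m threshold come from the spec.

```python
from typing import Optional, Sequence


def first_escalation(
    altitudes: Sequence[float],
    incumbent: float,
    threshold: float = 0.5,
    patience: int = 10,
) -> Optional[int]:
    """Return the 1-based trial number at which `patience` consecutive
    non-improving trials have accumulated, or None if that never happens."""
    streak = 0
    for n, altitude in enumerate(altitudes, start=1):
        if altitude - incumbent > threshold:
            incumbent = altitude  # a promotion resets the streak
            streak = 0
        else:
            streak += 1
            if streak >= patience:
                return n
    return None


# Twelve flat trials against the 50.538 m baseline: the trigger fires at trial 10.
print(first_escalation([50.6] * 12, incumbent=50.538))  # 10
```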
Individual Parameter Sweeps (Extended Range)
Wind speed (0–10 m/s, 11 data points) — the first parameter that actually moves the needle when pushed to extremes:
| Wind (m/s) | Altitude (m) | Ground Hit (m/s) | Delta from Calm |
|---|---|---|---|
| 0 | 51.14 | 4.16 | — |
| 1 | 50.96 | 4.33 | -0.18 |
| 2 | 50.65 | 4.67 | -0.49 |
| 3 | 50.30 | 5.40 | -0.84 |
| 4 | 49.65 | 5.40 | -1.49 |
| 5 | 48.68 | 6.69 | -2.46 |
| 6 | 47.29 | 7.35 | -3.85 |
| 7 | 47.58 | 9.12 | -3.56 |
| 8 | 46.53 | 9.26 | -4.61 |
| 9 | 46.04 | 8.91 | -5.10 |
| 10 | 45.70 | 9.82 | -5.44 |
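At 10 m/s wind, altitude drops 10.6% and ground hit velocity approaches the 10 m/s safety constraint. The relationship is non-linear: the loss per 1 m/s step grows from 0.18m at the first step to 1.39m between 5 and 6 m/s, then flattens. The small rebound at 7 m/s is consistent with stochastic wind variance rather than a real trend.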
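The shape of the wind curve is easier to see as per-step losses. This snippet only does arithmetic on the altitudes from the wind table above:

```python
# Altitudes (m) for wind speeds 0..10 m/s, copied from the wind sweep table.
altitude = [51.14, 50.96, 50.65, 50.30, 49.65, 48.68, 47.29, 47.58, 46.53, 46.04, 45.70]

# Altitude lost for each additional 1 m/s of wind (negative means a rebound).
step_loss = [round(a - b, 2) for a, b in zip(altitude, altitude[1:])]
total_loss_pct = round((altitude[0] - altitude[-1]) / altitude[0] * 100, 1)

print(step_loss)       # [0.18, 0.31, 0.35, 0.65, 0.97, 1.39, -0.29, 1.05, 0.49, 0.34]
print(total_loss_pct)  # 10.6
```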

Rod angle (0–30°; the table lists 8 data points), with a strong effect at extreme angles:
| Angle (°) | Altitude (m) | Delta | Velocity (m/s) |
|---|---|---|---|
| 0 | 50.71 | — | 29.25 |
| 2 | 50.57 | -0.14 | 29.29 |
| 5 | 49.87 | -0.84 | 29.28 |
| 10 | 48.22 | -2.49 | 29.37 |
| 15 | 45.73 | -4.98 | 29.48 |
| 20 | 42.68 | -8.03 | 29.67 |
| 25 | 38.98 | -11.73 | 29.89 |
| 30 | 35.37 | -15.34 | 30.18 |
At 30°, altitude drops 30% while max velocity actually increases — the rocket accelerates more on the tilted rod but wastes energy on horizontal flight. This shows why the original grid (0–10°) was too conservative to observe the effect.

Launch altitude (0–4000m ASL; the table lists 7 data points), where thinner air means higher apogee:
| Altitude ASL (m) | Apogee (m) | Delta | Ground Hit (m/s) |
|---|---|---|---|
| 0 | 50.65 | — | 4.47 |
| 500 | 50.99 | +0.34 | 4.68 |
| 1000 | 51.17 | +0.52 | 4.87 |
| 1500 | 51.63 | +0.98 | 4.82 |
| 2000 | 51.79 | +1.14 | 5.01 |
| 3000 | 52.47 | +1.82 | 5.35 |
| 4000 | 52.93 | +2.28 | 5.45 |
4000m ASL (roughly La Paz, Bolivia elevation) gains 2.28m — the only parameter that consistently improves altitude. Lower air density = less drag. But the effect is still dwarfed by motor selection.
The Motor Configuration Test (The Real Mutation Surface)
The simple rocket has 5 pre-configured motor options. We ran all 5 with identical launch parameters:
| Motor | Altitude (m) | Velocity (m/s) | Mach | Flight Time (s) | Ground Hit (m/s) | vs. A8-3 |
|---|---|---|---|---|---|---|
| A8-3 | 50.7 | 29.2 | 0.086 | 15.9 | 4.60 | baseline |
| B4-4 | 135.1 | 53.2 | 0.157 | 37.9 | 4.66 | +167% |
| C6-3 | 278.7 | 95.3 | 0.281 | 72.6 | 4.67 | +450% |
| C6-5 | 316.5 | 95.3 | 0.281 | 83.5 | 4.48 | +525% |
| C6-7 | 320.0 | 95.3 | 0.281 | 84.5 | 4.51 | +531% |
This is the table that proves the thesis. Switching from A8-3 to C6-7 yields a 269.3-meter gain — 605 times larger than the entire 0.445m range produced by sweeping all 144 launch parameter combinations. All five configurations pass every constraint: ground hit velocity stays under 5 m/s, Mach stays well subsonic.
The C6-3 vs. C6-5 vs. C6-7 comparison is also instructive: same motor impulse, different ejection delay. C6-7 (7-second delay) gains 41.3m over C6-3 (3-second delay) purely from better delay timing — the parachute deploys closer to apogee instead of during the coast phase.
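The headline ratios follow directly from the rounded values in the motor table and the Phase 1 altitude range:

```python
# Altitudes (m) from the motor configuration table.
motors = {"A8-3": 50.7, "B4-4": 135.1, "C6-3": 278.7, "C6-5": 316.5, "C6-7": 320.0}
grid_range_m = 0.445  # altitude range across the Phase 1 grid

gain_m = round(motors["C6-7"] - motors["A8-3"], 1)
gain_pct = round(gain_m / motors["A8-3"] * 100)
ratio_vs_grid = round(gain_m / grid_range_m)
delay_gain_m = round(motors["C6-7"] - motors["C6-3"], 1)

print(gain_m, gain_pct, ratio_vs_grid, delay_gain_m)  # 269.3 531 605 41.3
```

Percentages recomputed from the rounded table values can differ by a point from the "vs. A8-3" column, which was presumably computed from unrounded altitudes.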

Cross-Design Benchmark (15 Rockets)
We ran every example .ork file in the OpenRocket distribution:
| Rocket Design | Altitude (m) | Velocity (m/s) | Mach | Flight Time (s) | Hit (m/s) |
|---|---|---|---|---|---|
| Pods (winglets) | 31.9 | 24.7 | 0.073 | 10.3 | 4.99 |
| Simple model (A8-3) | 50.5 | 29.2 | 0.087 | 15.9 | 4.69 |
| Clustered motors | 57.9 | 32.4 | 0.095 | 13.8 | 6.33 |
| Base drag hack | 98.2 | 47.1 | 0.139 | 27.2 | 4.56 |
| Three stage LP | 270.6 | 78.5 | 0.232 | 72.6 | 4.50 |
| Tube fin | 282.5 | 119.4 | 0.351 | 75.6 | 4.43 |
| Chute release | 307.7 | 72.1 | 0.213 | 51.8 | 5.19 |
| Pods (powered) | 342.3 | 92.4 | 0.273 | 82.0 | 4.90 |
| ARC payload | 452.0 | 150.3 | 0.442 | 110.2 | 4.61 |
| Dual parachute | 593.1 | 135.4 | 0.398 | 65.1 | 4.72 |
| Two stage HP | 676.3 | 158.9 | 0.469 | 64.5 | 6.23 |
| Parallel booster | 1121.2 | 210.9 | 0.623 | 224.5 | 5.54 |
| Airstart timing | 1318.5 | 186.4 | 0.550 | 93.3 | 5.29 |
| Sim extensions | 2466.3 | 240.8 | 0.715 | 169.9 | 7.26 |
| Sim scripting | 2464.7 | 240.8 | 0.715 | 169.6 | 7.16 |
A 77× altitude range from 31.9m to 2466.3m — all from the same physics engine, same weather model, same evaluation pipeline. The only variables: rocket geometry, staging, motor selection, and recovery systems.

Wind Sensitivity by Design Class
Wind sensitivity varies widely across designs, and not simply with complexity:
| Rocket | Calm (m) | 10 m/s wind (m) | Loss | Loss % |
|---|---|---|---|---|
| Simple model (A8-3) | 51.1 | 45.7 | -5.4 | -10.6% |
| Three stage LP | 277.2 | 203.0 | -74.2 | -26.8% |
| Two stage HP | 679.6 | 642.9 | -36.7 | -5.4% |
The three-stage rocket loses 74 meters to 10 m/s wind — 26.8% of its calm-air altitude. The two-stage high-power design is more wind-resistant at only 5.4% loss, likely due to higher velocity and shorter coast time. This is exactly the kind of cross-design insight an EGRI loop can surface when given the right mutation surface.
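The loss percentages can be recomputed from the calm and windy altitudes in the table, which also makes the ranking explicit:

```python
# (calm altitude m, altitude at 10 m/s wind m) from the wind sensitivity table.
designs = {
    "Simple model (A8-3)": (51.1, 45.7),
    "Three stage LP": (277.2, 203.0),
    "Two stage HP": (679.6, 642.9),
}

loss_pct = {
    name: round((calm - windy) / calm * 100, 1)
    for name, (calm, windy) in designs.items()
}
ranked = sorted(loss_pct, key=loss_pct.get, reverse=True)

print(loss_pct)  # {'Simple model (A8-3)': 10.6, 'Three stage LP': 26.8, 'Two stage HP': 5.4}
print(ranked)    # most to least wind-sensitive
```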
The Insight: Mutation Surface Determines Everything
Here is what the data proves, with measured numbers:
| Mutation Surface | Range | Effect on Altitude |
|---|---|---|
| Launch parameters (what we swept) | 0.445m | 0.88% of baseline |
| Extended wind (0→10 m/s) | 5.44m | 10.6% of baseline |
| Extended rod angle (0→30°) | 15.34m | 30.3% of baseline |
| Motor selection (A8→C6-7) | 269.3m | 531% of baseline |
| Different rocket design (pods→sim ext.) | 2434.4m | 7631% of lowest |
The 144-trial grid sweep exhaustively proved that launch parameters are a dead mutation surface for a simple model rocket. That null result is not a failure — it is the evaluator telling you the truth. Zero promotions means the optimizer has converged, and convergence at baseline means the mutable artifact does not contain the dominant variable.
This is the general principle: in any EGRI loop, the first question is not "how many trials?" but "what are we allowed to change?" If the mutable artifact does not contain the dominant variable, the loop will converge immediately to a local optimum that looks identical to the baseline. Running more trials on the wrong surface is pure waste.

EGRI Anatomy of This Run
For anyone implementing EGRI in their own domain, here is how each component mapped:
| EGRI Component | This Implementation |
|---|---|
| Mutable artifact | artifacts/current_params.json — 4 launch parameters |
| Immutable evaluator | rocket-tools.jar + OpenRocket physics engine |
| Objective | Maximize max_altitude_m |
| Constraints | ground_hit ≤ 10 m/s, Mach < 1.0, flight_time > 5s |
| Promotion policy | keep_if_improves with 0.5m threshold |
| Search strategy | grid_then_llm (Phase 1: exhaustive grid, Phase 2: LLM-guided) |
| Autonomy mode | auto-promote (deterministic evaluator = no human gate) |
| Ledger | JSONL, 145 entries, every trial with full metrics + timestamps |
| Budget | 100 trials max, 30s/trial timeout, 3600s total |
| Escalation | Triggered at 10 consecutive no-improvement trials |
The escalation trigger fired at trial 10 and kept firing. In a fully autonomous system, this would have triggered mutation surface expansion automatically — adding motor selection to the mutable artifact set. We did this manually in Phase 2 to measure the effect cleanly.
What This Means for EGRI Generally
The rocket optimization is a clean proof of three EGRI principles:
1. A null result from a trusted evaluator is informative, not failed. Zero promotions across 144 trials proves that launch parameters are dominated. This is a measurement, not an error. The evaluator earned our trust by being deterministic physics — we believe the null result because we believe Newtonian mechanics.
2. The evaluator should outlive the mutation surface. We expanded from launch parameters to motor configs to cross-design benchmarks, all using the same rocket-sim evaluator. The evaluator infrastructure (CLI, JSON output, constraint checker, JSONL ledger) amortizes across every expansion. Build the evaluator first, then widen the search.
3. Grid search is the right first move when the search space is small. 4 × 3 × 4 × 3 = 144 combinations took 5 minutes and exhaustively covered the space. No heuristic, no LLM, no gradient. When you can enumerate, enumerate. Save the LLM-guided phase for spaces too large to grid — like component geometry or motor selection from a database of 500+ motors.
What Comes Next
The evaluator, harness, ledger, and constraint system are all in place. Every future expansion reuses the same pipeline. The mutation surface is ready to widen:
- Motor database sweep — OpenRocket ships hundreds of motor definitions (Estes, AeroTech, Cesaroni). Grid-sweep the full catalog with constraint checking. This is the obvious next EGRI loop given what the data showed.
- Component geometry EGRI — fin span, nose cone shape/length, body tube dimensions. This is where LLM-guided search earns its keep — the design space is too large for exhaustive grid.
- Multi-objective Pareto — altitude vs. ground hit velocity vs. cost. Map the Pareto frontier and let the evaluator surface the tradeoffs.
- Cross-design EGRI — optimize across rocket archetypes, not just parameters within one design. The 15-rocket benchmark is the seed data for this.
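For the multi-objective item, a Pareto frontier over two of the three objectives can already be computed from the 15-rocket benchmark. The sketch below maximizes altitude and minimizes ground hit velocity; cost is omitted because no cost data appears in this write-up. The function name is an assumption, and the benchmark rows are used only as stand-in data, since they are different rockets rather than variants of one design.

```python
# (name, altitude m, ground hit velocity m/s) from the cross-design benchmark.
DESIGNS = [
    ("Pods (winglets)", 31.9, 4.99), ("Simple model (A8-3)", 50.5, 4.69),
    ("Clustered motors", 57.9, 6.33), ("Base drag hack", 98.2, 4.56),
    ("Three stage LP", 270.6, 4.50), ("Tube fin", 282.5, 4.43),
    ("Chute release", 307.7, 5.19), ("Pods (powered)", 342.3, 4.90),
    ("ARC payload", 452.0, 4.61), ("Dual parachute", 593.1, 4.72),
    ("Two stage HP", 676.3, 6.23), ("Parallel booster", 1121.2, 5.54),
    ("Airstart timing", 1318.5, 5.29), ("Sim extensions", 2466.3, 7.26),
    ("Sim scripting", 2464.7, 7.16),
]


def pareto_front(designs):
    """Names of designs not dominated on (higher altitude, lower hit velocity)."""
    front = []
    for name, alt, hit in designs:
        dominated = any(
            a >= alt and h <= hit and (a > alt or h < hit)
            for _, a, h in designs
        )
        if not dominated:
            front.append(name)
    return front


front = pareto_front(DESIGNS)
print(front)
# ['Tube fin', 'ARC payload', 'Dual parachute', 'Airstart timing',
#  'Sim extensions', 'Sim scripting']
```

The quadratic scan is fine at this size; a sweep over hundreds of motors would still finish instantly, so there is little reason to reach for a sorted-sweep algorithm yet.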
Install the Skill
npx skills add broomva/openrocket-sim
The skill gives any Claude Code session access to the full headless simulation API, CLI tool docs, EGRI integration patterns, and 8 compounding strategies for building on top of OpenRocket.
The Thesis
Physics simulations are the strongest class of EGRI evaluator: deterministic, fast, trusted, and structured. But the quality of the optimization is bounded by the mutation surface. A perfect evaluator over the wrong variables does not fail — it succeeds at telling you that you are looking in the wrong place.
145 trials. 297 seconds. Zero promotions. One motor swap: +531%.
The evaluator does not lie. The question is whether you are listening.