Rocket Simulation Meets Recursive Improvement: 145 Trials, Zero Promotions, and What That Actually Proves

We ran 145 optimization trials in 297 seconds and got zero promotions. Then we changed one variable the optimizer wasn't allowed to touch and got a 531% altitude gain. That contrast is the entire lesson.

We took OpenRocket — an open-source model rocket simulator — stripped out the GUI, and wired it into an EGRI (Evaluator-Governed Recursive Improvement) loop. The pipeline: a headless CLI that outputs structured JSON, a Python harness that runs grid sweeps with constraint checking, a JSONL ledger that records every trial, and a formal problem spec that defines the optimization contract. We then expanded the mutation surface across 5 motor configurations, 15 rocket designs, and extended parameter ranges to measure exactly where the altitude sensitivity lives.

The Stack

Layer	Tool	Role
Evaluator	OpenRocket core (Java 17)	6DOF physics simulation, 2.06s avg per trial
CLI	`rocket-sim`	Headless interface: `run`, `info`, `sweep`, `events`
Harness	EGRI loop (Python)	Grid sweep, constraint checking, JSONL ledger, auto-promotion
Problem spec	`problem-spec.yaml`	Formal EGRI contract with objective, constraints, budget
Skill	`openrocket-sim`	Reusable agent skill for future sessions

Why Simulation Is the Ideal EGRI Evaluator

EGRI's core law: never grant more mutation freedom than your evaluator can reliably judge.

A physics simulation satisfies this perfectly:

Deterministic — same inputs always produce the same outputs (we verified: baseline altitude = 50.538m across runs)
Fast — 2.06s average per trial, enabling 145 trials in under 5 minutes
Trusted — the evaluator is 6DOF physics, not a heuristic or LLM judge
Structured — outputs are typed scalars: altitude, velocity, Mach, acceleration, flight time, ground hit velocity
Constraint-checkable — hard limits enforced per trial, violations logged to ledger

This means we can safely run in auto-promote mode — the strongest autonomy level in EGRI. No human gate needed. The evaluator is the gate.

The Rocket Under Test

The subject: "A simple model rocket" — a single-stage design with an Estes A8-3 motor.

Component Tree:
├── Nose cone    — Ogive, 100mm, 13g
├── Body tube    — 300mm × 26mm OD, 15g
│   ├── Parachute      — 300mm diameter
│   ├── Shock cord     — 1g
│   ├── Wadding        — 2g
│   ├── Launch lug
│   ├── Trapezoidal fins (×3) — 30mm span, 6g
│   ├── Centering rings (×2)
│   └── Inner Tube (motor mount) — 75mm, A8-3 motor
└── Total dry mass: ~46g

Flight timeline (A8-3 motor, sea level, no wind):

Event	Time
Launch / Motor ignition	0.000s
Lift-off	0.105s
Launch rod clearance	0.247s
Motor burnout	0.730s
Apogee (50.5m)	3.467s
Ejection charge	3.730s
Parachute deployment	3.731s
Ground hit (4.7 m/s)	15.870s

The motor burns for just 0.73 seconds. Everything after that is ballistic coast + parachute descent. This is a 16-second flight where the first 730 milliseconds determine everything.

What We Built

rocket-sim CLI

Four commands, all outputting structured JSON:

# Inspect rocket component tree + motor configurations
rocket-sim info "A simple model rocket.ork"
# → 5 flight configs: [A8-3], [B4-4], [C6-3], [C6-5], [C6-7]

# Run simulation with specific motor config
rocket-sim run "A simple model rocket.ork" 0    # A8-3 motor
rocket-sim run "A simple model rocket.ork" 1    # B4-4 motor

# Parameter sweep (single variable, N data points)
rocket-sim sweep "A simple model rocket.ork" wind_speed 0,2,5,8,10

# Flight event timeline
rocket-sim events "A simple model rocket.ork"

The CLI wraps OpenRocket's headless core module — a pure Java 17 simulation engine with zero GUI dependency. OpenRocketCore.initialize() bootstraps in one call; each Simulation.simulate() runs the full 6DOF solver.
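A Python harness can drive this CLI through a thin wrapper. The following is a minimal sketch, assuming `rocket-sim` is on the PATH and prints a single JSON object to stdout. The helper names are ours for illustration, and the exact output schema is an assumption.

```python
import json
import subprocess


def build_command(ork_path, config_index):
    """Build the argv list for `rocket-sim run <file> <config>`."""
    return ["rocket-sim", "run", ork_path, str(config_index)]


def parse_result(stdout_text):
    """Parse the CLI's JSON output into a dict (output schema is assumed)."""
    result = json.loads(stdout_text)
    if not isinstance(result, dict):
        raise ValueError("expected a JSON object from rocket-sim")
    return result


def run_trial(ork_path, config_index, timeout_s=30):
    """Run one simulation; timeout_s mirrors a 30 s per-trial limit."""
    completed = subprocess.run(
        build_command(ork_path, config_index),
        capture_output=True,  # collect stdout so it can be parsed
        text=True,
        timeout=timeout_s,    # raises TimeoutExpired if the trial hangs
        check=True,           # raises CalledProcessError on non-zero exit
    )
    return parse_result(completed.stdout)
```

Keeping command construction and parsing separate from the subprocess call means both can be tested without the Java evaluator installed.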

EGRI Problem Spec

The optimization was formalized as a complete EGRI contract:

name: "rocket-optimization"

objective:
  metric: max_altitude_m
  direction: maximize
  secondary_metrics: [ground_hit_velocity_ms, max_velocity_ms, flight_time_s, max_mach]

constraints:
  - "ground_hit_velocity_ms <= 10.0"   # Safe recovery
  - "max_mach < 1.0"                   # Subsonic only
  - "flight_time_s > 5.0"              # Minimum flight time
  - "runtime_s <= 30"                  # Per-trial timeout

artifacts:
  mutable: artifacts/current_params.json    # Launch parameters
  immutable: rocket-tools.jar, .ork files   # Evaluator + designs

promotion:
  policy: keep_if_improves
  threshold: 0.5  # Must improve altitude by >0.5m

autonomy:
  mode: auto-promote
  escalation_triggers:
    - constraint_violation_detected
    - 10_consecutive_trials_no_improvement
    - budget_90_percent_exhausted

budget:
  max_trials: 100
  time_per_trial_s: 30
  total_time_s: 3600

Key design decisions:

Auto-promote mode because the evaluator is deterministic physics — no risk of gaming
0.5m promotion threshold to filter noise (the stochastic wind model adds ~0.1m variance)
Escalation triggers so the loop knows when to ask for help expanding the mutation surface
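The constraint and promotion rules in the spec reduce to a few comparisons. Here is a minimal sketch of how a harness might apply them, assuming each trial's metrics arrive as a flat dict keyed by the metric names in the spec. The function names are hypothetical, not the harness's actual API.

```python
def check_constraints(metrics):
    """Return the names of violated constraints (empty list means pass)."""
    limits = {
        "ground_hit_velocity_ms": lambda v: v <= 10.0,  # safe recovery
        "max_mach": lambda v: v < 1.0,                  # subsonic only
        "flight_time_s": lambda v: v > 5.0,             # minimum flight time
        "runtime_s": lambda v: v <= 30,                 # per-trial timeout
    }
    return [name for name, ok in limits.items() if not ok(metrics[name])]


def should_promote(candidate, incumbent_altitude_m, threshold_m=0.5):
    """keep_if_improves: promote only if feasible and better by > threshold."""
    if check_constraints(candidate):
        return False
    return candidate["max_altitude_m"] - incumbent_altitude_m > threshold_m


# Illustrative candidate with values in the range of an A8-3 flight
candidate = {
    "max_altitude_m": 50.853,
    "ground_hit_velocity_ms": 4.6,
    "max_mach": 0.086,
    "flight_time_s": 15.9,
    "runtime_s": 2.0,
}
print(check_constraints(candidate))       # []
print(should_promote(candidate, 50.538))  # False: gain is under 0.5 m
```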

Phase 1: The Grid Sweep (145 Trials)

We swept 4 launch parameters across 144 combinations plus 1 baseline:

Parameter	Values	Count
Rod length	0.5, 1.0, 1.5, 2.0 m	4
Rod angle	0, 5, 10 deg	3
Launch altitude	0, 500, 1000, 2000 m ASL	4
Wind speed	0, 2, 5 m/s	3
Total	4 × 3 × 4 × 3	144
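Enumerating this grid is a plain Cartesian product. A minimal sketch, with the values taken from the table and the parameter key names assumed for illustration:

```python
from itertools import product

# Values from the sweep table; the key names are hypothetical.
GRID = {
    "rod_length_m": [0.5, 1.0, 1.5, 2.0],
    "rod_angle_deg": [0, 5, 10],
    "launch_altitude_m": [0, 500, 1000, 2000],
    "wind_speed_ms": [0, 2, 5],
}


def grid_candidates(grid):
    """Yield one parameter dict per point of the full Cartesian product."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


candidates = list(grid_candidates(GRID))
print(len(candidates))  # 144
```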

Execution Metrics

Metric	Value
Wall clock time	297.2s (4.95 min)
Total simulation time	299.2s
Average per trial	2.063s
Throughput	0.49 trials/sec
Promotions	0
Constraint violations	0
Ledger format	JSONL, 145 entries
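A JSONL ledger is one JSON object per line, appended as each trial finishes. A minimal sketch of writing and reading one, assuming a hypothetical entry layout (the real ledger's field names may differ):

```python
import io
import json
import time


def append_entry(ledger, trial_id, params, metrics, promoted):
    """Write one trial as a single JSON line with a timestamp."""
    entry = {
        "trial": trial_id,
        "timestamp": time.time(),
        "params": params,
        "metrics": metrics,
        "promoted": promoted,
    }
    ledger.write(json.dumps(entry, sort_keys=True) + "\n")


def read_entries(ledger):
    """Parse every non-empty line back into a dict."""
    return [json.loads(line) for line in ledger if line.strip()]


ledger = io.StringIO()  # stands in for an append-mode file on disk
append_entry(ledger, 0, {"motor_config": 0}, {"max_altitude_m": 50.538}, False)
ledger.seek(0)
entries = read_entries(ledger)
print(len(entries), entries[0]["metrics"]["max_altitude_m"])  # 1 50.538
```

Because each line is independent, a crashed run still leaves a readable ledger up to the last completed trial.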

Statistical Summary (All 145 Trials)

Metric	Min	Max	Mean	Std Dev	Range
max_altitude_m	50.408	50.853	50.664	0.086	0.445
max_velocity_ms	29.181	29.293	29.244	0.022	0.112
max_acceleration_ms2	143.617	143.698	143.667	0.015	0.081
max_mach	0.086	0.087	0.086	0.0004	0.001
flight_time_s	15.856	15.964	15.918	0.021	0.108
ground_hit_velocity_ms	4.414	4.841	4.632	0.081	0.427
runtime_s	1.915	2.473	2.063	0.090	0.558

The altitude range across all 144 candidates: 0.445 meters. Standard deviation: 8.6 centimeters. The promotion threshold was 0.5m. No candidate ever exceeded it.
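The promotion arithmetic can be checked directly from the reported numbers. This sketch assumes the 50.538m baseline quoted earlier is the incumbent that candidates had to beat:

```python
baseline_m = 50.538   # baseline altitude reported for the A8-3 configuration
best_m = 50.853       # max_altitude_m maximum from the summary table
worst_m = 50.408      # max_altitude_m minimum from the summary table
threshold_m = 0.5     # promotion threshold from the problem spec

best_gain_m = best_m - baseline_m
sweep_range_m = best_m - worst_m

print(f"best gain over baseline: {best_gain_m:.3f} m")    # 0.315 m
print(f"full sweep range:        {sweep_range_m:.3f} m")  # 0.445 m
print(best_gain_m > threshold_m)                           # False
```

Even the full 0.445m spread between the worst and best trial is smaller than the threshold, so no ordering of the grid could have produced a promotion.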

Parameter Sensitivity (Altitude Delta per Variable)

Parameter	Best Value	Worst Value	Delta
Launch altitude	1000m ASL	2000m ASL	0.068m
Rod length	2.0m	0.5m	0.060m
Wind speed	0 m/s	5 m/s	0.027m
Rod angle	10°	5°	0.010m

Every parameter's sensitivity is measured in tens of millimeters. The dominant variable, launch altitude at 68mm, is barely distinguishable from simulation noise, and rod length at 60mm is no better. The 27mm wind speed effect sits inside the stochastic variance band.

The grid did its job. It exhaustively showed that, within the swept ranges, launch parameters are a dead mutation surface for this rocket. That is a measurement, not a failure.
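One way to derive a per-parameter delta from a ledger is to average the objective over every trial sharing a parameter value, then subtract the worst mean from the best. The sketch below assumes that method and a hypothetical entry layout; the toy trials are made-up numbers for illustration, not ledger data.

```python
from collections import defaultdict
from statistics import mean


def sensitivity(trials, param, metric="max_altitude_m"):
    """Return (best_value, worst_value, delta) of the per-value metric means."""
    groups = defaultdict(list)
    for trial in trials:
        groups[trial["params"][param]].append(trial["metrics"][metric])
    means = {value: mean(scores) for value, scores in groups.items()}
    best = max(means, key=means.get)
    worst = min(means, key=means.get)
    return best, worst, means[best] - means[worst]


# Toy data only: two trials per rod length
toy_trials = [
    {"params": {"rod_length_m": 0.5}, "metrics": {"max_altitude_m": 50.0}},
    {"params": {"rod_length_m": 0.5}, "metrics": {"max_altitude_m": 50.2}},
    {"params": {"rod_length_m": 2.0}, "metrics": {"max_altitude_m": 50.3}},
    {"params": {"rod_length_m": 2.0}, "metrics": {"max_altitude_m": 50.5}},
]
best, worst, delta = sensitivity(toy_trials, "rod_length_m")
print(best, worst, round(delta, 3))  # 2.0 0.5 0.3
```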

Phase 2: Expanded Sweeps (What the EGRI Loop Told Us to Do)

The null result triggered the escalation logic: 10+ consecutive trials without improvement. Following EGRI protocol, the next move is clear — expand the mutation surface. We did this in three steps: push existing parameters to wider ranges, test motor configurations, and benchmark across rocket designs.

Individual Parameter Sweeps (Extended Range)

Wind speed (0–10 m/s, 11 data points) — the first parameter that actually moves the needle when pushed to extremes:

Wind (m/s)	Altitude (m)	Ground Hit (m/s)	Delta from Calm
0	51.14	4.16	—
1	50.96	4.33	-0.18
2	50.65	4.67	-0.49
3	50.30	5.40	-0.84
4	49.65	5.40	-1.49
5	48.68	6.69	-2.46
6	47.29	7.35	-3.85
7	47.58	9.12	-3.56
8	46.53	9.26	-4.61
9	46.04	8.91	-5.10
10	45.70	9.82	-5.44

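At 10 m/s wind, altitude drops 10.6% and ground hit velocity approaches the 10 m/s safety constraint. The relationship is non-linear: the loss per additional 1 m/s of wind grows steadily from 0.18m to 1.39m as wind rises to 6 m/s. Beyond that the curve turns irregular, and the 7 m/s run lands 0.29m above the 6 m/s run, which suggests run-to-run variation from the stochastic wind model rather than a smooth trend.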
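The shape of the curve is easier to see as per-step losses. A short sketch using the altitudes from the table:

```python
# Wind speed (m/s) -> apogee (m), copied from the sweep table
WIND_SWEEP = {0: 51.14, 1: 50.96, 2: 50.65, 3: 50.30, 4: 49.65, 5: 48.68,
              6: 47.29, 7: 47.58, 8: 46.53, 9: 46.04, 10: 45.70}

calm_m = WIND_SWEEP[0]
# Percentage of calm-air altitude lost at each wind speed
loss_pct = {w: 100 * (calm_m - alt) / calm_m for w, alt in WIND_SWEEP.items()}
# Altitude lost for each additional 1 m/s of wind
step_loss_m = {w: round(WIND_SWEEP[w - 1] - WIND_SWEEP[w], 2) for w in range(1, 11)}

print(f"{loss_pct[10]:.1f}%")                 # 10.6%
print(max(step_loss_m, key=step_loss_m.get))  # 6 (largest single-step loss)
print(step_loss_m[7])                         # -0.29 (altitude went up)
```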

Rod angle (0–30°, 8 data points), with a strong effect at extreme angles:

Angle (°)	Altitude (m)	Delta	Velocity (m/s)
0	50.71	—	29.25
2	50.57	-0.14	29.29
5	49.87	-0.84	29.28
10	48.22	-2.49	29.37
15	45.73	-4.98	29.48
20	42.68	-8.03	29.67
25	38.98	-11.73	29.89
30	35.37	-15.34	30.18

At 30°, altitude drops 30% while max velocity actually increases — the rocket accelerates more on the tilted rod but wastes energy on horizontal flight. This shows why the original grid (0–10°) was too conservative to observe the effect.

Launch altitude (0–4000m ASL, 7 data points), where thinner air means a higher apogee:

Altitude ASL (m)	Apogee (m)	Delta	Ground Hit (m/s)
0	50.65	—	4.47
500	50.99	+0.34	4.68
1000	51.17	+0.52	4.87
1500	51.63	+0.98	4.82
2000	51.79	+1.14	5.01
3000	52.47	+1.82	5.35
4000	52.93	+2.28	5.45

4000m ASL (roughly La Paz, Bolivia elevation) gains 2.28m — the only parameter that consistently improves altitude. Lower air density = less drag. But the effect is still dwarfed by motor selection.

The Motor Configuration Test (The Real Mutation Surface)

The simple rocket has 5 pre-configured motor options. We ran all 5 with identical launch parameters:

Motor	Altitude (m)	Velocity (m/s)	Mach	Flight Time (s)	Ground Hit (m/s)	vs. A8-3
A8-3	50.7	29.2	0.086	15.9	4.60	baseline
B4-4	135.1	53.2	0.157	37.9	4.66	+167%
C6-3	278.7	95.3	0.281	72.6	4.67	+450%
C6-5	316.5	95.3	0.281	83.5	4.48	+525%
C6-7	320.0	95.3	0.281	84.5	4.51	+531%

This is the table that proves the thesis. Switching from A8-3 to C6-7 yields a 269.3-meter gain — 605 times larger than the entire 0.445m range produced by sweeping all 144 launch parameter combinations. All five configurations pass every constraint: ground hit velocity stays under 5 m/s, Mach stays well subsonic.

The C6-3 vs. C6-5 vs. C6-7 comparison is also instructive: same motor impulse, different ejection delay. C6-7 (7-second delay) gains 41.3m over C6-3 (3-second delay) purely from better delay timing — the parachute deploys closer to apogee instead of during the coast phase.
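The headline ratios follow from the rounded table values. A quick check:

```python
# Motor -> apogee (m), copied from the motor configuration table
MOTOR_ALTITUDE_M = {"A8-3": 50.7, "B4-4": 135.1, "C6-3": 278.7,
                    "C6-5": 316.5, "C6-7": 320.0}
GRID_RANGE_M = 0.445  # altitude range of the Phase 1 grid sweep

baseline_m = MOTOR_ALTITUDE_M["A8-3"]
gain_m = MOTOR_ALTITUDE_M["C6-7"] - baseline_m
gain_pct = 100 * gain_m / baseline_m
ratio_to_grid = gain_m / GRID_RANGE_M
delay_gain_m = MOTOR_ALTITUDE_M["C6-7"] - MOTOR_ALTITUDE_M["C6-3"]

print(f"{gain_m:.1f} m, +{gain_pct:.0f}%")        # 269.3 m, +531%
print(f"{ratio_to_grid:.0f}x the grid range")     # 605x the grid range
print(f"{delay_gain_m:.1f} m from delay timing")  # 41.3 m from delay timing
```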

Cross-Design Benchmark (15 Rockets)

We ran every example .ork file in the OpenRocket distribution:

Rocket Design	Altitude (m)	Velocity (m/s)	Mach	Flight Time (s)	Hit (m/s)
Pods (winglets)	31.9	24.7	0.073	10.3	4.99
Simple model (A8-3)	50.5	29.2	0.087	15.9	4.69
Clustered motors	57.9	32.4	0.095	13.8	6.33
Base drag hack	98.2	47.1	0.139	27.2	4.56
Three stage LP	270.6	78.5	0.232	72.6	4.50
Tube fin	282.5	119.4	0.351	75.6	4.43
Chute release	307.7	72.1	0.213	51.8	5.19
Pods (powered)	342.3	92.4	0.273	82.0	4.90
ARC payload	452.0	150.3	0.442	110.2	4.61
Dual parachute	593.1	135.4	0.398	65.1	4.72
Two stage HP	676.3	158.9	0.469	64.5	6.23
Parallel booster	1121.2	210.9	0.623	224.5	5.54
Airstart timing	1318.5	186.4	0.550	93.3	5.29
Sim extensions	2466.3	240.8	0.715	169.9	7.26
Sim scripting	2464.7	240.8	0.715	169.6	7.16

A 77× altitude range from 31.9m to 2466.3m — all from the same physics engine, same weather model, same evaluation pipeline. The only variables: rocket geometry, staging, motor selection, and recovery systems.

Wind Sensitivity by Design Class

Wind sensitivity varies widely across designs, and it does not simply scale with complexity:

Rocket	Calm (m)	10 m/s wind (m)	Loss	Loss %
Simple model (A8-3)	51.1	45.7	-5.4	-10.6%
Three stage LP	277.2	203.0	-74.2	-26.8%
Two stage HP	679.6	642.9	-36.7	-5.4%

The three-stage rocket loses 74 meters to 10 m/s wind — 26.8% of its calm-air altitude. The two-stage high-power design is more wind-resistant at only 5.4% loss, likely due to higher velocity and shorter coast time. This is exactly the kind of cross-design insight an EGRI loop can surface when given the right mutation surface.

The Insight: Mutation Surface Determines Everything

Here is what the data proves, with measured numbers:

Mutation Surface	Range	Effect on Altitude
Launch parameters (what we swept)	0.445m	0.88% of baseline
Extended wind (0→10 m/s)	5.44m	10.6% of baseline
Extended rod angle (0→30°)	15.34m	30.3% of baseline
Motor selection (A8→C6-7)	269.3m	531% of baseline
Different rocket design (pods→sim ext.)	2434.4m	7634% of lowest

The 144-trial grid sweep exhaustively proved that launch parameters are a dead mutation surface for a simple model rocket. That null result is not a failure — it is the evaluator telling you the truth. Zero promotions means the optimizer has converged, and convergence at baseline means the mutable artifact does not contain the dominant variable.

This is the general principle: in any EGRI loop, the first question is not "how many trials?" but "what are we allowed to change?" If the mutable artifact does not contain the dominant variable, the loop will converge immediately to a local optimum that looks identical to the baseline. Running more trials on the wrong surface is pure waste.

EGRI Anatomy of This Run

For anyone implementing EGRI in their own domain, here is how each component mapped:

EGRI Component	This Implementation
Mutable artifact	`artifacts/current_params.json` — 4 launch parameters
Immutable evaluator	`rocket-tools.jar` + OpenRocket physics engine
Objective	Maximize `max_altitude_m`
Constraints	ground_hit ≤ 10 m/s, Mach < 1.0, flight_time > 5s
Promotion policy	`keep_if_improves` with 0.5m threshold
Search strategy	`grid_then_llm` (Phase 1: exhaustive grid, Phase 2: LLM-guided)
Autonomy mode	`auto-promote` (deterministic evaluator = no human gate)
Ledger	JSONL, 145 entries, every trial with full metrics + timestamps
Budget	100 trials max, 30s/trial timeout, 3600s total
Escalation	Triggered at 10 consecutive no-improvement trials

The escalation trigger fired at trial 10 and kept firing. In a fully autonomous system, this would have triggered mutation surface expansion automatically — adding motor selection to the mutable artifact set. We did this manually in Phase 2 to measure the effect cleanly.
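The no-improvement trigger is a streak counter. A minimal sketch, assuming a trial counts as an improvement only when its gain exceeds the promotion threshold. The function name and input format are hypothetical.

```python
def first_escalation_trial(gains_m, threshold_m=0.5, patience=10):
    """Return the 1-based trial index at which the trigger first fires, or None.

    gains_m[i] is a trial's altitude gain over the incumbent at that time.
    """
    streak = 0
    for index, gain in enumerate(gains_m, start=1):
        # Any gain above the threshold resets the streak
        streak = 0 if gain > threshold_m else streak + 1
        if streak >= patience:
            return index
    return None


# A run where no trial clears the threshold fires at trial 10
print(first_escalation_trial([0.1] * 144))  # 10
```

A loop that keeps running after the first firing will see the condition hold on every later trial too, which matches the "kept firing" behavior described above.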

What This Means for EGRI Generally

The rocket optimization is a clean proof of three EGRI principles:

1. A null result from a trusted evaluator is information, not failure. Zero promotions across 144 trials shows that launch parameters are dominated by variables outside the mutable artifact. This is a measurement, not an error. The evaluator earned our trust by being deterministic physics: we believe the null result because we believe Newtonian mechanics.

2. The evaluator should outlive the mutation surface. We expanded from launch parameters to motor configs to cross-design benchmarks, all using the same rocket-sim evaluator. The evaluator infrastructure (CLI, JSON output, constraint checker, JSONL ledger) amortizes across every expansion. Build the evaluator first, then widen the search.

3. Grid search is the right first move when the search space is small. 4 × 3 × 4 × 3 = 144 combinations took 5 minutes and exhaustively covered the space. No heuristic, no LLM, no gradient. When you can enumerate, enumerate. Save the LLM-guided phase for spaces too large to grid — like component geometry or motor selection from a database of 500+ motors.

What Comes Next

The evaluator, harness, ledger, and constraint system are all in place. Every future expansion reuses the same pipeline. The mutation surface is ready to widen:

Motor database sweep — OpenRocket ships hundreds of motor definitions (Estes, AeroTech, Cesaroni). Grid-sweep the full catalog with constraint checking. This is the obvious next EGRI loop given what the data showed.
Component geometry EGRI — fin span, nose cone shape/length, body tube dimensions. This is where LLM-guided search earns its keep — the design space is too large for exhaustive grid.
Multi-objective Pareto — altitude vs. ground hit velocity vs. cost. Map the Pareto frontier and let the evaluator surface the tradeoffs.
Cross-design EGRI — optimize across rocket archetypes, not just parameters within one design. The 15-rocket benchmark is the seed data for this.
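As a preview of the multi-objective item, the 15-rocket benchmark already supports a two-objective Pareto filter. This is a sketch under stated assumptions: it uses only altitude (maximize) and ground hit velocity (minimize) from the benchmark table, leaves cost out because we have no cost data, and the function name is ours.

```python
# name: (altitude_m, ground_hit_ms), copied from the cross-design benchmark
BENCHMARK = {
    "Pods (winglets)": (31.9, 4.99), "Simple model (A8-3)": (50.5, 4.69),
    "Clustered motors": (57.9, 6.33), "Base drag hack": (98.2, 4.56),
    "Three stage LP": (270.6, 4.50), "Tube fin": (282.5, 4.43),
    "Chute release": (307.7, 5.19), "Pods (powered)": (342.3, 4.90),
    "ARC payload": (452.0, 4.61), "Dual parachute": (593.1, 4.72),
    "Two stage HP": (676.3, 6.23), "Parallel booster": (1121.2, 5.54),
    "Airstart timing": (1318.5, 5.29), "Sim extensions": (2466.3, 7.26),
    "Sim scripting": (2464.7, 7.16),
}


def pareto_front(designs):
    """Keep designs not dominated on (higher altitude, lower ground hit)."""
    front = []
    for name, (alt, hit) in designs.items():
        dominated = any(
            o_alt >= alt and o_hit <= hit and (o_alt > alt or o_hit < hit)
            for other, (o_alt, o_hit) in designs.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


print(pareto_front(BENCHMARK))
```

On the benchmark table this keeps six designs: Tube fin, ARC payload, Dual parachute, Airstart timing, Sim extensions, and Sim scripting. Every other design is beaten on both objectives by at least one of them.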

Install the Skill

npx skills add broomva/openrocket-sim

The skill gives any Claude Code session access to the full headless simulation API, CLI tool docs, EGRI integration patterns, and 8 compounding strategies for building on top of OpenRocket.

The Thesis

Physics simulations are the strongest class of EGRI evaluator: deterministic, fast, trusted, and structured. But the quality of the optimization is bounded by the mutation surface. A perfect evaluator over the wrong variables does not fail — it succeeds at telling you that you are looking in the wrong place.

145 trials. 297 seconds. Zero promotions. One motor swap: +531%.

The evaluator does not lie. The question is whether you are listening.