What the AI productivity step-function looks like — and the substrate that produced it

Sam Altman says the world wants 1000x more software. Vercel CEO Guillermo Rauch says he's hiring people who know how to use AI: engineers who write good prompts and orchestrate agents to maximize results. They're describing the same inflection. Here's what it looks like in a single contribution graph (five contributions in 2022, 6,519 in five months of 2026) and the open-source substrate that produced it.

May 5, 2026

13 min read
Tags: agents, harness-engineering, context-engineering, infrastructure, bstack, life-agent-os, open-source, productivity, rcs

"The world wants a gigantic amount more software, 100 times maybe a thousand times more software." — Sam Altman, Federal Reserve conference

Vercel hires "people who know how to use AI — who write good prompts and manage AI agents to maximize results." — Guillermo Rauch, Vercel CEO (Bloomberg Línea reel)

Two angles on the same thing. Altman is the demand side: orders of magnitude more software than humans-typing-alone can produce. Rauch is the supply side: hire operators who manage agents, not operators who out-type them. v0 hit 3,200 PRs merged per day — that's not a team scaling, it's a substrate scaling.

This post is what that inflection looks like in one engineer's contribution graph — and the open-source substrate that produced it.

The chart is broomva's GitHub contribution history, pulled from the GraphQL API on 2026-05-06. Five contributions in 2022. 6,519 in the first five months of 2026. The inflection wasn't gradual. It happened in a single month: February to March, 465 to 2,881, a 6.2× jump that held through April.

What started shipping that month is bstack — a portable harness metalayer of eleven irreducible primitives that turn any agent-driven workspace into a self-operating system. That's the aha: the multiplier isn't from the model getting smarter; it's from the substrate becoming load-bearing.

Try it in one command

If you build with Claude Code, Codex, Gemini CLI, OpenCode, or any of the 50+ agent CLIs the skills ecosystem supports, you can install the whole stack right now:

npx skills add broomva/bstack

Then, in your agent session:

/bstack bootstrap     → install all 28 skills + scaffold governance + wire hooks + run doctor
/bstack doctor        → verify the eleven-primitive contract (always exits 0; --strict for CI)
/bstack status        → show installed-vs-missing skills + harness health
/bstack repair        → apply targeted fixes for any gap doctor surfaces

bootstrap is the only one most people need. It scaffolds the governance files (CLAUDE.md, AGENTS.md, .control/policy.yaml), wires the hooks (Stop, PreToolUse, SessionStart), installs the 28 curated skills, and runs the doctor to confirm the primitive contract is satisfied. Idempotent: re-running it on an existing project never overwrites your customizations.

Costs you nothing if you don't use it. Gives you a self-operating workspace if you do.

The full eleven-primitive spec — what each closes, how it's enforced, and the four reasoning-enforced reflexive trigger rules — lives at references/primitives.md. The repo itself is broomva/bstack.

The rest of this post is why the substrate produced that 6.2× month — what each primitive closes, where bstack is ahead of the YC / Anthropic / harness-engineering discourse, and the dogfood evidence from the session that built this very post.

The graph in detail

Yearly contributions, public + private:

| Year | Contributions | Phase |
|---|---|---|
| 2022 | 5 | dormant |
| 2023 | 489 | reactivation |
| 2024 | 832 | AI-amplification ramp |
| 2025 | 2,763 | sustained |
| 2026 (5 months) | 6,519 | bstack era |

Monthly breakdown of 2026, where the inflection is sharp:

| Month | Contributions |
|---|---|
| Jan | 258 |
| Feb | 465 |
| Mar | 2,881 |
| Apr | 2,561 |
| May (5 days) | 354 |

For comparison: a productive engineer makes roughly 1,500–2,500 commits in a typical year. The 2026 trajectory is on pace for ~15,000+ in the calendar year. That's not a 10× engineer; it's the productivity multiplier Altman and Rauch are pointing at, made personal and dated.
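The numbers above can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the monthly counts from the table. The straight-line annualization over the 125 days elapsed through May 5 is my assumption about how one might extrapolate, not necessarily how the ~15,000 figure was derived.

```python
# Monthly contribution counts for 2026, copied from the table above.
monthly_2026 = {"Jan": 258, "Feb": 465, "Mar": 2881, "Apr": 2561, "May": 354}

total = sum(monthly_2026.values())
feb_to_mar = monthly_2026["Mar"] / monthly_2026["Feb"]

# Assumption: Jan 1 through May 5 is 31 + 28 + 31 + 30 + 5 = 125 days,
# and the pace is extrapolated linearly over a 365-day year.
days_elapsed = 31 + 28 + 31 + 30 + 5
annualized = total / days_elapsed * 365

print(f"total so far: {total}")
print(f"Feb -> Mar multiplier: {feb_to_mar:.1f}x")
print(f"linear full-year pace: {annualized:,.0f}")
```

Under that linear assumption the full-year pace comes out near 19,000, so the ~15,000+ estimate is on the conservative side.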

What changed in March

I'm not a different engineer than I was in January. The substrate is.

Three threads in the industry triangulated to the same place around end of 2025, and bstack is the answer they pointed at:

YC flipped its thesis. The Spring 2026 Request for Startups read: "AI-native companies can do expensive work faster." The Summer 2026 RFS reads: "AI becomes the operating system of the company." Of the W26 batch — 194 companies, the largest ever — 41.5% are building agent infrastructure: auth, testing, security, monitoring, context management, billing. The application-layer gold rush created demand for a platform layer. The defensible moat per RFS commentary is governance — who can act, what they can access, how you audit.

YC W26 batch composition: agent infrastructure dominates

Harness engineering became the dominant frame. Across Anthropic's three-agent harness, Google Scion's multi-agent hypervisor, MongoDB's platform metalayer, and Claude Code's reframing as "programmable runtime with 22 lifecycle events," the industry converged on a single sentence: "if you're not the model, you're the harness." Production rejected unconstrained swarms. The winning shape is bounded deterministic workflows with phase-gating — Plan-Execute-Verify loops, hard circuit breakers, human-on-the-loop. The numbers: 65% of enterprise AI failures trace to harness defects — context drift, schema misalignment, state degradation — not reasoning gaps.

Prompt engineering got renamed context engineering. Karpathy named it in June 2025. Anthropic formalized it. Context windows have n² attention scarcity: larger contexts mean worse recall, which the field calls context rot. The frontier moved from "stuff more in" to managed filesystem-like environments with explicit write/promote/retrieve/compress/forget operations. Mem0, Memory-R1, Mem-α, EverMemOS: each treats memory as a managed lifecycle, not a passive store. Gartner: by 2027, 80% of AI failures will trace to poor context management.

Three threads. One conclusion: reasoning is saturating; the environment is where the work is. The harness, the substrate, the workspace — these are the bottleneck. And once a single human operator gets a workspace that handles its own state, ships its own knowledge graph, drains its own CI queues, and refuses its own destructive commands when policy says no — the throughput ceiling moves up by an order of magnitude or more. That's what March's number is.

What bstack is

bstack is a portable harness metalayer. Eleven irreducible primitives that turn any agent-driven workspace into a self-operating system.

It's not a framework. It's not a SaaS. It's a curated set of skills you npx skills add broomva/bstack into your Claude Code, Codex, Gemini CLI, OpenCode, or any of the 50+ agent CLIs the skills ecosystem supports. Each primitive is a small, opinionated piece of substrate: a hook, a script, a reflexive rule, or a configuration block. Each one closes a specific failure mode that would otherwise drift into entropy across an unsupervised session.

The eleven primitives compose into one autonomous development loop:

The primitives composing the autonomous-development loop (diagram shows the original nine; P10 + P11 joined the contract through the parallel work this post describes)

One naming detail worth flagging up front: P6's skill repo is broomva/bookkeeping, not broomva/p6 — that one mismatch is historical. Every other primitive number lines up with its skill name (P9 → broomva/p9, etc.). Primitive numbers are sequential identifiers in the bstack itself; skills are npm-style packages.

The eleven primitives

| # | Primitive | Closes the failure mode | Mechanism |
|---|---|---|---|
| P1 | Conversation Bridge | Sessions are stateless; context evaporates between agent runs. | Stop hook → JSONL transcript → structured Obsidian doc → vault. Bridge stamp must be < 24h stale. |
| P2 | Control Gate | Agents fluently run destructive shell commands the model didn't actually authorize. | PreToolUse hook reads .control/policy.yaml. G1–G4 gates are blocking, never bypassed. |
| P3 | Linear Tickets | Autonomous work without tracking is invisible work. | Linear MCP: every unit of work has a ticket. State must reflect actual state. |
| P4 | PR Pipeline | Code without review is hope. | Branch → PR → CI → merge → deploy. Never merge with failing checks. Never --no-verify. |
| P5 | Parallel Agents | Sequential execution wastes time on independent tasks. | git worktree per agent. Independent contexts merged via branches, not shared state. |
| P6 | Knowledge Bookkeeping (broomva/bookkeeping skill) | Knowledge graphs without quality control degrade into noise. | 7-stage pipeline. Nous gate (novelty + specificity + relevance, ≥ 5/9) for promotion. Contradiction detection. |
| P7 | Skill Freshness Check | Skills installed via npx skills add are snapshots and silently rot between manual updates. | SessionStart hook nudges when an update check is overdue (≥ 7d). Never blocks. |
| P8 | Branch + Worktree Janitor | Squash-merged branches accumulate forever; git branch --merged doesn't catch them. | make janitor runs the canonical squash-detection idiom. Default --dry-run; protected branches always skipped. |
| P9 | Productive Wait (broomva/p9 skill) | Agents sleep on blocking waits (CI, deploys, builds) and waste 5–15 min per operation. | Wait optimizer. PR CI is the canonical case: gh pr checks --watch via run_in_background, wait-queue drains during the wait, classifier + evaluator self-heal red CI, auto-merge actuator with governance pre-pass. Non-PR waits (push-triggered deploys, builds) get a single direct check after kicking off next-priority work; wiring those into p9 watch is on the roadmap. |
| P10 | Worktree Hygiene Discipline | Dirty trees, half-finished branches, and orphan worktrees become slow leaks of merge conflicts and "what was I doing?" amnesia across sessions. | Reasoning-enforced rule: P5 supplies the mechanism (worktrees), P10 supplies the discipline: decide worktree-or-not before the first file, keep git status clean through the PR lifecycle, run P8 janitor immediately after merge. |
| P11 | Empirical Feedback Loop | Code that compiles, lints, and even passes CI, but never actually does the thing the user asked for in the deployed environment. | Reflexive composition: parallel run_in_background log-tails, gstack/agent-browser E2E, before-and-after visual diff, multi-level test composition, deploy-preview screenshot capture. Reasoning isn't validation; interaction is. |
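To make the P2 row concrete, here is a minimal sketch of a command gate. It assumes the hook has already extracted the proposed shell command as a string. The real .control/policy.yaml schema and the G1–G4 gate definitions are not shown in this post, so `BLOCKED_PATTERNS`, `Verdict`, and `check_command` are hypothetical names with illustrative contents.

```python
import re
from dataclasses import dataclass

# Hypothetical policy contents. The real .control/policy.yaml schema is not
# shown in this post; these patterns are illustrative only.
BLOCKED_PATTERNS = [
    r"\bgit\s+reset\s+--hard\b",
    r"\bgit\s+push\s+.*--force\b",
    r"\brm\s+-rf\s+/",
    r"--no-verify\b",
]


@dataclass(frozen=True)
class Verdict:
    allowed: bool
    reason: str


def check_command(command: str, patterns=BLOCKED_PATTERNS) -> Verdict:
    """Return a blocking verdict if the command matches any policy pattern."""
    for pattern in patterns:
        if re.search(pattern, command):
            return Verdict(False, f"blocked by policy pattern: {pattern}")
    return Verdict(True, "no policy pattern matched")


for cmd in ["git status", "git reset --hard HEAD~1", "git commit --no-verify -m wip"]:
    verdict = check_command(cmd)
    print(f"{cmd!r}: allowed={verdict.allowed} ({verdict.reason})")
```

The point of the shape is that the check runs before execution and returns a reason, so the agent sees why a command was refused rather than just failing.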

Four of these (P7, P8, P10, P11) are recent crystallizations. P7 + P8 were born from a session where existing primitives surfaced their own bugs (a stale install, orphan watcher rows). P10 and P11 followed from the next set of failure modes the loop exposed: dirty trees compounding across sessions (P10), and code that ships green-CI but doesn't behave correctly when actually exercised (P11). The system found its own gaps and shipped the fixes. That's the dogfood loop covered in the evidence section further down.
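P7's "nudge, never block" behavior is simple enough to sketch. This assumes the timestamp of the last update check is available as a timezone-aware datetime. Where bstack stores that stamp is not covered here, and `freshness_nudge` is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

UPDATE_CHECK_INTERVAL = timedelta(days=7)  # threshold from the P7 row


def freshness_nudge(last_check: datetime, now: datetime) -> Optional[str]:
    """Return a nudge message when the last check is 7+ days old, else None.

    The function only ever returns a message; it never raises or exits,
    which mirrors the "never blocks" contract.
    """
    age = now - last_check
    if age >= UPDATE_CHECK_INTERVAL:
        return f"skills last checked {age.days}d ago; consider running an update check"
    return None


now = datetime(2026, 5, 6, tzinfo=timezone.utc)
print(freshness_nudge(now - timedelta(days=3), now))
print(freshness_nudge(now - timedelta(days=9), now))
```

The same staleness shape would fit P1's 24-hour bridge stamp with a different threshold.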

Where bstack is ahead of the discourse

Four things bstack does that I haven't found in YC commentary, Anthropic's engineering blog, or the harness-engineering survey papers.

Formalized stability budgets

bstack's governance layer (CLAUDE.md + AGENTS.md + .control/policy.yaml) is the Level 3 controller in a Recursive Controlled Systems hierarchy with formal stability proofs.

| Level | System | Controller | Stability λ |
|---|---|---|---|
| L0 | External plant | Arcan agent loop (shell.rs) | 1.455 |
| L1 | Agent internal | Autonomic homeostasis controller | 0.411 |
| L2 | Meta-control | EGRI loop engine | 0.069 |
| L3 | Governance | CLAUDE.md + AGENTS.md + policy.yaml | 0.006 |

Composite stability: λᵢ > 0 at all levels ⟹ exponentially stable (Theorem 1, p0-foundations). The L3 stability margin is narrow on purpose — governance changes consume budget. If you rewrite AGENTS.md every session, the system destabilizes. If you observe patterns across sessions and crystallize rules slowly, it converges. The math is what justifies the rule that says "governance changes are rare and deliberate" — it's not a stylistic choice, it's a stability constraint.
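A short sketch of what those margins imply, using only the four λ values from the table. The positivity check is the condition quoted above. Reading each λ as an exponential decay rate, so that 1/λ is a settling time constant, is my assumption for illustration; the formal definition lives in p0-foundations and is not reproduced here.

```python
# Stability margins per level, copied from the RCS table.
margins = {"L0": 1.455, "L1": 0.411, "L2": 0.069, "L3": 0.006}

# Condition quoted in the post: every level must have a positive margin.
composite_stable = all(lam > 0 for lam in margins.values())

# Assumption: if each lambda is an exponential decay rate, 1/lambda is the
# time constant for that level to absorb a perturbation (units unspecified).
time_constants = {level: 1.0 / lam for level, lam in margins.items()}

print("composite stable:", composite_stable)
for level, tau in time_constants.items():
    print(f"{level}: lambda={margins[level]:.3f}  1/lambda={tau:.1f}")
```

Under that reading the governance level settles roughly 240 times more slowly than the agent loop, which is one way to see why frequent governance rewrites are expensive.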

RCS hierarchy with stability budgets

Reflexive trigger rules

Most harnesses enforce policy through hooks. Hooks are rigid: they fire or they don't, they have no judgment. Anthropic's Effective context engineering recommends finding "the right altitude" — specific enough to guide, flexible enough to let the model judge.

bstack does this with reflexive trigger rules in AGENTS.md, written as binding-on-every-agent text. P6 has one ("Bookkeeping is reflexive — invoke without being prompted whenever you commit a feature that reads from the graph"). P9 has five ("After every PR push, after a non-PR deploy push, before any sleep, on red CI before re-pushing, on MERGE_READY before manual merge"). The rule loads at SessionStart; the agent reads it and applies it. No hook needed. If the agent finds itself sleeping on a blocking wait, the reflex failed and that's a session-level error worth flagging.

This isn't softer than hooks — it's deeper. Hooks gate one tool call. Reflexive rules shape the agent's whole approach to a class of work.

Productive wait

I searched the harness-engineering literature for "drain context-scoped queue while CI runs." It doesn't appear. P9 is genuinely novel.

The standard pattern is: agent pushes, agent sleeps, agent polls. The bstack pattern is: agent pushes, agent spawns gh pr checks --watch via run_in_background, then drains a priority-ordered queue (session > memory > graph > docs > linear) while the watcher runs. When the bg-task notification fires, the agent reads p9 status, transitions to MERGE_READY or runs the classifier. The wait time disappears.
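The drain step can be sketched as a priority queue that is polled against a watcher. The priority order is the one given above. Everything else is hypothetical: `drain_while_waiting`, the `ci_finished` callable (a stand-in for the background-task notification), and the simulated watcher are illustrations, not the p9 implementation.

```python
import heapq
from itertools import count

# Queue priority from the post: session > memory > graph > docs > linear.
PRIORITY = {"session": 0, "memory": 1, "graph": 2, "docs": 3, "linear": 4}


def drain_while_waiting(items, ci_finished):
    """Process queued work in priority order until the CI watcher reports done.

    items: iterable of (scope, description) pairs.
    ci_finished: zero-argument callable, polled before each item.
    Returns (completed, remaining) lists of (scope, description) pairs.
    """
    tie = count()  # preserves insertion order within the same scope
    heap = [(PRIORITY[scope], next(tie), scope, desc) for scope, desc in items]
    heapq.heapify(heap)

    completed = []
    while heap and not ci_finished():
        _, _, scope, desc = heapq.heappop(heap)
        completed.append((scope, desc))  # real work would happen here
    remaining = [(scope, desc) for _, _, scope, desc in sorted(heap)]
    return completed, remaining


def fake_watcher(polls_until_done):
    """Simulated watcher that reports completion after a fixed number of polls."""
    state = {"polls": 0}

    def ci_finished():
        state["polls"] += 1
        return state["polls"] > polls_until_done

    return ci_finished


queue = [
    ("docs", "update README"),
    ("linear", "sync ticket states"),
    ("session", "write handoff note"),
    ("graph", "promote new entity"),
    ("memory", "record CI flake pattern"),
]
done, left = drain_while_waiting(queue, fake_watcher(polls_until_done=3))
print("done during CI:", [scope for scope, _ in done])
print("still queued:", [scope for scope, _ in left])
```

With the simulated watcher finishing after three polls, the session, memory, and graph items complete during the wait, and the docs and linear items stay queued for the next wait window.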

This single session's evidence: 5 PRs shipped in parallel CI windows that would otherwise have been 5 sequential sleep 600s. That's 5 × 600 s, about 50 minutes of clock time recovered, in one session. Multiplied across the months of March and April, that near-hour-per-session effect is a substantial chunk of the 5,400-contribution swing visible in the chart at the top of this post.

Knowledge graph at Layer 3

P6's bookkeeping treats knowledge as a managed lifecycle, the same shape as Mem0, EverMemOS, and Memory-R1, but with bi-temporal metadata, contradiction detection, and a Nous gate that scores items on novelty (0–3) + specificity (0–3) + relevance (0–3). Items below 2/9 are discarded. Items at or above 5/9 promote to Layer 3 entity pages. Items above 7/9 fast-path. Contradiction detection runs on every promotion.
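The gate's thresholds translate directly into a small scoring function. This is a sketch under stated assumptions: promotion is taken as 5/9 or higher (matching the P6 row of the primitives table), and since the post doesn't say what happens to scores of 2 through 4, the "hold" outcome is a placeholder I've invented. `nous_gate` is a hypothetical name.

```python
def nous_gate(novelty: int, specificity: int, relevance: int) -> str:
    """Classify an item by its summed 0-9 score.

    Returns one of "discard", "hold", "promote", "fast-path".
    """
    for name, value in (
        ("novelty", novelty),
        ("specificity", specificity),
        ("relevance", relevance),
    ):
        if not 0 <= value <= 3:
            raise ValueError(f"{name} must be in 0..3, got {value}")

    score = novelty + specificity + relevance
    if score > 7:
        return "fast-path"  # above 7/9
    if score >= 5:
        return "promote"  # 5/9 or higher goes to Layer 3 entity pages
    if score < 2:
        return "discard"  # below 2/9
    return "hold"  # assumption: the post does not specify the 2-4 band


for scores in [(3, 3, 2), (2, 2, 1), (1, 1, 1), (1, 0, 0)]:
    print(scores, "->", nous_gate(*scores))
```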

The graph never holds garbage. That's the invariant.

The dogfood loop — evidence

Here's the thing I find most interesting: bstack was built by using bstack to build bstack.

This single session shipped:

  • 13 pull requests merged across 4 repositories
  • 108 tests added/modified in broomva/p9 (46 unit + 15 integration + 6 chaos + 25 fix-cycle + 16 actuator)
  • Matrix CI green on Python 3.11 / 3.12 / 3.13
  • Two new bstack primitives (P7, P8) added in response to bugs the system surfaced about itself — and two more (P10, P11) crystallized in the parallel work that ran while this post was being written
  • One architectural realignment (drop P3/P4 Spaces/Lago, renumber 9→7, then add 8 and 9 → 11 — a portable contract that the doctor enforces and the bootstrap scaffolds)

PR throughput: traditional sleep-then-poll vs bstack productive-wait

The interesting moments weren't the PRs themselves. They were the safety pre-passes triggering correctly:

  • The P2 control gate blocked my own git reset --hard mid-session — the agent that built the gate hit the gate, and the gate held.
  • The P9 auto-merge actuator self-blocked on PR #40 because it touched CLAUDE.md (governance-class path → require_human). The actuator I shipped two PRs earlier refused to ship the next PR. Manually approved with full context — exactly the intended human-in-the-loop behavior.
  • The P8 janitor I shipped at the end of the session dropped a stale feat/mil-tier2-foundations branch that had been sitting around for weeks, in the very next operation after merge.
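The PR #40 behavior above comes down to a path check that runs before any merge. A minimal sketch, assuming the governance-class set is exactly the three governance files named earlier in this post. The real actuator's rule set is not shown, and `merge_decision` is a hypothetical name.

```python
from pathlib import PurePosixPath

# Governance files named in the post. Treating exactly these three as the
# "governance-class" set is an assumption for illustration.
GOVERNANCE_PATHS = {"CLAUDE.md", "AGENTS.md", ".control/policy.yaml"}


def merge_decision(changed_files, checks_green: bool) -> str:
    """Decide what an auto-merge actuator may do with a PR.

    Returns "wait" if checks are not green, "require_human" if any changed
    file is governance-class, and "auto_merge" otherwise.
    """
    if not checks_green:
        return "wait"
    normalized = {PurePosixPath(path).as_posix() for path in changed_files}
    if normalized & GOVERNANCE_PATHS:
        return "require_human"
    return "auto_merge"


print(merge_decision(["src/watcher.py", "tests/test_watcher.py"], checks_green=True))
print(merge_decision(["CLAUDE.md", "src/watcher.py"], checks_green=True))
print(merge_decision(["src/watcher.py"], checks_green=False))
```

Because the decision is a pure function of the changed paths and the check status, the actuator applies the same rule to its own author's PRs as to anyone else's.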

A demo of the productive-wait pattern in action:

Demo: productive-wait drains the queue while CI runs

The agent never sleeps. The watcher runs in background. The agent pulls the next deferred work item, does it, and by the time the bg-task notification fires, the next PR is already half-prepared.

The honest accounting

The 6,519-contribution number is not me typing 6,519 times. It's me + Claude Code + bstack acting as a coupled system, where:

  • I make decisions, frame problems, and set direction
  • Claude Code does the actual code-writing labor at high throughput
  • bstack is the scaffolding that makes that throughput safe and reproducible

Without the substrate, you get the cautionary tale: an agent that runs git reset --hard because it thought it was a sandbox, ships untested code, accumulates worktree garbage, sleeps on CI for 10 minutes per PR, and produces a knowledge graph that's mostly noise. The number stays smaller because each failure costs a session's worth of recovery work.

With the substrate, those failure modes don't fire — the gate refuses the destructive command before execution, the auto-merge actuator refuses to merge governance-class paths without human approval, the janitor cleans up squash-merged branches automatically, the productive-wait pattern recovers the CI hour-per-session loss. Each primitive removes a specific entropy source. Compounded, they look like a step function.

This is, mechanically, what Altman and Rauch are describing when they say programmers are 10× more productive. The 10× is real but it isn't free. Without the substrate, you don't get to 10×; you get to 2× and a much higher rate of broken things. The substrate is the load-bearing piece that turns the multiplier from "intermittent burst" into "sustained operating regime."

What bstack doesn't solve yet

Honest list, in priority order:

  1. Per-agent identity and credential scoping. bstack treats agent ≡ user. Anthropic's three-agent harness and Google Scion both isolate per-agent. Tier C work.
  2. Sandbox isolation. P5 worktrees are git-isolation, not process-isolation. Daytona (running ~100 YC startups) and E2B (~10% of W26) handle this. bstack should wrap, not reinvent.
  3. Durable execution. Only P9 has a state.jsonl checkpoint. Agent intent and plan aren't checkpointed; a session crash loses them.
  4. Managed forget/compress. P6 promotes; it doesn't forget. Auto-memory grows unboundedly. Mem0 / EverMemOS have explicit forget ops we should borrow.
  5. In-session phase indicator. TAO and Plan-Execute-Verify expose "step N of M". We don't surface this; the agent re-derives every turn.
  6. Pre-CI test-driven execution. bstack does TDX in CI (P9); pre-CI it doesn't. "Run test → see error → refine" closed loop in-session would close this.

These are roadmap items, not weaknesses. Most of them are well within reach for someone reading this post who wants to contribute one.

Coverage map: paradigm gaps × bstack primitives

Why we share it freely

The Vercel CEO is right that a small number of operators with the right tools will outproduce traditional teams by orders of magnitude. The natural question is: if you've found one of those substrates, why share it?

Three reasons.

Compounding. Every operator that adopts bstack and runs into a primitive-shaped failure mode is a free debug case. P7 and P8 were born from a session where the existing primitives surfaced their own bugs; P10 and P11 followed in the next round. Deploy that across hundreds of operators and the substrate gets sharper faster than any closed team can match. The OSS distribution is itself a meta-primitive we haven't formally named — call it "the contributor loop is the harness's f₃ dynamics function."

Calibration. The Altman-Rauch-IEEE framing is real but it's also suspiciously easy to claim. "I'm a 100x engineer now" is unfalsifiable in private. Publishing the substrate opens it to comparison: anyone can install it, run their own session, and report back whether their throughput moved. If it doesn't, that's a real critique we should hear.

Distribution > extraction. The traditional shape — I have a productivity edge, I'll keep it private to gain market advantage — is the wrong shape for this regime. The market is shifting fast enough that the value isn't in the substrate's secret bits but in the operator's compound learning rate. Open-sourcing the substrate doesn't slow the operator down; it speeds up the field, which means the operator's edge is sharper relative to anyone trying to lag-follow.

Same shape Vercel is taking with v0 + the skills marketplace. Same shape Anthropic is taking with the engineering blog. The pattern is consistent: the operators who are visibly hitting the productivity inflection are the ones publishing their substrate, not hoarding it. We're trying to be in that group.

Closing

In one paragraph:

bstack is the body around the brain. Reasoning is saturating; the environment is the bottleneck. The 2026 paradigm rediscovered governance, harness engineering, and context engineering — three threads that all say "the work is in the substrate." bstack is one concrete answer: eleven portable primitives that any agent-driven workspace can install, with formal stability budgets at the governance layer, reflexive rules that bind without rigidity, a productive-wait pattern that recovers the time CI takes from your agent's life, worktree-hygiene discipline that keeps the tree clean across the PR lifecycle, and an empirical-feedback reflex that refuses to call work done until the agent has interacted with the deployed result. The chart near the top of this post is what the substrate produced when one operator turned it on.

The install is at the top. The full primitive spec is at references/primitives.md. The repo is broomva/bstack. The RCS formalization with stability proofs lives in research/rcs/papers/p0-foundations.

If you want to contribute, the gaps in §"What bstack doesn't solve yet" are real, scoped, and high-leverage. Especially the durable-execution and sandbox-integration ones — those would land in a single PR each and would close paradigm-level gaps.

The thesis I'm willing to defend: the next decade of agentic work happens in workspaces that are themselves intelligent. Not because the model got smarter, but because the substrate did. The contribution graph near the top of this post is what one operator's throughput looks like when the substrate is right. bstack is one substrate. Build a better one. Critique this one. But stop optimizing prompts in a system that drops state, sleeps through every CI loop, and accumulates worktree garbage. Build the body.


The chart at the top is generated from gh api graphql queries against the broomva user, run on 2026-05-06. Reproducible — anyone can pull their own contribution trajectory the same way. Sources for the industry framing: Sam Altman at the Federal Reserve conference; Guillermo Rauch on hiring AI-fluent engineers (Bloomberg Línea Argentina reel) and on v0's throughput (How I AI podcast, Feb 2026); IEEE Spectrum and Emergent VC on the 100x/1000x engineer framing.
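For anyone who wants to pull their own numbers, here is one way the query could look. The exact query behind the chart is not reproduced in this post, so treat this as an assumption: it uses GitHub's `contributionsCollection` GraphQL field and the `gh api graphql` command, and `build_gh_command` and `fetch_total` are hypothetical helper names. Running `fetch_total` needs an authenticated gh CLI.

```python
import json
import subprocess

# One calendar year per request; contributionsCollection spans at most a year.
QUERY = """
query($login: String!, $from: DateTime!, $to: DateTime!) {
  user(login: $login) {
    contributionsCollection(from: $from, to: $to) {
      restrictedContributionsCount
      contributionCalendar { totalContributions }
    }
  }
}
"""


def build_gh_command(login: str, year: int) -> list[str]:
    """Build the gh CLI argument list for one year of contribution totals."""
    return [
        "gh", "api", "graphql",
        "-f", f"query={QUERY}",
        "-f", f"login={login}",
        "-f", f"from={year}-01-01T00:00:00Z",
        "-f", f"to={year}-12-31T23:59:59Z",
    ]


def fetch_total(login: str, year: int) -> int:
    """Run the query via an authenticated gh CLI and return the calendar total."""
    result = subprocess.run(
        build_gh_command(login, year), capture_output=True, text=True, check=True
    )
    data = json.loads(result.stdout)
    collection = data["data"]["user"]["contributionsCollection"]
    return collection["contributionCalendar"]["totalContributions"]


# Printing the command instead of running it keeps this sketch offline.
print(" ".join(build_gh_command("broomva", 2026)[:3]), "...")
```

Calling `fetch_total` once per year from 2022 to 2026 would give a yearly series to compare against the table near the top.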

Special thanks to the agents that built this with me. Their session transcripts are in ~/broomva/docs/conversations/, captured automatically by P1. The synthesis memory entry that drove this post is at project_bstack_paradigm_synthesis.md. Both will be checked at the start of the next session, and the next, and the next.

The dogfood loop is closed.
