Four Days, $1,216 of Opus: Measuring a Solo Robotics Sprint on Claude Code Max
This post documents the resource envelope of a solo robotics sprint I ran between April 17 and April 21, 2026. It is not a retrospective and it is not a recommendation per se. It is a measurement — four and a half days of contiguous shipping on Claude Code Max, with Claude Opus 4.7 (1M context) as the primary planner — published because the cost-per-output numbers are now sufficiently uncommon that they are worth writing down, and because the working style that produced them maps closely to published guidance from Thariq Shihipar on managing the 1M context window. Where my practice diverges from that guidance I will mark the divergence explicitly.
The intended reader is another solo developer evaluating whether a similar operating envelope is feasible for their own work. I will try to make it easy to tell which claims generalize and which are specific to this sample.
1. What shipped
Commits and releases between 2026-04-17 11:23 PDT (first commit in robot-md) and 2026-04-21 19:00 PDT (time of writing):
- `robot-md` — ROBOT.md format specification, JSON Schema, Python CLI (validate/render/context), four worked examples, SessionStart hook for Claude Code. 26 semver releases (v0.1.0 through v0.8.0) across the four days, including substantial bug-fix work on physical hardware: v0.7.2 fixed stereo-depth hole survival in HSV detection (painted LEGOs produce zero-depth pixels that used to break `vision.find`); v0.7.3 closed the extrinsic-bootstrap cliff with a motion-delta detector that does not project through any extrinsic; v0.8.0 added MCP observability (`recent_invocations` and `recent_errors` resources, streaming `discover` with per-step progress notifications, structured `PreconditionFailure` payloads with actionable `suggested_fix` hints). 273 commits.
- `robot-md-mcp` v0.1 — TypeScript MCP server exposing ROBOT.md as resources to Claude Desktop and `invoke_skill` as a tool. Shipped over two calendar days (04-17 / 04-18). 26 commits.
- `robot-md-pendant` v0.1 — handheld pendant project on a Waveshare ESP32-C6 Touch AMOLED 1.8” (~$35, onboard mic + speaker, Wi-Fi 6 + BLE5, single-core RISC-V) with a Python `pendantd` daemon on a Raspberry Pi 5. Three interaction modes land in v0.1: tap (configurable preset-button grid), speak (push-to-talk or a custom “Claude” wake word; Whisper → Claude → `robot-md-mcp`), and type. A heartbeat watchdog on the Pi halts the arm if the pendant disappears (a minimal sketch follows this list). The agent bridge uses the Claude Agent SDK (not Claude Code) on the daemon side. Firmware flashed over native USB-JTAG; a stuck BOOT button on one dev board ate most of an afternoon. Shipped in a single day (2026-04-20), merged from a firmware branch and a daemon branch. 64 commits on main; ~100 across both feature branches.
- `robotmd.dev` — landing page, spec publication, RCAN 3.0+ version pin, registration module for minting RRNs via the Robot Registry Foundation.
- A private planning repository: 1 commit.
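Since the heartbeat watchdog is the pendant’s one safety-critical piece, here is a minimal sketch of the shape of that loop. The UDP transport, the port, and `halt_arm()` are illustrative stand-ins, not the shipped `pendantd` code:

```python
# Minimal sketch of a pendant heartbeat watchdog. The transport (UDP) and
# halt_arm() are illustrative assumptions, not the shipped pendantd code.
import socket

PENDANT_PORT = 9876   # hypothetical heartbeat port
TIMEOUT_S = 2.0       # halt if no beat arrives within this window

def halt_arm() -> None:
    # Placeholder: the real daemon would command the servo bus to a safe stop.
    print("pendant lost: halting arm")

def watchdog() -> None:
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", PENDANT_PORT))
    sock.settimeout(TIMEOUT_S)
    while True:
        try:
            sock.recv(64)   # any datagram from the pendant counts as a beat
        except socket.timeout:
            halt_arm()
            return

if __name__ == "__main__":
    watchdog()
```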
Total: 364 commits across five repositories over 4.5 days. Daily breakdown by commit count (all repos combined): 04-17: 41, 04-18: 93, 04-19: 94, 04-20: 130 (pendant-day), 04-21: 7. Aggregated diff size across the three primary repos (robot-md, robot-md-pendant, robot-md-mcp): +80,935 insertions, −5,263 deletions. The insertion count includes tests, fixtures, lock files, bundled schema, and generated landing-page assets; hand-written-only would be substantially smaller. Ballpark: roughly 18K raw insertions per active day.
Ecosystem scope: one format specification, three runtimes (CLI, MCP server, pendant daemon), one firmware image, one documentation site.
Representative commit messages from the sprint, for reader calibration on the granularity of each unit of work:
feat!: precondition evaluator returns structured PreconditionFailure records
feat(pendantd): MCP stdio bridge with initialize/list/call
fix(pendantd): drop MCP stderr to DEVNULL to avoid deadlock on chatty servers
feat(init): auto-calibrate `ready` pose from DH + IK at init time
schema: tighten rcan_version pattern to 3.0+ only
ci(firmware): ESP-IDF build for esp32c6 + host protocol tests
test(integration): xfail test_v061_pipeline_end_to_end — unreachable IK target
plan: robot-md-mcp v0.1 implementation — 11 tasks
None of these is a megacommit. Each is a bounded unit of work that landed green.
2. Resource envelope
Token usage and cost, derived from ccusage over the interval 2026-04-17 through 2026-04-21 (inclusive, partial for 04-21):
| Date | Total tokens | Output tokens | List-price cost | Opus 4.7 share of cost |
|---|---|---|---|---|
| 2026-04-17 | 505.8M | 1.59M | $338.21 | 100% |
| 2026-04-18 | 571.5M | 1.16M | $327.29 | 96.8% |
| 2026-04-19 | 432.7M | 1.21M | $276.25 | 98.6% |
| 2026-04-20 | 376.5M | 1.48M | $258.58 | 95.8% |
| 2026-04-21* | 10.97M | 66.1K | $15.25 | 100% |
| Total | 1.90B | 5.51M | $1,215.59 | ~97% |
*04-21 partial, to time of writing.
Three observations on the table.
List-price framing. I am on the Claude Code Max plan. The $1,215.59 is what the identical usage would have cost on the pay-as-you-go API; my actual out-of-pocket is a fixed monthly subscription. The headline number is therefore the consumption figure, not the cost figure. I am using it here because it is the only number that allows readers on other pricing tiers to calibrate against their own.
Cache read share. Of the 1.90B tokens, 1.86B (~98%) were served as cache reads. This is the single most load-bearing statistic in the table. On current Anthropic pricing, cache-read tokens are billed at 10% of the standard input price — a 90% discount (5) — and they avoid per-turn prefill latency. The commit cadence observed here is not feasible at any reasonable point on the cost-latency curve if the bulk of context must be re-ingested each turn. The commit log therefore co-varies with cache-hit rate, not with raw token budget. This is consistent with Thariq Shihipar’s mantra “Prompt Caching Is Everything” (1) and I will return to it in Section 5.
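To make the discount concrete, a back-of-envelope in Python. The input price is a placeholder, not a quote of Anthropic’s list price; the point is the ratio between the cached and uncached scenarios, and it deliberately ignores output tokens and cache-write premiums:

```python
# Back-of-envelope: effective input cost with cache reads billed at 10%
# of the standard input price. input_price is a placeholder, not a quote.
input_price = 15.0                       # $/M input tokens (illustrative)
cache_read_price = 0.10 * input_price    # the 90% discount

total_tokens = 1.90e9
cache_reads  = 1.86e9                    # ~98% of the sprint's tokens
uncached     = total_tokens - cache_reads

with_cache    = (cache_reads * cache_read_price + uncached * input_price) / 1e6
without_cache = total_tokens * input_price / 1e6
print(f"with cache:    ${with_cache:,.0f}")     # ~$3,390 at these placeholders
print(f"without cache: ${without_cache:,.0f}")  # ~$28,500 — roughly 8.4x more
```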
Model distribution. Roughly 97% of cost was Opus 4.7 — a model Anthropic released on 2026-04-16, one day before the first commit in this sprint (6). Opus 4.7 ships the 1M context window at standard API pricing with no long-context premium, supports task budgets (a running-countdown abstraction the model uses to prioritize work inside an agentic loop), and — per Anthropic’s release notes — shows a 13% lift on coding benchmarks and 3× more production tasks resolved vs. Opus 4.6. The balance of my token spend (Sonnet 4.6 and Haiku 4.5) was subagent traffic. The primary planner was Opus throughout. Claude Code itself was on CLI version 2.1.117 at the time of writing.
Grepping my session JSONL files shows 545 Agent-tool dispatches (subagent calls) and 12 advisor() calls (§6.1). The 545 dispatches break down by subagent type as: 419 general-purpose (77%), 75 feature-dev:code-reviewer (14%), 49 superpowers:code-reviewer (9%), 1 Explore, 1 claude-code-guide. I expected Explore to dominate; it did not. In practice the general-purpose agent carries most bounded research and scaffolding work, and the two code-reviewer variants (23% combined) handle the review tax I used to pay as a human re-read of my own diffs. Explore appeared once — for a broad codebase question where I wanted an agent specialized for pattern-hunting; every other research task converged on general-purpose by default.
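For readers who want the same per-type breakdown, a sketch of how dispatches can be counted from the session files. It assumes Agent tool_use blocks appear under `message.content` with an `input.subagent_type` field; the log format is undocumented and may change, so verify against a few lines of your own JSONL before trusting the counts:

```python
# Count Agent-tool dispatches by subagent type across Claude Code session
# logs. Assumes tool_use blocks live under message.content and carry an
# "input.subagent_type" field — verify against your own JSONL first.
import json
from collections import Counter
from pathlib import Path

counts: Counter = Counter()
for path in Path.home().glob(".claude/projects/*/*.jsonl"):
    for line in path.read_text().splitlines():
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        content = (record.get("message") or {}).get("content")
        if not isinstance(content, list):
            continue
        for block in content:
            if isinstance(block, dict) and block.get("name") == "Agent":
                counts[(block.get("input") or {}).get("subagent_type", "?")] += 1

for agent_type, n in counts.most_common():
    print(f"{n:5d}  {agent_type}")
```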
Pre-sprint baseline for comparison. Over 2026-04-10 through 04-16 (the seven days immediately preceding the sprint, before Opus 4.7 was available) my Claude usage across unrelated work was $437.54 total, 2.53M output tokens, ~$173 per million output tokens, 89.5% cache hit rate. The closest single-day Opus-on-Opus comparison point is 2026-04-14 — a high-Opus-share day on 4.6 — which ran at $257 per million output tokens with 85.1% cache hit. The sprint days ran at roughly $220 per million output tokens with 98% cache hit. On that apples-to-apples comparison, Opus 4.7 + sprint discipline produced a ~14% reduction in dollars per million output tokens and a 13-percentage-point lift in cache hit rate. The output-token efficiency delta is model-attributable; the cache-hit-rate delta is workflow-attributable. Both matter, but the second is the larger lever.
3. Design choice: why robot-md rather than extending OpenCastor
The scope shipped over these 4.5 days is only feasible because the project was defined to avoid building an agent harness. The relevant prior work is OpenCastor — a full harness for embodied AI that I maintain separately: reactive layer under 1 ms, tiered brain, safety kernel, RCAN router, Swarm orchestrator, Sisyphus self-improvement loop, 7,334 tests at the time of writing. OpenCastor is appropriate for deployments where actions are physically irreversible and fleet-scale orchestration is required. I wrote about that framing in February.
ROBOT.md takes the inverse design decision. It is one declarative file at the robot’s root, read once at session start by whatever agent harness the user is already running (Claude Code SessionStart hook, Claude Desktop MCP, ChatGPT custom-GPT instructions field, Ollama SYSTEM directive, etc.). For single-robot setups, the built-in Bash tool of the existing harness becomes the dispatch path — lsusb for discovery, serial writes to a servo bus, curl to a camera’s HTTP endpoint. There is no separate runtime process.
Three tiers are defined: Tier 0 (Claude Code + ROBOT.md alone), Tier 1 (add user scripts), Tier 2 (add OpenCastor for production / regulated deployments). The composition rule is additive — users do not re-author the manifest when graduating between tiers.
The four-day scope would not have been achievable for an OpenCastor-sized effort. It was achievable for ROBOT.md because the zero-friction decision eliminated the layers of runtime work that a harness requires. I do not argue this generalizes beyond the category; I argue only that for single-robot and bring-up workflows, the harness is often mass the robot does not need to carry.
4. The Claude Code features that were load-bearing
The workflow was Claude Code on the Max plan, with Opus 4.7 (1M context) as the primary planner. Several features of Claude Code that shipped or matured in the preceding two quarters were load-bearing for the cadence. I list them not to be comprehensive, but to make explicit which surfaces the measurement depends on:
- SessionStart hooks. A shell script whose stdout is prepended to the session context. ROBOT.md’s own integration story is this hook — which means the feature I was building is the feature I was using. I also use the hook to surface project memory, a git snapshot, and an auto-memory index (a minimal hook sketch follows this list).
- Skills. Versioned, opinionated process modules (e.g. `superpowers:brainstorming`, `superpowers:writing-plans`, `superpowers:executing-plans`, `superpowers:subagent-driven-development`, `superpowers:verification-before-completion`, `superpowers:systematic-debugging`). Invoked by name; they fire when preconditions match the task shape.
- Subagents. The Agent tool dispatches specialized agents (`general-purpose`, `feature-dev:*`, `Explore`, code-reviewer variants, and custom user-defined types) with their own context windows. Final synthesis returns to the parent; intermediate output stays in the subagent. In this sprint, 77% of dispatches went to `general-purpose`; see §2 for the full breakdown.
- Deferred tools / ToolSearch. The tool namespace is now large enough that eager loading would waste thousands of tokens per turn. Deferred tools are fetched by name on demand.
- Auto-memory. Persistent files under `~/.claude/projects/<slug>/memory/` hold typed records (user / project / feedback / reference). A session started during this sprint inherited the fact that v0.8 had shipped, that the SO-ARM101’s wrist-flex servo stalls at high sustained angles, and — most load-bearing — a captured feedback memory titled “Zero-friction first, optional precision later” whose body records the exact user correction that formed the principle: “too many steps for most users” and “optional accuracy gains for quicker improvements for teaching can be implemented later.” That correction happened in an earlier brainstorm session. Memory made it a standing rule; every subsequent design decision in the sprint (v0.7.3 extrinsic-free bootstrap, v0.8.0 `doctor` warnings rather than errors, the Tier 0/1/2 composition rule itself) tests against that rule without re-litigation.
- Opus 4.7 with 1M context. Both the capability and the context extension; see Section 5 for session-management implications.
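As promised above, a minimal SessionStart hook in the spirit of the first bullet. The hook contract is simply “stdout gets prepended to the session context”; the paths and the snapshot format below are my illustration, not the exact hook I ran:

```python
#!/usr/bin/env python3
# Minimal SessionStart hook sketch: whatever this script prints becomes the
# top of the session context. Paths and output format here are illustrative.
import subprocess
from pathlib import Path

# 1. The robot manifest, read once at session start.
robot_md = Path("ROBOT.md")
if robot_md.exists():
    print(robot_md.read_text())

# 2. A one-line git snapshot so the planner knows where the repo stands.
head = subprocess.run(
    ["git", "log", "-1", "--format=%h %s"],
    capture_output=True, text=True,
).stdout.strip()
if head:
    print(f"git HEAD: {head}")
```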
The ship dates cluster tightly. Claude Code reached general availability in May 2025 alongside Claude 4. The Skills system shipped as `skills-2025-10-02` beta in October 2025. Opus 4.7 with the 1M context window and task budgets landed on April 16, 2026 — the day before this sprint began. The “preceding two quarters” framing is close to literal: apart from Claude Code’s own GA, none of the load-bearing features above was publicly available before October 2025, and even the oldest of them had fewer than seven months of field maturity at the time of the sprint. The Max plan makes the economics possible. The harness improvements, all shipped inside an eleven-month window, make the cadence possible.
5. The observed workflow vs. Thariq Shihipar’s recommendations
Thariq Shihipar, on the Claude Code team at Anthropic, has published specific guidance for managing the 1M context window. His central claim is that the 1M window is “a double-edged sword” — it allows more complex tasks but invites context pollution when sessions are not managed (2). Anthropic’s session-management documentation elaborates the toolkit (3), with Shihipar defining the problem directly: “Context rot is the observation that model performance degrades as context grows because attention gets spread across more tokens, and older, irrelevant content starts to distract from the current task.”
I will walk through his recommendations one by one and describe what I actually did over these 4.5 days. Where my practice diverges, I flag it.
5.1. /rewind (Esc Esc) over inline correction.
Shihipar’s guidance: when an attempt fails, jump back and re-prompt rather than correcting in place.
Observed practice: used consistently on implementation missteps, less so on design missteps (where I let the advisor call, Section 6, do the correction). Net alignment: high.
5.2. /compact with steering hints.
Shihipar’s guidance: when compacting, supply direction — e.g. /compact focus on the auth refactor, drop the test debugging.
Observed practice: partial alignment. On long sessions I used /compact with a one-line hint. On shorter sessions I relied on /clear and written plans to restore context from disk.
5.3. Proactive compaction.
Shihipar’s guidance: compact before auto-compaction triggers, on the premise that “due to context rot, the model is at its least intelligent point when compacting” — and that “with one million context, you have more time to /compact proactively with a description of what you want to do.”
Observed practice: divergence. I did not compact proactively. My approach was to terminate the session early, commit, write an updated plan to disk, and resume in a new session. The written plan is the compaction; the session does the execution. This achieves Shihipar’s intent by a different mechanism, and I suspect it is strictly better for multi-day projects because the compressed state is reviewable and diffable. It is strictly worse for short, exploratory tasks where writing a plan is overhead.
5.4. /clear at task boundaries.
Shihipar’s guidance: “when you start a new task, you should also start a new session.” (3)
Observed practice: partial alignment, not what I initially assumed. ccusage session over this interval shows one long-running parent session at my root cwd accounting for the bulk of consumption (~2.1B aggregated tokens), plus five subagent-aggregation entries for dispatched work. I ran fewer top-level sessions than I would have predicted; the 1M context window let a single planner carry state across multiple task boundaries that I would previously have broken into separate sessions. I now read Shihipar’s guidance as: start a new session when the task’s context is not relevant to the next task. Where tasks compound (spec → CLI → MCP server → landing page, all referencing the same manifest format) the 1M window can legitimately hold them together. Where they don’t, start fresh.
5.5. Subagent delegation for disposable intermediate output.
Shihipar’s decision framework: “The mental test we use at Anthropic: will I need this tool output again, or just the conclusion?” If only the conclusion, delegate.
Observed practice: high alignment. A grep of the sprint’s session JSONL files counts 545 Agent-tool dispatches across the 4.5 days — 419 to general-purpose (bounded research, scaffolding, one-off code writes I did not want polluting the parent context), 124 across the two code-reviewer variants (review tax on my own diffs), 1 each to Explore and claude-code-guide. The Sonnet 4.6 and Haiku 4.5 lines in the stats table map to these subagent calls.
5.6. Spec-based session design.
Shihipar’s recommendation: start with a minimal spec, have Claude interview you with the AskUserQuestion tool, then open a new session to execute the spec (4).
Observed practice: high alignment, implemented via the superpowers:brainstorming skill rather than raw AskUserQuestion. The pendant project and the MCP server both went spec → interview → new session → execute. In both cases the interview surfaced decisions I had not written down (pendant: firmware and daemon are separate repos; MCP: the tool surface should land invoke_skill, not run_skill).
5.7. Prompt caching as a primary optimization target.
Shihipar’s framing: “Prompt Caching Is Everything” (1).
Observed practice: this is what the 98% cache-read number at the top of the post measures. The engineering discipline behind it is mostly negative: do not disturb the cache. That means ordering tool calls so stable context comes first, not rewriting files that sit at the top of the context window, not shuffling the SessionStart hook payload between turns, and preferring Edit over Write for files that already exist.
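The mechanics behind that discipline: prompt caching matches on exact prefixes, so any edit near the top of the assembled context invalidates every cached token after it. A toy illustration of the principle (not tooling I ran during the sprint):

```python
# Toy illustration of why "do not disturb the cache" works: caching matches
# on exact prefixes, so the earlier two turns diverge, the less is reusable.
def cache_break_point(prev_context: str, curr_context: str) -> int:
    """Index where this turn's context first differs from the last turn's."""
    n = min(len(prev_context), len(curr_context))
    for i in range(n):
        if prev_context[i] != curr_context[i]:
            return i
    return n  # pure append: the entire previous context is cache-reusable

prev = "SYSTEM + ROBOT.md + tools + turn 1"
appended = prev + " + turn 2"
reordered = "tools + SYSTEM + ROBOT.md + turn 1 + turn 2"
print(cache_break_point(prev, appended))   # == len(prev): full prefix reuse
print(cache_break_point(prev, reordered))  # 0: everything re-ingested
```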
Aggregate: my observed practice was fully aligned with Shihipar’s recommendations on four of seven points (/rewind, subagent delegation, spec-based design, prompt-caching discipline), partially aligned on two (/compact steering, /clear at task boundaries — see §5.4 for the revised reading), and divergent on one (proactive in-session compaction, which I substituted with disk-persisted plans). The compaction divergence is the only one I would argue generalizes; the partial-alignment gaps are empirical observations of how the 1M window changed my behavior, not a disagreement with the principles.
6. Additional practices not covered by Shihipar’s Claude Code guidance
Three practices were load-bearing for this sprint and are not directly covered by the Claude Code session-management literature:
6.1. Advisor checkpoints before pivots.
Before any non-trivial change of approach, I invoked an advisor() call — a tool that forwards the full session to a stronger reviewer model. Grep of the session JSONLs counts 12 advisor calls across the 4.5 days, roughly one every nine hours of active work. On 2026-04-21 I was about to commit to grasp-confirmation as the v0.9 theme when the advisor flagged the outstanding RCAN 3.0 compliance audit as the higher-priority blocker. The resulting cross-repo audit identified eight specific gap categories between what robot-md v0.8.0 declared (`rcan_version: "3.0"` in every shipped manifest) and what it implemented, among them: no payload signing at all (`register` still POSTs plain JSON), a schema admitting 2.1/2.2.x manifests (fixed in commit cf25f6a before the spec was written), a nullable `fria_ref` making EU AI Act Annex-III gating impossible, and entirely absent schema fields for §23 Safety Benchmarks, §24 IFU, §25 PMM, and §26 EU Register. The advisor call produced three durable artifacts within a few hours: an updated roadmap spec, a v0.9 compliance design spec naming v1.0.0 as the compliance declaration, and an implementation plan for v0.9.0 (RCAN-py dependency + schema slots, additive, no enforcement). Cost: roughly 90 seconds of advisor latency. Counterfactual cost of not pivoting: approximately one week of mis-sequenced work, plus the credibility risk of shipping additional features on top of an under-implemented 3.0 claim.
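The advisor() tool itself is private, so the following is only a sketch of the idea under stated assumptions: the Anthropic Python SDK’s `messages.create` call is real, but the model id and the prompt are placeholders I made up for illustration:

```python
# Sketch of the advisor() idea: forward the session so far to a second model
# and return its review. The SDK call is real; model id and prompt are
# placeholders, not the private tool described in the post.
import anthropic

def advisor(transcript: str) -> str:
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the env
    response = client.messages.create(
        model="claude-opus-4-7",    # placeholder model id
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": (
                "You are reviewing an agent session before a pivot. Flag "
                "anything higher-priority than the proposed next step.\n\n"
                + transcript
            ),
        }],
    )
    return response.content[0].text
```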
6.2. Verification-before-claim discipline. I run the test, read the output, then update task status — enforced by a skill. A nontrivial fraction of bugs that would previously have surfaced in CI or in production surface instead within 90 seconds of writing the code. I cannot quantify this from the commit log alone but I flag it because it is the single practice I would recommend to a reader who takes nothing else from the post.
6.3. Commit cadence as a recoverability mechanism. Eighty-one commits per day is not aesthetic. It is the cheapest available backup-and-rollback primitive, and its value scales nonlinearly with the rate at which the model is generating changes. Every green state is a commit. The cost of committing is negligible. The cost of not committing shows up the first time a local change is wrong in a way that is hard to diff against prior state.
7. Limits of this measurement
Four caveats on how far the measurement generalizes:
- N = 1 sprint, one operator, one project type. The numbers are a sample, not a distribution. A robotics-spec project is unusually well-suited to a 1M context and to cache-oriented workflows; a research-heavy or debugging-heavy task would likely show different ratios.
- Plan pricing is volatile. The Max-plan economics that make this feasible are Anthropic-specific and may move. The ratio of cache-read to total tokens should generalize to any provider supporting prompt caching at comparable discount.
- Opus 4.7 vs. 4.6 is partially separable. The pre-sprint comparison in §2 attributes the dollar-per-output-token improvement (~14%) largely to the model upgrade and the cache-hit-rate jump (85% → 98%) largely to workflow changes. These are correlation estimates, not controlled A/B results — the tasks differed and the sample sizes are small. The claim is directional, not quantitative.
- Published recommendations are a moving target. Shihipar’s specific commands (`/rewind`, `/compact`, `/clear`) are features of the current Claude Code; the principles behind them (prune, delegate, compact proactively) should outlast the commands.
8. Takeaways for a reader considering the same operating point
In order of confidence that the claim will transfer:
- Cache-read rate is the primary cost-and-latency lever. Measure it. 98% is a useful target; below ~90% the economics degrade quickly on any pay-as-you-go plan. Reaching it is mostly a matter of not disturbing stable context.
- Delegate intermediate output to subagents. Apply Shihipar’s test: if you will not need the raw output again, the operation belongs in a subagent.
- Start a new session when the context is no longer relevant, not at every task boundary. This is a revision from my original reading of Shihipar’s guidance. The 1M window in Opus 4.7 legitimately carries compounding, shared-context tasks (spec → CLI → docs) across what would previously have been session breaks. Reserve `/clear` for genuine context shifts.
- Write plans to disk before long sessions; let the disk be your compaction layer. This diverges from Shihipar’s proactive `/compact` advice and I think it is the correct divergence for work that will span multiple days.
- Verify before claiming done. Enforced by tool or by habit. This is the single practice with the highest confidence of transfer.
- Use a stronger reviewer (advisor / second model) at pivot points only. The value is concentrated in low-frequency, high-stakes decisions.
- Opus 4.7 is worth the premium if you are planner-bottlenecked rather than cost-bottlenecked. The Max plan makes this choice availability-limited rather than cost-limited.
- If you ship a spec rather than a runtime, four days can cover a surprising amount of ground. This is a choice about the project, not about the tooling.
Four and a half days; 1.90B tokens, of which 1.86B were cache reads; $1,215.59 of list-price Opus 4.7 against a flat Max subscription; 364 commits across five repositories; one specification, three runtimes, one firmware image, one documentation site.
The specific numbers are unlikely to generalize. The ratios probably will.
Appendix: reproducing these numbers
For readers who want to compute the same measurements on their own sessions, the exact commands used here are:
```bash
# Daily Claude cost / token breakdown (JSON)
npx ccusage@latest daily --since 20260417 --until 20260421 --json

# Count Agent tool dispatches (subagent calls) in session files
find ~/.claude/projects/<slug> -name '*.jsonl' -newer <ref-file> \
  | xargs grep -h '"name":"Agent"' | wc -l

# Count advisor() calls
find ~/.claude/projects/<slug> -name '*.jsonl' -newer <ref-file> \
  | xargs grep -h '"name":"advisor"' | wc -l

# Per-day commit cadence for a repo
git log --since='2026-04-17' --format='%ci' \
  | awk '{print $1}' | sort | uniq -c
```
The ccusage CLI reads from ~/.claude/projects/**/*.jsonl and is on npm as ccusage (7). Project slugs in ~/.claude/projects/ are derived from your working-directory path with slashes replaced by hyphens (e.g. /home/craigm26 → -home-craigm26). All numbers in this post were generated from the above commands.
Sources
1. Thariq Shihipar, “Prompt Caching Is Everything” — x.com/trq212/status/2024574133011673516
2. Thariq Shihipar, on the 1M context window as a double-edged sword — x.com/trq212/status/2044653083817660431
3. Anthropic, Using Claude Code: session management and 1M context — claude.com/blog/using-claude-code-session-management-and-1m-context
4. Thariq Shihipar, spec-based session workflow with the AskUserQuestion tool — x.com/trq212/status/2005315275026260309
5. Anthropic, Claude API pricing (prompt caching, cache-read discount) — platform.claude.com/docs/en/about-claude/pricing
6. Anthropic, What’s new in Claude Opus 4.7 (released 2026-04-16) — platform.claude.com/docs/en/about-claude/models/whats-new-claude-4-7
7. `ccusage` on GitHub — github.com/ryoppippi/ccusage
ROBOT.md lives at RobotRegistryFoundation/robot-md. The ecosystem it composes with: OpenCastor (runtime), rcan.dev (wire protocol), robotregistryfoundation.org (registry).