The Robot That Tunes Itself: OpenCastor's Harness Optimizer
There’s a class of problem in robotics that nobody talks about much: hyperparameter tuning for the agent layer. You spend weeks picking the right motors, tuning PID loops, calibrating sensors — and then you drop an LLM on top and hope the defaults are fine.
They’re usually not.
- How aggressively should the model reason before acting? (thinking_budget)
- How much past context should it carry? (context_budget)
- When should it give up on a failing action and try something different? (retry_on_error, max_iterations)
- At what dollar threshold should it refuse to make a cloud model call? (cost_gate_usd)
These questions have real answers. They’re just different for every robot, every task profile, and every hardware tier. And until now, OpenCastor shipped one set of defaults for everyone.
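To make those knobs concrete, here is a minimal sketch of a harness configuration as a Python dataclass. The parameter names come from the post; the defaults, the validation helper, and the class itself are illustrative assumptions, not OpenCastor's actual schema.

```python
from dataclasses import dataclass

@dataclass
class HarnessConfig:
    # How aggressively the model reasons before acting (token budget)
    thinking_budget: int = 1024
    # How much past context the agent carries between steps (tokens)
    context_budget: int = 8192
    # Whether a failing action is retried before trying something else
    retry_on_error: bool = True
    # Hard cap on agent-loop iterations per task
    max_iterations: int = 10
    # Refuse any cloud model call projected to cost more than this (USD)
    cost_gate_usd: float = 0.05

    def allows_cloud_call(self, projected_cost_usd: float) -> bool:
        """Gate a cloud call on the configured dollar threshold."""
        return projected_cost_usd <= self.cost_gate_usd

cfg = HarnessConfig()
print(cfg.allows_cloud_call(0.02))  # True under the $0.05 default
```

The point of framing it this way: every field is a tunable, and "good defaults" for one robot are the wrong defaults for another.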
The autoresearch pipeline
Today we shipped the first iteration of a nightly harness optimizer. Here’s how it works:
- Generate — Gemini 2.0 Flash proposes 8 candidate harness configurations, each tweaking 1–3 parameters from the current champion
- Evaluate — each candidate runs against 30 synthetic scenarios (navigation, manipulation, multi-step reasoning, error recovery, P66 constraint handling)
- Rank — a weighted scorer evaluates success rate (50%), P66 compliance (25%), token efficiency (15%), and latency (10%)
- Promote — if a candidate beats the current champion by >5%, it becomes the new champion and gets pushed directly to default_harness.yaml
No branches. No PRs. No manual approval step. If it wins, it ships.
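The rank-and-promote step can be sketched in a few lines. The weights are the ones listed above; the metric names, the assumption that each metric is normalized to [0, 1], and the reading of ">5%" as a relative margin are all mine, not OpenCastor's.

```python
WEIGHTS = {
    "success_rate": 0.50,    # fraction of the 30 scenarios passed
    "p66_compliance": 0.25,  # constraint-handling score
    "token_efficiency": 0.15,
    "latency": 0.10,         # normalized so higher = faster
}

def score(metrics: dict[str, float]) -> float:
    """Weighted sum of normalized candidate metrics."""
    return sum(WEIGHTS[k] * metrics[k] for k in WEIGHTS)

def promote(champion_score: float, candidate_score: float) -> bool:
    """Promote only if the candidate beats the champion by more than 5%."""
    return candidate_score > champion_score * 1.05

candidate = {"success_rate": 0.9, "p66_compliance": 1.0,
             "token_efficiency": 0.8, "latency": 0.7}
s = score(candidate)  # 0.50*0.9 + 0.25*1.0 + 0.15*0.8 + 0.10*0.7 = 0.89
print(promote(0.80, s))  # True: 0.89 > 0.80 * 1.05 = 0.84
```

Weighting success rate at half the score means a candidate can't buy its way to promotion with token savings alone, which matches the pipeline's priorities.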
The first champion: lower_cost — dropping cost_gate_usd from $0.05 to $0.01. Score: 0.9101 vs the baseline of 0.0. Turns out more aggressive cost control actually improves P66 compliance (fewer runaway cloud calls) and doesn’t meaningfully hurt success rate on the tested scenarios.
What’s next: your hardware profile
The generic optimizer is useful. But every robot is different. A Pi 5 with a Hailo-8L NPU has completely different optimal settings than a Pi 4 with 4 GB RAM. The NPU offloads inference — so thinking_budget can probably be lower because the model doesn’t need to reason as hard before acting. The Pi 4 needs a tighter cost_gate_usd and smaller context_budget just to stay responsive.
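One way to express that per-hardware tuning is as overrides layered on the generic champion. The profile names reflect the hardware mentioned above; every number here is an illustrative guess, and the merge helper is hypothetical.

```python
# Generic champion config (values are placeholders, not real champions).
GENERIC_CHAMPION = {"thinking_budget": 1024, "context_budget": 8192,
                    "cost_gate_usd": 0.01}

PROFILE_OVERRIDES = {
    # Hailo-8L NPU offloads inference, so less reasoning budget is needed
    "pi5-hailo8l": {"thinking_budget": 512},
    # 4 GB RAM: tighter context and cost gate to stay responsive
    "pi4-4gb": {"context_budget": 4096, "cost_gate_usd": 0.005},
}

def config_for(profile: str) -> dict:
    """Merge the generic champion with a hardware profile's overrides."""
    return {**GENERIC_CHAMPION, **PROFILE_OVERRIDES.get(profile, {})}

print(config_for("pi5-hailo8l")["thinking_budget"])  # 512
```

Unknown profiles fall back to the generic champion, so a new hardware tier is never worse off than today's one-size-fits-all default.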
The next step is obvious: run the optimizer per hardware profile, seeded with hardware-aware candidates, and feed it data from real robots running castor contribute.
When your robot idles with contribute enabled, instead of (or alongside) donating to a BOINC science work unit, it can run harness evaluation scenarios matched to its exact hardware combo. That data feeds the nightly pipeline. If enough Pi 5 + Hailo robots have submitted results, the pipeline produces a champion config tuned specifically for that hardware — and pushes it back to every robot with that tier.
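The contribute-time loop described above might look something like this. Every name here (the scenario runner, the result shape) is a stand-in I invented to show the data flow, not OpenCastor's actual API.

```python
def run_scenario(name: str) -> bool:
    # Stand-in for a real harness-evaluation run; always "passes" here.
    return True

def idle_contribute(hardware_tier: str, scenarios: list[str]) -> list[dict]:
    """While idle, run evaluation scenarios matched to this robot's
    hardware tier and package the results for the nightly pipeline."""
    return [{"hardware_tier": hardware_tier, "scenario": s,
             "passed": run_scenario(s)} for s in scenarios]

batch = idle_contribute("pi5-hailo8l", ["navigation", "error_recovery"])
print(len(batch))  # 2
```

The key design point is the hardware_tier tag on every result: it is what lets the nightly pipeline bucket submissions by hardware combo before ranking candidates.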
More detail in the full blog post on opencastor.com.
The harness editor got a lot better
Separately: the visual harness editor in the Flutter client had some rough edges. Today’s update fixes the biggest ones:
- Smart placement — when you add a skill block, it goes after context/memory layers, before model layers. When you add a model block, it goes after skills. The pipeline order is enforced, not random.
- Zoom controls — finally. +/−/fit buttons on the flow canvas.
- Drag-to-connect — tap a node, tap another, edge created. Tap blank canvas to cancel.
- Block info overlay — every block has an ℹ️ that explains what it does, why it’s where it is, and what you can configure.
- Connector labels — tap any edge midpoint to label it YES/NO/loop/error/timeout/data/fallback, or delete it.
The editor is still not perfect — there’s more to do on manual layout freedom — but it’s significantly less confusing.
The bigger picture
The theme of the last few days of OpenCastor work has been: close the loop. The robot generates telemetry → telemetry feeds evaluation → evaluation finds better configs → configs improve the robot → the robot runs better and generates better telemetry.
The harness optimizer is one loop. The contribute fleet evaluation is a wider loop that spans the whole user base. Both run nightly. Both push directly to main when they find something better.
That’s the goal: a robot runtime that gets measurably better over time, automatically, because the infrastructure makes it trivial to evaluate, compare, and promote improvements.
The infrastructure already exists. The loops are closing.