
OpenCastor v2026.3.10.1: Semantic Perception + Real Hardware

8 min read · By Craig Merry
Tags: OpenCastor · Robotics · Embeddings · Hardware · HLabs · Release

There’s a failure mode that most robot projects hit pretty quickly, and it’s not about the motors. It’s about memory.

The robot does a thing, it works, the world moves on. Twenty minutes later the same situation comes up — same corridor, same lighting, same obstacle pattern — and the robot treats it like it’s never happened before. Every perception is fresh. Every decision is cold-start. The robot isn’t learning from experience; it’s just rerunning the same cold-start reasoning over and over until something goes wrong.

That’s the problem I wanted to fix in this release. And while I was at it, I wanted to solve the other nagging gap: real BLDC motor support.

EmbeddingInterpreter: Robots That Recognize Familiarity

The EmbeddingInterpreter is a semantic perception layer that sits between the camera and the brain. At its core, it’s pretty simple: encode every scene as a vector, store those vectors with context about what happened, and at query time retrieve the most similar past episodes and inject them into the prompt.

What makes it useful is the execution. The default backend is CLIP — openai/clip-vit-base-patch32 — which produces 512-dimensional embeddings that map images and text into the same semantic space. No API key. No network call. Runs on a Raspberry Pi 4B at ~80ms per frame. You pip install opencastor[clip] and it just works.

The architecture has three tiers:

Tier 0 (free, local): CLIP or SigLIP2. This is the default. 512 dims, CPU-only, zero cost. For most use cases — obstacle avoidance, navigation, object recognition — this is all you need.

Tier 1 (experimental, local): ImageBind or CLAP. ImageBind is Meta’s 6-modality model (RGB, depth, audio, text, IMU, thermal) — OpenCastor exposes three of those: image, text, and audio in the same 1024-dim space. It’s remarkable technology and it’s CC BY-NC, so check your use case. CLAP handles audio-text embedding via laion/clap-htsat-unfused.

Tier 2 (premium, API): gemini-embedding-2-preview with MRL (Matryoshka Representation Learning) — default 3072 dims, truncatable to 1536 or 768, minimum 128. Accepts PNG, JPEG, WAV, MP3 — magic-byte detected, not filename-based.

The backend is selected by the backend config key. "auto" (default) tries Gemini first; if unavailable, falls back to CLIP. "local_extended" tries ImageBind first, then CLIP. Or pin it explicitly in RCAN:

interpreter:
  backend: clip
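The fallback chain is easier to see as code. This is an illustrative sketch of the selection logic described above, not the actual OpenCastor API — the function and probe names are hypothetical:

```python
def gemini_available() -> bool:
    # Stand-in probe: the real check would look for an API key and
    # network reachability. Hardcoded here so the sketch is runnable.
    return False

def imagebind_available() -> bool:
    # Stand-in probe for a local ImageBind install.
    return False

def select_backend(configured: str = "auto") -> str:
    """Resolve the `backend` config key to a concrete encoder name."""
    if configured == "auto":
        # Prefer the premium API tier; fall back to free local CLIP.
        return "gemini" if gemini_available() else "clip"
    if configured == "local_extended":
        # Prefer the experimental local tier; fall back to CLIP.
        return "imagebind" if imagebind_available() else "clip"
    # Explicitly pinned backend, e.g. "clip" from the RCAN snippet above.
    return configured
```

The nice property of this shape is that a pinned backend bypasses all probing, so an explicit `backend: clip` never pays the cost of a failed Gemini check.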

How RAG context injection works

The interpreter hooks into TieredBrain.pre_think() and post_think(). Before the LLM call, the current camera frame is encoded and the top-k most similar past episodes are retrieved from the vector store. Those episodes get prepended to the system prompt:

[Semantic Memory — similar past situations]
  1. (91% match) Instruction: "narrow gap 0.4m" → Action: wait → Outcome: success
  2. (87% match) Instruction: "corridor blocked" → Action: request_help → Outcome: success
  3. (82% match) Instruction: "obstacle left 0.3m" → Action: turn_right → Outcome: success

The LLM sees this context and can now say “I’ve been in a situation like this before, and waiting worked.” That’s not fine-tuning. It’s retrieval. The robot’s behavior adapts without updating any weights.
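The prompt block shown above is plain string formatting over the retrieved episodes. A minimal sketch, assuming each retrieved episode carries a similarity score, instruction, action, and outcome (the function name is illustrative):

```python
def format_semantic_memory(episodes):
    """Render retrieved episodes as a [Semantic Memory] prompt block.

    `episodes` is a list of (similarity, instruction, action, outcome)
    tuples, sorted most-similar first.
    """
    lines = ["[Semantic Memory — similar past situations]"]
    for i, (sim, instruction, action, outcome) in enumerate(episodes, 1):
        lines.append(
            f'  {i}. ({sim:.0%} match) Instruction: "{instruction}" '
            f"→ Action: {action} → Outcome: {outcome}"
        )
    return "\n".join(lines)

block = format_semantic_memory([
    (0.91, "narrow gap 0.4m", "wait", "success"),
    (0.87, "corridor blocked", "request_help", "success"),
])
```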

After the LLM returns a Thought, the new episode is encoded and stored. The vector store at ~/.opencastor/episodes/ persists as embeddings.npy + meta.json — pure numpy, no extra ML stack required.

I added Prometheus metrics for the full encoding pipeline (opencastor_embedding_* labels by backend, modality, and error type), a Streamlit Embedding tab with live backend status and RAG preview, and a benchmark suite so you can compare latencies across tiers on your actual hardware.

The result: a robot that can recognize “this is a familiar situation” and apply learned strategies — without any API cost, on commodity hardware, with a single YAML option.

HLabs ACB v2.0: Actually Good Motor Control

The other half of this release is hardware support for the HLaboratories Actuator Control Board v2.0.

I’ve been running OpenCastor mostly on PCA9685 PWM controllers and Dynamixel servos. The PCA9685 is great for budget RC-style setups — it’s everywhere, it’s cheap, it works. But it has a ceiling. You’re doing open-loop PWM, you have no position feedback, and the moment you want real torque control or precise positioning you’re stuck.

The ACB v2.0 is a different class of device. It’s an STM32G474 running a proper FOC (field-oriented control) loop for 3-phase BLDC motors. Supports 12V–30V at up to 40A peak. More importantly: it speaks CAN Bus at 1Mbit/s using the STM32’s dedicated FDCAN peripheral, not a bolted-on SPI transceiver. That matters for multi-joint robots because you can daisy-chain ten ACBs on a single CAN bus with 120Ω terminators at each end.

The OpenCastor driver does:

  • Auto-detection by USB VID/PID (port: auto scans /dev/tty* for known HLabs VID/PID combinations)
  • Dual transport: USB-C serial for single-axis setups, SocketCAN for multi-joint
  • Calibration flow: pole pairs → zero electrical angle → PID gains, via wizard or API
  • Real-time telemetry at 50Hz: position (rad), velocity (rad/s), current, bus voltage, error flags
  • Firmware flash via DFU mode: castor flash --id joint_0 --firmware <github-releases-url>
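On the CAN side, each command ends up as an arbitration ID plus a small payload addressed to one node. The sketch below packs a hypothetical position command with stdlib struct — the 6-byte layout (command byte, pad, float32 setpoint) and the ID scheme are assumptions for illustration, NOT the documented ACB v2.0 wire format:

```python
import struct

CMD_POSITION = 0x01  # illustrative command code, not from the ACB spec

def pack_position_cmd(node_id: int, position_rad: float):
    """Return (arbitration_id, payload) for a hypothetical position command."""
    arbitration_id = 0x100 | node_id          # assumed: one base ID per node
    payload = struct.pack("<Bxf", CMD_POSITION, position_rad)  # 6 bytes, little-endian
    return arbitration_id, payload

arb, data = pack_position_cmd(node_id=3, position_rad=1.5708)
```

With python-can (the dependency pulled in by opencastor[hlabs]) such a frame would go out roughly as `bus.send(can.Message(arbitration_id=arb, data=data, is_extended_id=False))` on a SocketCAN channel.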

Three RCAN profiles ship out of the box:

  • hlabs/acb-single — single wheel or winch
  • hlabs/acb-arm-3dof — 3-DOF robot arm, CAN bus, nodes 1–3
  • hlabs/acb-biped-6dof — 6-DOF biped, CAN bus, nodes 1–6

The biped profile is speculative for now — I don’t have a full biped built — but the driver is solid and the CAN protocol is working. If you’re building a biped and want to use OpenCastor, I’d love to hear from you.

Install:

pip install opencastor[hlabs]  # adds python-can dependency

What calibration actually means in practice

The calibration flow is worth explaining because it’s not obvious. BLDC motors have a property called “pole pairs” — the number of electrical cycles per mechanical revolution. A motor with 7 pole pairs completes 7 full electrical cycles (360° each) for every 1 mechanical revolution. The controller needs to know this to run FOC correctly.

The zero electrical angle is the offset between the magnetic encoder’s zero position and the motor’s electrical zero. Without this, the controller doesn’t know where “phase A aligned” actually is, and your torque output will be wrong.

You provide pole_pairs in the config; the ACB determines the zero electrical angle during a short calibration sequence. Run castor wizard, answer two questions (pole pairs for your motor, CAN node ID if using bus), and it handles the rest. The encoder calibration step waits up to 10 seconds for a response. The result is a CalibrationResult JSON with success, zero_electrical_angle, pole_pairs, and pid_applied.
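The relationship the controller relies on is small enough to write down. A sketch of the mapping from mechanical encoder angle to electrical angle, given the two calibration values described above (the function name and signature are illustrative, not the ACB API):

```python
import math

def electrical_angle(mech_angle_rad: float, pole_pairs: int,
                     zero_offset_rad: float) -> float:
    """Map a mechanical encoder angle to the electrical angle FOC needs.

    One mechanical revolution spans `pole_pairs` electrical revolutions;
    `zero_offset_rad` is the calibrated zero electrical angle.
    """
    return (mech_angle_rad * pole_pairs - zero_offset_rad) % (2 * math.pi)
```

For a 7-pole-pair motor, half a mechanical turn is 7π of electrical angle, which wraps to π — get the zero offset wrong and every commanded torque vector is rotated by that same error, which is why the calibration step exists.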

How They Connect

These two features are more connected than they look.

Consider a robot arm built on ACB v2.0 joints. The arm picks up objects, places them, occasionally fails — misses the target, drops something, hits a joint limit. Each attempt is recorded as an episode with an embedding of the camera view.

Next session, the robot reaches for an object. The EmbeddingInterpreter finds three similar past episodes: two successes from slightly different approach angles, one failure from the current angle. The LLM prompt now includes that context. The planner can say: “last time I approached from this exact angle, I hit the joint limit — let me rotate the base 15 degrees.”

That’s not magic. It’s just retrieval-augmented generation applied to robot experience. But it requires both pieces: an embedding layer that can recognize visual similarity, and a hardware platform that can actually execute precise corrections.

What’s Next

I want to improve the episode store with better indexing — right now it’s flat cosine similarity over a numpy array, and at 10k episodes the latency starts to show. HNSW or IVF indexing via FAISS is the obvious next step.

On the hardware side, I want to get the CAN bus transport tested on a real multi-joint setup. The driver already supports velocity, position, torque, and voltage control modes — the next step is real hardware validation.

The bigger thing I’m thinking about is multi-modal embeddings for the swarm. If you have three robots, each with its own episode store, there’s no reason they can’t share relevant memories. “Node alex encountered this obstacle configuration and turned right. Node bob is now in a similar situation.” That’s distributed episodic memory, and it’s doable with the current architecture.

pip install opencastor[clip,hlabs]   # CLIP + HLabs ACB support