ContinuonAI Lab Notes: Day 1
This entry marks the initiation of the ContinuonAI development phase. The primary objective is to validate the proposed architecture through empirical testing in unstructured home environments. Rather than relying on theoretical simulations, the project focuses on validating each functional layer—perception, planning, tele-operation, and autonomy—through iterative, falsifiable hypotheses.
Project Objectives and Validation Strategy
ContinuonAI serves as the core intelligence layer for the ContinuonXR platform. The central thesis posits that a world-model architecture integrated with a hierarchical Continuous Memory System (CMS) can operate effectively within the constraints of consumer-grade hardware and dynamic real-world environments.
The validation strategy rests on three pillars:
- Constrained Operational Scope: By focusing on discrete, measurable tasks (e.g., single-room navigation, object manipulation), we can isolate variables such as latency, memory utilization, and telepresence efficacy.
- Hypothesis-Driven Development: Each module is developed to test a specific hypothesis. For instance, “A single-arm tele-op interface with compliant gripping is sufficient for non-rigid object manipulation.” Success or failure in these tests dictates the architectural evolution.
- Edge-Cloud Symmetry: The architecture necessitates a symmetrical deployment model where tight control loops execute locally on a Raspberry Pi 5, while high-level planning and multimodal context processing occur remotely. Synchronization is managed via the ContinuonAI application, ensuring consistency across local and remote operations.
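A minimal sketch of this split, assuming the remote planner can be modeled as a slow callable that is touched only off the hot loop; the names (`EdgeState`, `plan_remotely`, `run_control_loop`) are hypothetical stand-ins for the actual ContinuonAI synchronization path:

```python
# Hypothetical edge/cloud split: a tight local loop on the Pi 5, with the slow
# remote planner consulted only every `sync_every` steps. Illustration only.
import time
from dataclasses import dataclass, field


@dataclass
class EdgeState:
    """Snapshot shared between the local control loop and the remote planner."""
    step: int = 0
    pose: tuple = (0.0, 0.0, 0.0)
    goal: str | None = None                      # filled in by the remote planner
    last_sync: float = field(default_factory=time.time)


def plan_remotely(state: EdgeState) -> str:
    """Stand-in for high-level planning / multimodal context processing."""
    return f"waypoint-for-step-{state.step}"


def run_control_loop(state: EdgeState, hz: float = 50.0, sync_every: int = 25) -> None:
    """Tight local loop; only the sync step leaves the device."""
    period = 1.0 / hz
    for _ in range(100):
        state.step += 1
        # ... local perception + HOPE policy step would execute here ...
        if state.step % sync_every == 0:
            state.goal = plan_remotely(state)    # slow path, off the hot loop
            state.last_sync = time.time()
        time.sleep(period)


if __name__ == "__main__":
    run_control_loop(EdgeState())
```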
Hardware Baseline: Raspberry Pi 5 Integration
To establish a deterministic baseline for experimentation, the legacy DonkeyCar stack (based on NVIDIA Jetson Nano) has been retired in favor of the Raspberry Pi 5 (8 GB).
- Performance Rationale: The Pi 5 offers superior single-thread performance and a simplified I/O architecture for camera integration, which is critical for the initial perception stack.
- Software Ecosystem: The transition facilitates a standard containerized workflow for ROS 2 (Foxy/Humble). Crucially, this is treated as a Hardware Abstraction Layer (HAL) hypothesis: we are actively evaluating whether the overhead of ROS 2 middleware introduces unacceptable latency for the high-frequency HOPE control loop compared to direct memory access (see the latency-probe sketch below). For now, it provides a convenient driver ecosystem, but the “Brain” remains agnostic.
- Legacy Archival: The previous hardware configuration has been documented and archived so it can be restored if backward compatibility is ever required.
This hardware reset provides a stable platform for the initial experimental series: visual line following, SLAM-based mapping, and tele-operation latency profiling.
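To make the latency question concrete, a middleware-agnostic probe like the one below can time round trips through any transport: a direct in-process call as the baseline, and a ROS 2 publish/subscribe hop wrapped the same way for comparison. The harness itself is an assumption, not part of the current stack:

```python
# Generic round-trip latency probe. The transport under test (direct call,
# ROS 2 pub/sub round trip, shared memory, ...) is wrapped as one callable.
import statistics
import time
from typing import Callable


def profile_transport(transport: Callable[[bytes], bytes],
                      payload: bytes = b"\x00" * 1024,
                      iterations: int = 1000) -> dict:
    """Time `iterations` round trips through `transport` and summarize in ms."""
    samples = []
    for _ in range(iterations):
        t0 = time.perf_counter()
        transport(payload)
        samples.append((time.perf_counter() - t0) * 1e3)
    return {
        "p50_ms": statistics.median(samples),
        "p95_ms": statistics.quantiles(samples, n=20)[18],
        "max_ms": max(samples),
    }


if __name__ == "__main__":
    # Baseline: a direct in-process call. A ROS 2 round trip would be wrapped
    # the same way and compared against this number.
    print(profile_transport(lambda b: b))
```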
The Genesis: LiveCaptionsXR and Spatial Intelligence
The transition to a spatial intelligence model is rooted in the development of LiveCaptionsXR. Originally designed as an accessibility tool for real-time captioning, LiveCaptionsXR has evolved into the primary interface for human-in-the-loop data collection and tele-operation.
In the context of ContinuonAI, the XR headset serves a dual purpose:
- Spatial Data Ingestion: It captures high-fidelity, ego-centric video and depth data, enriched with human gaze and head-pose metadata. This data is essential for training the CMS to understand human-centric spatial context.
- Immersive Tele-operation: It provides a low-latency interface for remote control, allowing the operator to demonstrate complex manipulation tasks. These demonstrations form the “expert trajectories” used to bootstrap the on-device policy learning.
LiveCaptionsXR effectively bridges the gap between “blind” robotic actuation and true spatial understanding, providing the semantic grounding required for the HOPE architecture.
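For illustration, one captured sample might look like the record below; the field names and layout are assumptions, not the actual LiveCaptionsXR schema:

```python
# Hypothetical layout for one ego-centric sample captured through LiveCaptionsXR.
from dataclasses import dataclass


@dataclass(frozen=True)
class EgocentricSample:
    timestamp_us: int                            # capture time, microseconds
    rgb: bytes                                   # encoded video frame
    depth: bytes                                 # aligned depth frame
    head_pose: tuple[float, ...]                 # position xyz + quaternion wxyz
    gaze_dir: tuple[float, float, float]         # unit vector in the head frame
    operator_action: tuple[float, ...] | None = None   # present only in tele-op demos


# A tele-op demonstration ("expert trajectory") is then just an ordered list of
# samples whose operator_action field is populated.
ExpertTrajectory = list[EgocentricSample]
```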
Semantic Augmentation: On-Device LLMs and Modularity
While HOPE manages the low-level policy and state evolution, high-level semantic reasoning is augmented by Gemma 3n, deployed on-device via flutter_gemma. This integration allows the system to process natural language commands (“Find the red mug”) and translate them into structured policy goals for the CMS.
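A toy version of that translation step, with a rule-based stub standing in for the Gemma 3n call and a hypothetical `PolicyGoal` structure standing in for whatever the CMS actually consumes:

```python
# Sketch of the command-to-goal translation. The parse below is a stand-in for
# the on-device LLM call; the PolicyGoal fields are assumptions, not the schema.
from dataclasses import dataclass


@dataclass
class PolicyGoal:
    verb: str            # e.g. "find", "grasp", "navigate"
    target_object: str   # e.g. "red mug"
    constraints: dict    # free-form hints (room, timeout, ...)


def command_to_goal(utterance: str) -> PolicyGoal:
    """Stub translation; in the real pipeline this prompt goes to Gemma 3n."""
    tokens = utterance.lower().split()
    verb = tokens[0] if tokens else "idle"
    target = " ".join(tokens[2:]) if len(tokens) > 2 else " ".join(tokens[1:])
    return PolicyGoal(verb=verb, target_object=target, constraints={})


print(command_to_goal("Find the red mug"))
# PolicyGoal(verb='find', target_object='red mug', constraints={})
```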
The architecture is explicitly modular, designed to scale with the available compute:
- Base Tier (Raspberry Pi 5): Runs the core HOPE policy and CMS.
- Augmented Tier (Pixel 10 + TPU): Offloads the heavy lifting of semantic inference and large-scale vision processing to a tethered mobile device.
- Spatial Tier (Samsung XR): Provides the highest fidelity of ego-centric data and gaze-based intent modeling.
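A toy routing sketch of this modularity: the tier names mirror the list above, but the workload names and routing table are invented purely for illustration:

```python
# Illustrative tier-aware routing; preferences and fallbacks are assumptions.
from enum import Enum, auto


class ComputeTier(Enum):
    BASE = auto()       # Raspberry Pi 5: core HOPE policy + CMS
    AUGMENTED = auto()  # Pixel 10 + TPU: semantic inference, heavy vision
    SPATIAL = auto()    # Samsung XR: ego-centric capture, gaze intent


def route(workload: str, available: set[ComputeTier]) -> ComputeTier:
    """Pick the preferred tier that is actually present for a given workload."""
    preference = {
        "policy_step": [ComputeTier.BASE],
        "semantic_inference": [ComputeTier.AUGMENTED, ComputeTier.BASE],
        "gaze_intent": [ComputeTier.SPATIAL],
    }
    for tier in preference.get(workload, [ComputeTier.BASE]):
        if tier in available:
            return tier
    raise RuntimeError(f"no tier available for {workload}")


print(route("semantic_inference", {ComputeTier.BASE}))  # falls back to BASE
```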
The “North Star” Consumer Setup
The ultimate vision for democratizing this technology relies on a specific consumer hardware convergence:
- Vision: Samsung XR Glasses/Headset for spatial data ingestion and immersive tele-op.
- Brain: Google Pixel 10 (or equivalent) leveraging next-gen TPU silicon for accelerated on-device inference.
- Body: An open-source robot kit acting as the actuation end-effector.
This configuration represents the ideal balance of accessibility and performance, enabling a “bring your own brain” model where the user’s existing mobile and XR hardware powers the robot.
Architectural Theory: HOPE vs. DiT/ACT
A core component of the ContinuonAI research is the HOPE (Hierarchical On-device Policy Engine) architecture. This section clarifies the theoretical positioning of HOPE relative to contemporary multimodal architectures like Diffusion Transformers (DiT) and Adaptive Computation Time (ACT) decoders.
Architectural Alignment
The question often arises: which component of the standard multimodal pipeline does HOPE replace? HOPE is designed as a unified state evolution and decoding engine, effectively subsuming the roles of:
- Global Mixing: Replacing the quadratic complexity of Transformer self-attention with linear-time state-space dynamics.
- Decoder Logic: Replacing standard autoregressive decoders with a continuous memory-conditioned policy.
- Dynamics Modeling: Replacing the flow-matching mechanism of DiT with a continuous state evolution operator.
| Component | Function | HOPE Equivalent |
|---|---|---|
| Flow-matching DiT | Continuous denoiser dynamics | Continuous State Evolution |
| Transformer Decoder | Autoregressive state mapping | CMS Read/Write + Local Policy |
| ACT Decoder | Adaptive computation | Nested Layers & Structural Skipping |
| SSM Models (Mamba) | Long-range sequence modeling | Continuous Memory Updates |
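As a concrete, if toy, rendering of the “Global Mixing” row: the scan below shows how a diagonal state-space recurrence covers the whole sequence in linear time, with no T x T attention matrix. Shapes and parameters are placeholders, not HOPE internals:

```python
# Toy linear-time state-space scan playing the role self-attention plays in a
# Transformer block: every output token depends on all earlier tokens, in O(T).
import numpy as np


def ssm_scan(x: np.ndarray, a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Diagonal SSM recurrence h_t = a*h_{t-1} + b*x_t, linear in sequence length."""
    T, d = x.shape
    h = np.zeros(d)
    out = np.empty_like(x)
    for t in range(T):
        h = a * h + b * x[t]      # each step touches only the current token
        out[t] = h
    return out


x = np.random.randn(128, 16)                      # T=128 tokens, d=16 channels
out = ssm_scan(x, a=np.full(16, 0.9), b=np.ones(16))
print(out.shape)                                  # (128, 16), no attention matrix
```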
Mathematical Formulation
The replacement of established components like DiT requires a mathematically stable recurrence. The HOPE architecture is defined by the following state update loop at step $t$:
- Input Encoding: $$e_t = E_\phi(x_t, a_{t-1}, r_t)$$
- CMS Retrieval: $$c_t = \text{Read}_\psi(M_{t-1}, s_{t-1}, e_t)$$
- State Evolution (World Model): $$s_t = F_{\Theta_t}(s_{t-1}, e_t, c_t)$$
- CMS Update (Hierarchical Write): $$M_t = G_\psi(M_{t-1}, s_t, e_t, r_t)$$
- Nested Learning (On-Device Update): $$\Theta_t = \Theta_{t-1} + \eta_t U_\xi(s_t, M_t, r_t)$$
- Action Decoding: $$y_t, u_t = H_\omega(s_t, c_t)$$
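A schematic transcription of this recurrence, with toy placeholder modules standing in for $E_\phi$, $\text{Read}_\psi$, $F_{\Theta_t}$, $G_\psi$, $U_\xi$, and $H_\omega$; only the data flow between the six steps follows the equations, and nothing about the actual learned modules is implied:

```python
# Schematic HOPE step: the six equations above as plain data flow.
import numpy as np

D = 32                                   # toy state / embedding width


def hope_step(state, memory, theta, x_t, a_prev, r_t, eta_t=1e-3):
    e_t = encode(x_t, a_prev, r_t)                             # 1. input encoding
    c_t = cms_read(memory, state, e_t)                         # 2. CMS retrieval
    state = evolve(theta, state, e_t, c_t)                     # 3. state evolution
    memory = cms_write(memory, state, e_t, r_t)                # 4. hierarchical write
    theta = theta + eta_t * nested_update(state, memory, r_t)  # 5. nested learning
    y_t, u_t = decode(state, c_t)                              # 6. action decoding
    return state, memory, theta, y_t, u_t


# Placeholder modules, just to make the loop executable end to end.
def encode(x, a, r):          return np.tanh(x + a + r)
def cms_read(M, s, e):        return M.mean(axis=0) + 0.0 * (s + e)
def evolve(th, s, e, c):      return np.tanh(th * s + e + c)
def cms_write(M, s, e, r):    return np.vstack([M[1:], s + 0.0 * (e + r)])
def nested_update(s, M, r):   return r * s * 0.0 + 1e-3 * s
def decode(s, c):             return s + c, np.linalg.norm(s)


if __name__ == "__main__":
    s, M, th = np.zeros(D), np.zeros((8, D)), np.ones(D)
    for t in range(5):
        s, M, th, y, u = hope_step(s, M, th, np.random.randn(D), np.zeros(D), 0.0)
    print(y.shape, float(u))
```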
The Wave-Particle Decoder
To implement the state evolution function $F_{\Theta_t}$ without attention, HOPE utilizes a hybrid Wave-Particle mechanism:
- Wave Stream: A linear-time State-Space Model (SSM) update capturing long-range global dependencies ($w_t$).
- Particle Stream: A local nonlinear update (MLP/Conv) capturing short-range, high-frequency interactions ($p_t$).
- Gated Fusion: $$s_t = s_{t-1} + g_t \odot \text{Particle}(p_t) + (1 - g_t) \odot \text{Wave}(w_t)$$
This design ensures $O(T)$ scaling, which is a strict requirement for long-duration operation on edge devices like the Raspberry Pi 5. Furthermore, the nested update rule ($\Theta_t$) enables continuous, on-device adaptation based on the reward signal $r_t$, a capability not present in static Transformer deployments.
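A toy rendering of the gated fusion, assuming a diagonal SSM step for the wave stream, a single matrix nonlinearity for the particle stream, and a placeholder gate parameterization; dimensions and parameters are illustrative only:

```python
# Toy Wave-Particle fusion: linear-time wave carry, local particle update,
# sigmoid gate mixing the two into the next state.
import numpy as np

d = 16
a_wave = np.full(d, 0.95)            # diagonal SSM transition (long-range carry)
W_part = np.random.randn(d, d) * 0.1


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fuse_step(s_prev, w_prev, x_t):
    w_t = a_wave * w_prev + x_t              # wave stream: linear-time SSM update
    p_t = np.tanh(W_part @ x_t)              # particle stream: local nonlinear update
    g_t = sigmoid(s_prev + x_t)              # gate (placeholder parameterization)
    s_t = s_prev + g_t * p_t + (1.0 - g_t) * w_t
    return s_t, w_t


s, w = np.zeros(d), np.zeros(d)
for x in np.random.randn(100, d):            # one pass over T steps, constant state
    s, w = fuse_step(s, w, x)
print(s.shape)
```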
Forward Outlook
The immediate roadmap focuses on validating the hardware-software integration:
- Latency Profiling: Quantifying the end-to-end latency from the hardware abstraction layer (currently ROS 2) to the ContinuonAI application.
- Navigation Hypothesis: Validating the “Pi 5 + Tele-op” configuration for robust hallway navigation without video signal degradation.
- Manipulation Setup: Integrating the robotic arms to commence initial pick-and-place experiments upon hardware arrival.
This phase establishes the foundational infrastructure required to rigorously test the HOPE architecture and the spatial intelligence capabilities enabled by LiveCaptionsXR.