
Building an Autonomous Robot Learning Pipeline: From Training Games to Neural Networks

By Craig Merry · 9 min read
Tags: ContinuonAI, Machine Learning, Robotics, RLDS, Training, MambaWave, Autonomous Learning

Today I shipped something I’ve been thinking about for months: an autonomous learning pipeline that can generate its own training curriculum, run training cycles, and prepare data for neural network training—all without human intervention. Here’s the architecture, the hypotheses behind it, and what we learned from the first real training runs.

The Problem: Robot Learning Needs Data, But Data Is Expensive

Most robot learning systems hit the same wall: they need thousands of demonstrations to learn basic skills, but collecting those demonstrations requires humans in the loop. We wanted to test a different approach: what if the robot could generate its own training data through simulated games?

The ContinuonXR project has two brains:

  • Brain B: A simple pattern-based brain (~500 lines) that learns from demonstrations
  • ContinuonBrain: A full neural architecture that learns from RLDS episodes

The hypothesis: if Brain B can generate training episodes through simulated games, ContinuonBrain can learn from that synthetic data and eventually surpass Brain B’s capabilities.

The Training Pipeline Architecture

We built a five-phase learning cycle:

OBSERVE → SIMULATE → TRAIN → TRANSFER → DEPLOY

OBSERVE: Analyze the current state of Brain B and ContinuonBrain. How many episodes exist? What’s the current accuracy? What skills are missing?

SIMULATE: Generate new training games if data is insufficient. We aim for 100+ episodes before considering the data pool “healthy.”

TRAIN: Run Brain B’s simulator training on the accumulated episodes. Track accuracy, loss curves, and model versions.

TRANSFER: Prepare data for ContinuonBrain’s MambaWave architecture. Convert simple action predictions into world model training data.

DEPLOY: Update the live brain with improved models. Roll back if validation fails.
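
To make the cycle concrete, here is a rough sketch of how one pass could be wired together. Apart from assess_brain_state (shown later), the method names are hypothetical stand-ins, not the actual learning_partner.py API:

def run_learning_cycle(partner):
    # OBSERVE: inspect episode counts, accuracy, and skill coverage.
    state = partner.assess_brain_state()

    # SIMULATE: top up the data pool if it is below the "healthy" threshold.
    if state["episode_count"] < 100:
        partner.generate_training_games(count=50)

    # TRAIN: fit Brain B's action predictor on everything accumulated so far.
    metrics = partner.train_brain_b()

    # TRANSFER: export episodes and priors for MambaWave's world-model training.
    partner.export_for_mambawave(metrics)

    # DEPLOY: promote the new model, or roll back if validation fails.
    if partner.validate(metrics):
        partner.deploy()
    else:
        partner.rollback()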

The Training Games Generator

The core innovation is training_games.py—a procedural game generator that creates diverse learning scenarios:

from enum import Enum

class GameType(str, Enum):
    NAVIGATION = "navigation"      # Point-to-point pathfinding
    PUZZLE = "puzzle"              # Keys, doors, logical sequences
    EXPLORATION = "exploration"    # Map discovery with fog of war
    INTERACTION = "interaction"    # Object manipulation
    SURVIVAL = "survival"          # Avoid hazards (lava, enemies)
    COLLECTION = "collection"      # Gather items optimally
    MULTI_OBJECTIVE = "multi"      # Combined challenges

Each game type generates RLDS episodes with different action distributions. A survival game teaches obstacle avoidance; a puzzle game teaches sequential reasoning; an exploration game teaches efficient coverage.

Difficulty scaling matters. We found that starting with difficulty 1-5 and gradually increasing to 15-20 produces better learning curves than uniform difficulty sampling. The generator automatically balances:

  • Grid sizes (5x5 to 15x15)
  • Obstacle density (10% to 40%)
  • Required actions per episode (10 to 100+)
  • Goal complexity (single target vs. multi-objective)
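
A minimal sketch of how a single difficulty level might map onto those knobs (the function name and exact ramps are illustrative, not the generator's actual code):

def game_config(difficulty: int) -> dict:
    """Map a 1-20 difficulty level onto the parameters the generator balances."""
    t = min(max(difficulty, 1), 20) / 20.0
    size = 5 + round(10 * t)                  # grid sizes from 5x5 up to 15x15
    return {
        "grid_size": (size, size),
        "obstacle_density": 0.10 + 0.30 * t,  # 10% to 40%
        "min_actions": 10 + round(90 * t),    # 10 to 100+ actions per episode
        "multi_objective": difficulty >= 15,  # single target vs. multi-objective
    }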

Hypothesis 1: Diverse Games Beat Repetitive Demonstrations

Traditional robot learning uses human demonstrations of specific tasks. We’re testing whether procedurally generated games with diverse objectives produce more generalizable behavior.

The training games generator creates episodes for seven different game types. Each game type exercises different capabilities:

Game Type       | Primary Skill        | Secondary Skills
Navigation      | Path planning        | Obstacle avoidance
Puzzle          | Sequential logic     | State memory
Exploration     | Coverage strategy    | Uncertainty handling
Interaction     | Object manipulation  | Tool use
Survival        | Threat assessment    | Risk-reward balance
Collection      | Route optimization   | Resource management
Multi-objective | Task prioritization  | Context switching

First results: After generating 60+ episodes with 18,240 training steps, the action predictor showed improved balance across all action types. Navigation-only training tends to bias toward “move_forward”; multi-game training produces more uniform action distributions.

Hypothesis 2: Brain B as Teacher, ContinuonBrain as Student

We’re testing a curriculum learning approach where Brain B’s simple patterns bootstrap ContinuonBrain’s neural networks.

Brain B uses a linear model with 32-dimensional state embeddings and 8 action outputs:

{
  "model_type": "simulator_action_predictor",
  "input_dim": 32,
  "num_actions": 8,
  "action_vocab": [
    "move_forward", "move_backward",
    "rotate_left", "rotate_right",
    "spawn_asset", "spawn_obstacle",
    "reset", "noop"
  ]
}

This simple architecture can be trained quickly on a CPU. After training on 15,746 samples across 138 episodes, the linear model reached ~38% accuracy—above the 12.5% random baseline for 8 classes.

The hypothesis: Brain B’s learned weights capture useful priors about navigation that MambaWave can build upon. Instead of training MambaWave from scratch, we’ll initialize its action head with Brain B’s learned biases.
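
What that initialization could look like, as a minimal numpy sketch; the checkpoint layout and file name are assumptions, not the actual ContinuonBrain code:

import json
import numpy as np

# Load Brain B's linear action predictor (hypothetical checkpoint layout).
with open("brain_b_action_predictor.json") as f:
    ckpt = json.load(f)

W_b = np.array(ckpt["weights"])  # (8, 32): actions x state-embedding dims
b_b = np.array(ckpt["bias"])     # (8,): per-action priors learned from play

# Seed MambaWave's wider fast-loop action head (d_model=64): copy the
# per-action biases and the overlapping 32 columns, leave the rest at init.
rng = np.random.default_rng(0)
W_head = rng.normal(scale=0.02, size=(8, 64))
W_head[:, :32] = W_b
b_head = b_b.copy()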

Hypothesis 3: State Observation Enables Self-Improvement

The learning partner system doesn’t just run training—it observes its own state and makes decisions based on current capabilities.

def assess_brain_state(self) -> Dict[str, Any]:
    """Observe current brain capabilities and gaps."""
    state = {
        "episode_count": self._count_rlds_episodes(),
        "training_samples": self._estimate_training_samples(),
        "model_accuracy": self._load_last_training_metrics(),
        "skill_coverage": self._analyze_action_distribution(),
        "data_freshness": self._check_episode_timestamps(),
    }

    # Generate learning goals based on gaps
    if state["episode_count"] < 100:
        self.goals.append(LearningGoal(
            type="data_generation",
            target="Generate 50+ new training episodes",
            priority=0.9
        ))

    return state

This self-observation creates a feedback loop: the system generates more puzzle games when puzzle actions are underrepresented, or more survival games when obstacle avoidance accuracy is low.
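
A sketch of that selection logic; the game-to-action mapping below is illustrative, not the actual table the system uses:

from collections import Counter

# Which actions each game type exercises most (illustrative subset).
GAME_ACTIONS = {
    "navigation": {"move_forward"},
    "puzzle": {"rotate_left", "rotate_right", "spawn_asset"},
    "survival": {"move_backward", "rotate_left", "rotate_right"},
}

def pick_next_games(episodes, n_games=10):
    """Favour game types whose characteristic actions are underrepresented."""
    counts = Counter(s["action"] for ep in episodes for s in ep["steps"])
    total = sum(counts.values()) or 1
    rarity = {
        game: 1.0 - sum(counts[a] for a in actions) / total
        for game, actions in GAME_ACTIONS.items()
    }
    ranked = sorted(rarity, key=rarity.get, reverse=True)
    return [ranked[i % len(ranked)] for i in range(n_games)]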

Early result: The system correctly identified that our initial 44 episodes were mostly Claude Code tool usage logs, not robot navigation data. It automatically generated 94 new robot-relevant episodes before triggering training.

Hypothesis 4: RLDS Provides Universal Training Currency

We standardized on RLDS (Reinforcement Learning Datasets) as the episode format. Every training game, human demonstration, and simulator run produces the same data structure:

{
  "episode_id": "nav_puzzle_d8_001_20260123_070315",
  "steps": [
    {
      "observation": {"grid_state": [...], "position": [2, 3]},
      "action": "move_forward",
      "reward": 0.1,
      "is_terminal": false
    }
  ],
  "metadata": {
    "game_type": "puzzle",
    "difficulty": 8,
    "source": "training_games_generator"
  }
}

This universality means we can mix data from:

  • Procedural games (synthetic)
  • Human demonstrations (teleop recordings)
  • Real-world runs (physical robot)
  • Simulated home environments (3D scanner data)

The hypothesis: a universal episode format lets us combine data sources that would otherwise be incompatible. Early evidence supports this—we successfully trained on mixed episodes from grid games, home environment simulations, and old teleop recordings.
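
Because every source writes the same schema, the loader can stay trivially simple. A sketch, assuming a hypothetical directory layout:

import json
from pathlib import Path

# One directory per source; all of them emit the same RLDS episode schema.
SOURCES = ["episodes/games", "episodes/teleop", "episodes/robot", "episodes/home3d"]

def load_all_episodes():
    episodes = []
    for src in SOURCES:
        for path in sorted(Path(src).glob("*.json")):
            with open(path) as f:
                episodes.append(json.load(f))  # {"episode_id", "steps", "metadata"}
    return episodes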

Hypothesis 5: Simple Models First, Complex Models Later

We’re deliberately starting with a linear action predictor before scaling to MambaWave. The reasoning:

  1. Debug the pipeline with a model you can inspect directly
  2. Establish baselines before adding complexity
  3. Verify data quality through simple model behavior
  4. Learn curriculum design with fast iteration cycles

The linear model trains in seconds, making it practical to experiment with:

  • Different episode mixes
  • Action weighting strategies
  • Observation encodings
  • Reward shaping
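
For scale, the whole model is a softmax over one linear map. A minimal numpy sketch of that kind of predictor (not the project's trainer), given encoded observations X and integer action labels y:

import numpy as np

def train_linear(X, y, n_actions=8, epochs=50, lr=0.1):
    """Softmax regression: the ceiling-limited baseline described above."""
    W = np.zeros((X.shape[1], n_actions))
    b = np.zeros(n_actions)
    for _ in range(epochs):
        logits = X @ W + b
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        probs[np.arange(len(y)), y] -= 1.0            # softmax cross-entropy gradient
        W -= lr * X.T @ probs / len(y)
        b -= lr * probs.mean(axis=0)
    return W, b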

Once we’re confident in the data pipeline, MambaWave takes over. The neural architecture is more powerful but also more opaque—we want to catch data issues before they propagate.

The MambaWave Architecture (Coming Next)

MambaWave combines two architectural innovations:

State Space Models (Mamba): Efficient sequence modeling with O(n) complexity instead of O(n²) attention. Perfect for real-time robot control where latency matters.

Spectral Processing (WaveCore): FFT-based filtering that captures periodic patterns in sensor data. Robot movements often have repetitive structure—walking gaits, scanning patterns, oscillating corrections.
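
To make the spectral idea concrete, here is a toy low-pass filter in numpy; an illustration of FFT-based filtering, not the WaveCore implementation:

import numpy as np

def lowpass_fft(signal, rate_hz, keep_hz):
    """Zero out FFT bins above keep_hz and reconstruct the signal."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / rate_hz)
    spectrum[freqs > keep_hz] = 0.0
    return np.fft.irfft(spectrum, n=signal.size)

# Example: a 2 Hz gait oscillation buried in noise, sampled at 50 Hz.
t = np.arange(0, 4, 1 / 50)
noisy = np.sin(2 * np.pi * 2 * t) + 0.5 * np.random.randn(t.size)
clean = lowpass_fft(noisy, rate_hz=50.0, keep_hz=5.0)

MambaWave itself is configured at three scales, one per learning loop: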

class MambaWaveConfig:
    # Fast loop: real-time inference
    @classmethod
    def fast_loop(cls):
        return cls(d_model=64, n_layers=1, d_state=16)

    # Mid loop: online learning
    @classmethod
    def mid_loop(cls):
        return cls(d_model=128, n_layers=4, d_state=32)

    # Slow loop: batch training
    @classmethod
    def slow_loop(cls):
        return cls(d_model=256, n_layers=8, d_state=64)

The three-loop design matches Ralph’s learning architecture:

  • Fast loop runs at control rate (10-50 Hz), predicting immediate actions
  • Mid loop runs every few minutes, updating LoRA adapters online
  • Slow loop runs overnight, consolidating memory and training the full model

What We Learned Today

Running the full pipeline from data generation through training revealed several issues:

1. Episode prefix filtering matters. The training loader initially only accepted trainer_, sim_, and home3d_ prefixes. We had to add nav_, puzzle_, explore_, interact_, survive_, collect_, and multi_ to load the new game episodes.
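
In loader terms the fix is just a wider allow-list; roughly (helper name illustrative):

ALLOWED_PREFIXES = (
    "trainer_", "sim_", "home3d_",               # original sources
    "nav_", "puzzle_", "explore_", "interact_",
    "survive_", "collect_", "multi_",            # new training-game prefixes
)

def is_loadable(episode_id: str) -> bool:
    return episode_id.startswith(ALLOWED_PREFIXES)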

2. Action vocabularies must align. Different data sources used different action names: “forward” vs “move_forward”, “left” vs “rotate_left”. We added a normalization layer:

command_map = {
    "forward": "move_forward",
    "backward": "move_backward",
    "left": "rotate_left",
    "right": "rotate_right",
    "turn_left": "rotate_left",
    "turn_right": "rotate_right",
}
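
Every incoming action name is passed through this map before training, with unknown names left unchanged; a sketch:

def normalize_action(action: str) -> str:
    # Fall back to the raw name so novel actions are preserved, not dropped.
    return command_map.get(action, action)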

3. Balanced data improves generalization. Navigation-only data biased the model toward “move_forward” (60%+ of predictions). Adding puzzle and survival games balanced the action distribution.

4. Linear models plateau quickly. At ~38% accuracy, the linear model stopped improving. This isn’t a bug—it’s the expected ceiling for a model without hidden layers. The signal: time to graduate to MambaWave.

Training Metrics

Final training run statistics:

Metric                  | Value
Total episodes          | 138
Training samples        | 15,746
Unique action types     | 8
Training epochs         | 50
Final accuracy          | 38.2%
Random baseline         | 12.5%
Improvement over random | 3.06x

Model version history shows learning:

  • v1: Random init (12.5%)
  • v2: Navigation only (25.3%)
  • v3: +Home3D episodes (31.1%)
  • v4: +Training games (38.2%)

Next Steps

With the training pipeline validated, the next phase is:

  1. Enable MambaWave training using Brain B’s accumulated episodes
  2. Add world model prediction so the robot can plan ahead
  3. Integrate real hardware data from the physical robot
  4. Run the autonomous daemon for continuous improvement

The autonomous learning daemon (learning_partner.py) can now run unattended:

python scripts/compound/learning_partner.py --continuous --interval 3600

Every hour, it observes the brain state, generates new games if needed, runs training, and logs metrics. No human in the loop—just a robot teaching itself through play.

The Bigger Picture

This isn’t just a training pipeline. It’s a test of a hypothesis about robot learning: can we build systems that improve autonomously by generating their own curriculum?

The traditional approach requires human demonstrators, careful task design, and manual data curation. Our approach requires an initial architecture and then… patience. The system generates games, plays them, learns from them, and decides what to practice next.

If this works at scale, the implications are significant. Robots could learn new skills by imagining training scenarios. They could identify their own weaknesses and generate targeted practice. They could share synthetic training data with other robots, accelerating collective learning.

For now, we have 138 episodes, a linear model hitting its ceiling, and a neural architecture waiting in the wings. The experiment continues.


ContinuonXR is an open-source project exploring embodied AI. Code at github.com/craigm26/ContinuonXR.