Building LiveCaptionsXR with Nexa AI and Qualcomm's QDC Test Environment
When I first built LiveCaptionsXR, the vision was clear: real-time, spatially-aware closed captions that appear exactly where voices originate in 3D space. But achieving this at scale—with the low latency and energy efficiency required for XR headsets—required a fundamental shift from CPU-based processing to dedicated AI acceleration hardware.
Enter Nexa AI and Qualcomm’s Hexagon NPU. This post chronicles how we integrated Nexa SDK into LiveCaptionsXR and validated the entire pipeline on Qualcomm Developer Cloud (QDC), transforming the app from a proof-of-concept into a production-ready accessibility tool powered by on-device AI.
The Challenge: Real-Time AI at Scale
LiveCaptionsXR is built for the 466 million people worldwide living with disabling hearing loss. The core requirements are unforgiving:
- Sub-500ms latency for real-time captioning
- Energy efficiency critical for XR headsets (battery life matters)
- 100% on-device processing for privacy (no audio data ever leaves the device)
- Concurrent AI workloads (ASR + LLM enhancement + visual understanding)
Our initial stack used Whisper GGML (CPU) and Gemma 3n (CPU/GPU), which worked but struggled with thermal throttling and battery drain during extended use. For XR glasses, where users might wear them for hours, this wasn’t sustainable.
Discovering Nexa SDK
The Nexa SDK, developed in partnership with Qualcomm, provides a unified interface for running AI models directly on the Hexagon NPU (Neural Processing Unit). The promise was compelling:
- 2x faster inference compared to CPU-only processing
- 9x better energy efficiency (critical for XR)
- Native NPU acceleration without low-level DSP programming
- Model hub with pre-optimized models for ASR, LLM, and VLM tasks
This aligned perfectly with LiveCaptionsXR’s architecture. We could replace Whisper with Nexa’s NPU-accelerated ASR and Gemma 3n with Nexa’s LLM models, all while maintaining our privacy-first, on-device approach.
The Integration Journey
Phase 1: ASR on NPU
The first step was replacing our Whisper GGML service with Nexa's ASR capabilities. We chose Parakeet TDT 0.6B, a roughly 0.6 GB model optimized for real-time speech recognition on the Hexagon NPU.
```dart
// lib/core/services/nexa_asr_service.dart
import 'dart:typed_data';

class NexaAsrService implements ISpeechService {
  // Platform-channel wrapper around the native Nexa SDK bindings.
  final NexaPlatform _nexaPlatform;

  NexaAsrService(this._nexaPlatform);

  @override
  Future<void> initialize() async {
    // Initialize the Nexa SDK with the NPU plugin and the Parakeet model.
    await _nexaPlatform.initialize(
      plugin: PluginType.NPU,
      modelId: 'parakeet-tdt-0.6b-v3-npu',
    );
  }

  @override
  Stream<SpeechResult> transcribeStream(Float32List audioData) async* {
    // Real-time transcription on the Hexagon NPU.
    final result = await _nexaPlatform.transcribe(audioData);
    yield SpeechResult(text: result.text, confidence: result.confidence);
  }
}
```
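To give a sense of how this service is driven, here is a minimal usage sketch: buffered microphone frames are pushed through `transcribeStream` and each result is printed. The `runCaptions` helper and the `micFrames` stream are hypothetical names; only `NexaAsrService` and `SpeechResult` come from the snippet above.

```dart
import 'dart:typed_data';

// Hypothetical wiring: runCaptions and micFrames are illustrative names,
// standing in for however the app buffers microphone audio.
Future<void> runCaptions(NexaAsrService asr, Stream<Float32List> micFrames) async {
  await asr.initialize();
  await for (final frame in micFrames) {
    await for (final result in asr.transcribeStream(frame)) {
      // Each SpeechResult carries recognized text plus a confidence score.
      print('[${result.confidence.toStringAsFixed(2)}] ${result.text}');
    }
  }
}
```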
The integration required creating a Flutter method channel to bridge Dart and the native Android Kotlin code that interfaces with Nexa SDK:
```kotlin
// android/app/src/main/kotlin/com/livecaptionsxr/app/NexaAsrPlugin.kt
package com.livecaptionsxr.app

import io.flutter.plugin.common.MethodCall
import io.flutter.plugin.common.MethodChannel

class NexaAsrPlugin : MethodChannel.MethodCallHandler {
    // Nexa SDK wrapper that owns the NPU-backed ASR session.
    private lateinit var asrWrapper: AsrWrapper

    override fun onMethodCall(call: MethodCall, result: MethodChannel.Result) {
        when (call.method) {
            "initialize" -> {
                // Build the ASR wrapper against the Hexagon NPU plugin.
                asrWrapper = AsrWrapper.builder()
                    .setPlugin(PluginType.NPU)
                    .setModelId("parakeet-tdt-0.6b-v3-npu")
                    .build()
                result.success(true)
            }
            "transcribe" -> {
                // Raw audio bytes arrive from the Dart side of the channel.
                val audioData = call.arguments as ByteArray
                val transcription = asrWrapper.transcribe(audioData)
                result.success(transcription)
            }
            else -> result.notImplemented()
        }
    }
}
```
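On the Dart side, the service invokes this plugin over a standard `MethodChannel`. The sketch below is a hedged illustration of that half of the bridge; the channel name `com.livecaptionsxr/nexa_asr` and the byte packing are assumptions, not the app's exact code.

```dart
import 'dart:typed_data';
import 'package:flutter/services.dart';

// Illustrative Dart half of the bridge; the channel name is an assumption.
class NexaAsrChannel {
  static const MethodChannel _channel =
      MethodChannel('com.livecaptionsxr/nexa_asr');

  Future<bool> initialize() async =>
      await _channel.invokeMethod<bool>('initialize') ?? false;

  Future<String?> transcribe(Float32List audio) {
    // The Kotlin handler expects raw bytes, so view the float buffer as bytes.
    final bytes =
        audio.buffer.asUint8List(audio.offsetInBytes, audio.lengthInBytes);
    return _channel.invokeMethod<String>('transcribe', bytes);
  }
}
```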
Phase 2: LLM Enhancement on NPU
Next, we integrated Nexa’s LLM for text enhancement. The LFM2-1.2B model (0.75GB) runs on NPU to add punctuation, capitalization, and context to raw transcriptions:
```dart
// lib/core/services/nexa_llm_service.dart
class NexaLlmService {
  // Same platform-channel wrapper used by the ASR service.
  final NexaPlatform _nexaPlatform;

  NexaLlmService(this._nexaPlatform);

  Future<String> enhanceText(String rawText) async {
    // Ask the on-device LLM to clean up the raw transcription.
    final prompt = 'Add punctuation and capitalization to: $rawText';
    final enhanced = await _nexaPlatform.generate(
      prompt: prompt,
      modelId: 'LFM2-1.2B-npu',
      maxTokens: 100,
    );
    return enhanced.text;
  }
}
```
This replaced our previous Gemma 3n service, reducing model size from 4.11GB to 0.75GB while gaining NPU acceleration.
Phase 3: Visual Context with OmniNeural-4B
For future multimodal capabilities, we also integrated OmniNeural-4B (4 GB), a vision-language model that understands visual context. It isn't in production yet, but it lays the groundwork for features such as identifying speakers through face detection and understanding the surrounding environment.
Qualcomm Developer Cloud: The Validation Platform
Testing NPU-accelerated AI on physical hardware is challenging—especially when targeting cutting-edge Snapdragon chipsets. Qualcomm Developer Cloud (QDC) solved this by providing remote access to reference devices, including the Snapdragon 8 Elite (QRD8750).
QDC Testing Workflow
- Device Provisioning: Requested access to a Snapdragon 8 Elite device via QDC portal
- APK Deployment: Built a debug APK with Nexa SDK integration and deployed via ADB
- NPU Verification: Confirmed Hexagon NPU availability and initialization
- End-to-End Testing: Validated the complete pipeline from audio capture to AR caption rendering
Test Results (v1.0.34+)
Our QDC testing validated the entire stack:
✅ Nexa ASR initialized in NPU mode - Parakeet TDT 0.6B loaded successfully on Hexagon NPU
✅ Audio capture pipeline operational - 16kHz stereo audio streaming working correctly
✅ Real-time transcription pipeline functional - Sub-500ms latency achieved end-to-end
✅ LLM text enhancement working - LFM2-1.2B on NPU adding punctuation and context in real-time
✅ AR caption placement stable - Captions anchored correctly in 3D space via ARCore
The QDC environment was invaluable for:
- Performance benchmarking without having to procure physical hardware
- Validating NPU initialization and model loading
- Testing edge cases in a controlled environment
- Documenting results for the Nexa AI bounty program submission
Architecture: The Complete Pipeline
Here’s how the integrated system works:
```
Audio Capture (16 kHz stereo)
        ↓
Nexa ASR (Hexagon NPU): Parakeet TDT 0.6B
        ↓  speech-to-text
Nexa LLM (Hexagon NPU): LFM2-1.2B
        ↓  punctuation & enhancement
Speaker Diarization → Voice Embedding → Speaker ID
        ↓
Hybrid Localization (Kalman filter: audio + visual + IMU)
        ↓
ARCore → 3D Caption Placement at Speaker Location
```
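Expressed in code, the caption path is a chain of these stages. The sketch below reuses the services shown earlier; `SpeakerLocalizer` and `CaptionRenderer` are hypothetical stand-ins for the diarization/localization and ARCore layers.

```dart
import 'dart:typed_data';

// Hedged sketch of the end-to-end caption path. SpeakerLocalizer and
// CaptionRenderer are placeholder abstractions, not the app's real classes.
Future<void> processFrame(
  Float32List audioFrame,
  NexaAsrService asr,
  NexaLlmService llm,
  SpeakerLocalizer localizer,
  CaptionRenderer renderer,
) async {
  await for (final speech in asr.transcribeStream(audioFrame)) {
    // 1. Clean up the raw transcription with the NPU-backed LLM.
    final caption = await llm.enhanceText(speech.text);
    // 2. Fuse audio, visual, and IMU cues into a 3D speaker position.
    final position = await localizer.estimateSpeakerPosition(audioFrame);
    // 3. Anchor the caption at that position via the ARCore layer.
    await renderer.placeCaption(caption, position);
  }
}
```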
Key Models in Production
| Model | Type | Size | NPU | Purpose |
|---|---|---|---|---|
| Parakeet TDT 0.6B | ASR | 0.6 GB | Yes | Real-time speech-to-text |
| LFM2-1.2B | LLM | 0.75 GB | Yes | Caption enhancement & punctuation |
| OmniNeural-4B | VLM | 4 GB | Yes | Visual context awareness (future) |
| Whisper GGML | ASR | 141 MB | No | Fallback for non-Snapdragon devices |
Performance Gains
The migration to Nexa SDK delivered measurable improvements:
- Latency: Reduced from ~800ms (CPU) to ~400ms (NPU) for ASR
- Energy Efficiency: roughly 9x lower power consumption during continuous transcription
- Thermal Performance: No thermal throttling during 30+ minute sessions
- Concurrent Processing: NPU handles ASR while CPU/GPU remain free for AR rendering
These gains are critical for XR headsets, where battery life and thermal management are primary constraints.
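As a rough illustration of how such a number can be spot-checked on a target device, the sketch below times a single audio frame through the ASR service. It is a toy measurement, not the benchmarking setup behind the figures above.

```dart
import 'dart:typed_data';

// Toy latency probe: time one frame through the ASR stage.
Future<Duration> measureAsrLatency(NexaAsrService asr, Float32List frame) async {
  final stopwatch = Stopwatch()..start();
  await asr.transcribeStream(frame).first; // wait for the first result
  stopwatch.stop();
  return stopwatch.elapsed;
}
```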
Lessons Learned
1. Model Selection Matters
Choosing the right model size is a balancing act. Parakeet TDT 0.6B provides excellent accuracy for real-time captioning while staying within NPU memory constraints. Larger models (like OmniNeural-4B) offer more capabilities but require careful memory management.
2. Fallback Strategies Are Essential
Not all Android devices have Snapdragon chipsets with Hexagon NPU. We maintain Whisper GGML as a CPU fallback, ensuring the app works across all Android devices while optimizing for Snapdragon when available.
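A minimal version of that selection logic might look like the sketch below; `hasHexagonNpu()` and `WhisperGgmlService` are illustrative placeholders for the app's actual capability check and CPU fallback service.

```dart
// Hedged sketch: prefer the NPU-backed service and fall back to Whisper GGML
// on CPU. hasHexagonNpu() and WhisperGgmlService are illustrative placeholders.
Future<ISpeechService> selectSpeechService(NexaPlatform platform) async {
  if (await hasHexagonNpu()) {
    final nexa = NexaAsrService(platform);
    try {
      await nexa.initialize(); // fails if the NPU path is unavailable
      return nexa;
    } catch (_) {
      // Fall through to the CPU path if NPU initialization fails.
    }
  }
  return WhisperGgmlService(); // CPU fallback for non-Snapdragon devices
}
```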
3. QDC Accelerated Development
Access to QDC eliminated the hardware procurement bottleneck. We could test on cutting-edge chipsets without purchasing expensive reference devices, dramatically accelerating our development cycle.
4. Privacy-First Architecture
Nexa SDK’s on-device processing aligns perfectly with LiveCaptionsXR’s privacy-first approach. All audio processing happens locally—no data ever leaves the device, which is critical for accessibility tools handling sensitive conversations.
The Impact
LiveCaptionsXR with Nexa SDK integration represents a significant step forward for accessibility technology:
- 466 million people with hearing loss can benefit from real-time spatial captions
- Privacy-first processing ensures sensitive conversations stay local
- Energy-efficient design enables all-day use on XR headsets
- Production-ready architecture validated on Qualcomm’s reference hardware
What’s Next
The integration is complete, but the journey continues:
- OmniNeural-4B integration for visual context awareness and speaker identification
- Multi-language support using Nexa’s translation capabilities
- Samsung Galaxy XR optimization for the upcoming XR headset launch
- Open-source contributions to the Nexa SDK Flutter plugin
Acknowledgments
This integration wouldn’t have been possible without:
- Nexa AI for the SDK and model optimization
- Qualcomm for QDC access and Hexagon NPU architecture
- The open-source community contributing to Flutter and ARCore
The combination of Nexa SDK’s developer-friendly API and Qualcomm’s QDC testing environment made it possible to build production-grade on-device AI for accessibility—something that would have been prohibitively complex just a few years ago.
LiveCaptionsXR is open source and available on GitHub. Download the latest APK from the releases page or visit livecaptionsxr.com to learn more.
Built with Nexa SDK for Qualcomm Hexagon NPU acceleration.