Building LiveCaptionsXR with Nexa AI and Qualcomm's QDC Test Environment
When I first built LiveCaptionsXR, the vision was clear: real-time, spatially-aware closed captions that appear exactly where voices originate in 3D space. But achieving this at scale—with the low latency and energy efficiency required for XR headsets—required a fundamental shift from CPU-based processing to dedicated AI acceleration hardware.
Enter Nexa AI and Qualcomm’s Hexagon NPU. This post chronicles how we integrated Nexa SDK into LiveCaptionsXR and validated the entire pipeline on Qualcomm Developer Cloud (QDC), transforming the app from a proof-of-concept into a production-ready accessibility tool powered by on-device AI.
The Challenge: Real-Time AI at Scale
LiveCaptionsXR is built for the 466 million people worldwide living with disabling hearing loss. The core requirements are unforgiving:
- Sub-500ms latency for real-time captioning
- Energy efficiency critical for XR headsets (battery life matters)
- 100% on-device processing for privacy (no audio data ever leaves the device)
- Concurrent AI workloads (ASR + LLM enhancement + visual understanding)
Our initial stack used Whisper GGML (CPU) and Gemma 3n (CPU/GPU), which worked but struggled with thermal throttling and battery drain during extended use. For XR glasses, where users might wear them for hours, this wasn’t sustainable.
Discovering Nexa SDK
The Nexa SDK, developed in partnership with Qualcomm, provides a unified interface for running AI models directly on the Hexagon NPU (Neural Processing Unit). The promise was compelling:
- 2x faster inference compared to CPU-only processing
- 9x better energy efficiency (critical for XR)
- Native NPU acceleration without low-level DSP programming
- Model hub with pre-optimized models for ASR, LLM, and VLM tasks
This aligned perfectly with LiveCaptionsXR’s architecture. We could replace Whisper with Nexa’s NPU-accelerated ASR and Gemma 3n with Nexa’s LLM models, all while maintaining our privacy-first, on-device approach.
The Integration Journey
Phase 1: ASR on NPU
The first step was replacing our Whisper GGML service with Nexa's ASR capabilities. We chose Parakeet TDT 0.6B, a roughly 0.6 GB model optimized for real-time speech recognition on the Hexagon NPU.
```dart
// lib/core/services/nexa_asr_service.dart
import 'dart:typed_data';

class NexaAsrService implements ISpeechService {
  // Platform-channel wrapper around the native Nexa SDK bindings.
  final NexaPlatform _nexaPlatform;

  NexaAsrService(this._nexaPlatform);

  @override
  Future<void> initialize() async {
    // Initialize the Nexa SDK with the NPU plugin and the Parakeet model.
    await _nexaPlatform.initialize(
      plugin: PluginType.NPU,
      modelId: 'parakeet-tdt-0.6b-v3-npu',
    );
  }

  @override
  Stream<SpeechResult> transcribeStream(Float32List audioData) async* {
    // Real-time transcription on the Hexagon NPU.
    final result = await _nexaPlatform.transcribe(audioData);
    yield SpeechResult(text: result.text, confidence: result.confidence);
  }
}
```
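To give a sense of how this service is driven, here is a minimal usage sketch: buffered microphone frames are pushed through `transcribeStream` and each result is printed. The `runCaptions` helper and the `micFrames` stream are hypothetical names; only `NexaAsrService` and `SpeechResult` come from the snippet above.

```dart
import 'dart:typed_data';

// Hypothetical wiring: runCaptions and micFrames are illustrative names,
// standing in for however the app buffers microphone audio.
Future<void> runCaptions(NexaAsrService asr, Stream<Float32List> micFrames) async {
  await asr.initialize();
  await for (final frame in micFrames) {
    await for (final result in asr.transcribeStream(frame)) {
      // Each SpeechResult carries recognized text plus a confidence score.
      print('[${result.confidence.toStringAsFixed(2)}] ${result.text}');
    }
  }
}
```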
The integration required creating a Flutter method channel to bridge Dart and the native Android Kotlin code that interfaces with Nexa SDK:
```kotlin
// android/app/src/main/kotlin/com/livecaptionsxr/app/NexaAsrPlugin.kt
package com.livecaptionsxr.app

import io.flutter.plugin.common.MethodCall
import io.flutter.plugin.common.MethodChannel

class NexaAsrPlugin : MethodChannel.MethodCallHandler {
    // Nexa SDK wrapper that owns the NPU-backed ASR session.
    private lateinit var asrWrapper: AsrWrapper

    override fun onMethodCall(call: MethodCall, result: MethodChannel.Result) {
        when (call.method) {
            "initialize" -> {
                // Build the ASR wrapper against the Hexagon NPU plugin.
                asrWrapper = AsrWrapper.builder()
                    .setPlugin(PluginType.NPU)
                    .setModelId("parakeet-tdt-0.6b-v3-npu")
                    .build()
                result.success(true)
            }
            "transcribe" -> {
                // Raw audio bytes arrive from the Dart side of the channel.
                val audioData = call.arguments as ByteArray
                val transcription = asrWrapper.transcribe(audioData)
                result.success(transcription)
            }
            else -> result.notImplemented()
        }
    }
}
```
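On the Dart side, the service invokes this plugin over a standard `MethodChannel`. The sketch below is a hedged illustration of that half of the bridge; the channel name `com.livecaptionsxr/nexa_asr` and the byte packing are assumptions, not the app's exact code.

```dart
import 'dart:typed_data';
import 'package:flutter/services.dart';

// Illustrative Dart half of the bridge; the channel name is an assumption.
class NexaAsrChannel {
  static const MethodChannel _channel =
      MethodChannel('com.livecaptionsxr/nexa_asr');

  Future<bool> initialize() async =>
      await _channel.invokeMethod<bool>('initialize') ?? false;

  Future<String?> transcribe(Float32List audio) {
    // The Kotlin handler expects raw bytes, so view the float buffer as bytes.
    final bytes =
        audio.buffer.asUint8List(audio.offsetInBytes, audio.lengthInBytes);
    return _channel.invokeMethod<String>('transcribe', bytes);
  }
}
```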
Phase 2: LLM Enhancement on NPU
Next, we integrated Nexa’s LLM for text enhancement. The LFM2-1.2B model (0.75GB) runs on NPU to add punctuation, capitalization, and context to raw transcriptions:
```dart
// lib/core/services/nexa_llm_service.dart
class NexaLlmService {
  // Same platform-channel wrapper used by the ASR service.
  final NexaPlatform _nexaPlatform;

  NexaLlmService(this._nexaPlatform);

  Future<String> enhanceText(String rawText) async {
    // Ask the on-device LLM to clean up the raw transcription.
    final prompt = 'Add punctuation and capitalization to: $rawText';
    final enhanced = await _nexaPlatform.generate(
      prompt: prompt,
      modelId: 'LFM2-1.2B-npu',
      maxTokens: 100,
    );
    return enhanced.text;
  }
}
```
This replaced our previous Gemma 3n service, reducing model size from 4.11GB to 0.75GB while gaining NPU acceleration.
Phase 3: Visual Context with OmniNeural-4B
For future multimodal capabilities, we also integrated OmniNeural-4B (4 GB), a vision-language model that understands visual context. It isn't in production yet, but it lays the groundwork for features such as identifying speakers through face detection and understanding the surrounding environment.
Qualcomm Developer Cloud: The Validation Platform
Testing NPU-accelerated AI on physical hardware is challenging—especially when targeting cutting-edge Snapdragon chipsets. Qualcomm Developer Cloud (QDC) solved this by providing remote access to reference devices, including the Snapdragon 8 Elite (QRD8750).
QDC Testing Workflow
- Device Provisioning: Requested access to a Snapdragon 8 Elite device via QDC portal
- APK Deployment: Built a debug APK with Nexa SDK integration and deployed via ADB
- NPU Verification: Confirmed Hexagon NPU availability and initialization
- End-to-End Testing: Validated the complete pipeline from audio capture to AR caption rendering
Test Results (v1.0.34+)
Our QDC testing validated the entire stack:
✅ Nexa ASR initialized in NPU mode - Parakeet TDT 0.6B loaded successfully on Hexagon NPU
✅ Audio capture pipeline operational - 16kHz stereo audio streaming working correctly
✅ Real-time transcription pipeline functional - Sub-500ms latency achieved end-to-end
✅ LLM text enhancement working - LFM2-1.2B on NPU adding punctuation and context in real-time
✅ AR caption placement stable - Captions anchored correctly in 3D space via ARCore
The QDC environment was invaluable for:
- Performance benchmarking without having to procure physical hardware
- Validating NPU initialization and model loading
- Testing edge cases in a controlled environment
- Documenting results for the Nexa AI bounty program submission
Architecture: The Complete Pipeline
Here’s how the integrated system works:
```
Audio Capture (16 kHz stereo)
        ↓
Nexa ASR (Hexagon NPU): Parakeet TDT 0.6B
        ↓  speech-to-text
Nexa LLM (Hexagon NPU): LFM2-1.2B
        ↓  punctuation & enhancement
Speaker Diarization → Voice Embedding → Speaker ID
        ↓
Hybrid Localization (Kalman filter: audio + visual + IMU)
        ↓
ARCore → 3D Caption Placement at Speaker Location
```
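Expressed in code, the caption path is a chain of these stages. The sketch below reuses the services shown earlier; `SpeakerLocalizer` and `CaptionRenderer` are hypothetical stand-ins for the diarization/localization and ARCore layers.

```dart
import 'dart:typed_data';

// Hedged sketch of the end-to-end caption path. SpeakerLocalizer and
// CaptionRenderer are placeholder abstractions, not the app's real classes.
Future<void> processFrame(
  Float32List audioFrame,
  NexaAsrService asr,
  NexaLlmService llm,
  SpeakerLocalizer localizer,
  CaptionRenderer renderer,
) async {
  await for (final speech in asr.transcribeStream(audioFrame)) {
    // 1. Clean up the raw transcription with the NPU-backed LLM.
    final caption = await llm.enhanceText(speech.text);
    // 2. Fuse audio, visual, and IMU cues into a 3D speaker position.
    final position = await localizer.estimateSpeakerPosition(audioFrame);
    // 3. Anchor the caption at that position via the ARCore layer.
    await renderer.placeCaption(caption, position);
  }
}
```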
Key Models in Production
| Model | Type | Size | NPU | Purpose |
|---|---|---|---|---|
| Parakeet TDT 0.6B | ASR | 0.6 GB | Yes | Real-time speech-to-text |
| LFM2-1.2B | LLM | 0.75 GB | Yes | Caption enhancement & punctuation |
| OmniNeural-4B | VLM | 4 GB | Yes | Visual context awareness (future) |
| Whisper GGML | ASR | 141 MB | No | Fallback for non-Snapdragon devices |
Performance Gains
The migration to Nexa SDK delivered measurable improvements:
- Latency: Reduced from ~800ms (CPU) to ~400ms (NPU) for ASR
- Energy Efficiency: roughly 9x lower power consumption during continuous transcription
- Thermal Performance: No thermal throttling during 30+ minute sessions
- Concurrent Processing: NPU handles ASR while CPU/GPU remain free for AR rendering
These gains are critical for XR headsets, where battery life and thermal management are primary constraints.
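As a rough illustration of how such a number can be spot-checked on a target device, the sketch below times a single audio frame through the ASR service. It is a toy measurement, not the benchmarking setup behind the figures above.

```dart
import 'dart:typed_data';

// Toy latency probe: time one frame through the ASR stage.
Future<Duration> measureAsrLatency(NexaAsrService asr, Float32List frame) async {
  final stopwatch = Stopwatch()..start();
  await asr.transcribeStream(frame).first; // wait for the first result
  stopwatch.stop();
  return stopwatch.elapsed;
}
```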
Lessons Learned
1. Model Selection Matters
Choosing the right model size is a balancing act. Parakeet TDT 0.6B provides excellent accuracy for real-time captioning while staying within NPU memory constraints. Larger models (like OmniNeural-4B) offer more capabilities but require careful memory management.
2. Fallback Strategies Are Essential
Not all Android devices have Snapdragon chipsets with Hexagon NPU. We maintain Whisper GGML as a CPU fallback, ensuring the app works across all Android devices while optimizing for Snapdragon when available.
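A minimal version of that selection logic might look like the sketch below; `hasHexagonNpu()` and `WhisperGgmlService` are illustrative placeholders for the app's actual capability check and CPU fallback service.

```dart
// Hedged sketch: prefer the NPU-backed service and fall back to Whisper GGML
// on CPU. hasHexagonNpu() and WhisperGgmlService are illustrative placeholders.
Future<ISpeechService> selectSpeechService(NexaPlatform platform) async {
  if (await hasHexagonNpu()) {
    final nexa = NexaAsrService(platform);
    try {
      await nexa.initialize(); // fails if the NPU path is unavailable
      return nexa;
    } catch (_) {
      // Fall through to the CPU path if NPU initialization fails.
    }
  }
  return WhisperGgmlService(); // CPU fallback for non-Snapdragon devices
}
```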
3. QDC Accelerated Development
Access to QDC eliminated the hardware procurement bottleneck. We could test on cutting-edge chipsets without purchasing expensive reference devices, dramatically accelerating our development cycle.
4. Privacy-First Architecture
Nexa SDK’s on-device processing aligns perfectly with LiveCaptionsXR’s privacy-first approach. All audio processing happens locally—no data ever leaves the device, which is critical for accessibility tools handling sensitive conversations.
The Impact
LiveCaptionsXR with Nexa SDK integration represents a significant step forward for accessibility technology:
- 466 million people with hearing loss can benefit from real-time spatial captions
- Privacy-first processing ensures sensitive conversations stay local
- Energy-efficient design enables all-day use on XR headsets
- Production-ready architecture validated on Qualcomm’s reference hardware
What’s Next
The integration is complete, but the journey continues:
- OmniNeural-4B integration for visual context awareness and speaker identification
- Multi-language support using Nexa’s translation capabilities
- Samsung Galaxy XR optimization for the upcoming XR headset launch
- Open-source contributions to the Nexa SDK Flutter plugin
Acknowledgments
This integration wouldn’t have been possible without:
- Nexa AI for the SDK and model optimization
- Qualcomm for QDC access and Hexagon NPU architecture
- The open-source community contributing to Flutter and ARCore
The combination of Nexa SDK’s developer-friendly API and Qualcomm’s QDC testing environment made it possible to build production-grade on-device AI for accessibility—something that would have been prohibitively complex just a few years ago.
LiveCaptionsXR is open source and available on GitHub. Download the latest APK from the releases page or visit livecaptionsxr.com to learn more.
Built with Nexa SDK for Qualcomm Hexagon NPU acceleration.