Project Overview
The Challenge
Virtual and augmented reality environments present unique accessibility challenges for deaf and hard-of-hearing users. Traditional captioning systems don't work effectively in 3D spaces where audio sources can be positioned anywhere.
Users need captions that appear in the correct spatial location, follow speakers as they move, and maintain readability in dynamic XR environments.
The Solution
LiveCaptionsXR provides real-time spatial captioning that positions text in 3D space relative to audio sources. The system uses speech recognition and spatial audio processing to create an accessible XR experience.
Captions appear near speakers, follow their movement, and adapt to user preferences for size, color, and positioning in virtual environments.
Key Features
Spatial Positioning
Captions appear in 3D space relative to audio sources, making it clear which person is speaking in virtual environments.
- ARCore/ARKit spatial anchoring
- GCC-PHAT stereo-mic direction-of-arrival estimation
- Kalman filter: audio + visual + IMU fusion
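The direction-of-arrival step above can be sketched in a few lines. This is an illustrative NumPy version of the GCC-PHAT technique (not the app's actual Flutter-pipeline code): the cross-spectrum of the two mic channels is whitened so only phase remains, and the correlation peak gives the inter-mic delay, which maps to an angle.

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the time delay between two mic channels via GCC-PHAT.

    Cross-correlates in the frequency domain with PHAT weighting
    (spectral whitening), so the peak location depends only on phase.
    """
    n = sig.shape[0] + ref.shape[0]
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    R /= np.abs(R) + 1e-15            # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    # Re-center so index max_shift corresponds to zero lag.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs                 # delay in seconds

def doa_from_delay(tau, mic_distance, speed_of_sound=343.0):
    """Convert an inter-mic delay to a direction-of-arrival angle (degrees)."""
    sin_theta = np.clip(tau * speed_of_sound / mic_distance, -1.0, 1.0)
    return float(np.degrees(np.arcsin(sin_theta)))
```

The resulting angle is only one noisy measurement; in the app it is fused with visual and IMU cues (the Kalman step above) before a caption anchor is placed.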
Real-time Speech Recognition
Advanced speech-to-text processing with low latency for seamless conversation flow in virtual environments.
- On-device, privacy-first (no audio leaves the device)
- Speaker identification via diarization
- Multi-language support (planned via Nexa)
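Low-latency captioning typically means feeding short audio chunks through a rolling window rather than waiting for a full utterance. A minimal sketch of that pattern, with `recognize` standing in for whichever on-device backend is active (Nexa NPU path, Whisper GGML, or Apple Speech) — the function names here are illustrative, not the app's API:

```python
from collections import deque

def stream_captions(audio_chunks, recognize, window_chunks=3):
    """Yield a partial transcript after every incoming audio chunk.

    Keeps a rolling window of the most recent chunks so the recognizer
    always sees enough context, while stale audio is dropped to bound
    latency and memory.
    """
    window = deque(maxlen=window_chunks)
    for chunk in audio_chunks:
        window.append(chunk)
        yield recognize(b"".join(window))
```

Each yielded partial can be rendered immediately and replaced when the next, more complete result arrives.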
Accessibility Customization
User-configurable settings for caption appearance, positioning, and behavior to meet individual accessibility needs.
- Caption appearance tuning (size, contrast)
- Spatial positioning preferences
- Privacy-first: all processing on-device
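A settings object for these preferences might look like the following. The field names and defaults are illustrative assumptions, not the app's actual schema:

```python
from dataclasses import dataclass

@dataclass
class CaptionPreferences:
    """User-tunable caption settings (hypothetical field names)."""
    font_scale: float = 1.0        # relative text size
    high_contrast: bool = True     # white-on-dark background panel
    anchor_offset_m: float = 0.15  # vertical offset above the speaker, meters
    follow_speaker: bool = True    # re-anchor captions as the speaker moves
```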
Cross-Platform Fallback
Optimized for Snapdragon devices with Hexagon NPU; falls back gracefully on any Android or iOS device.
- Nexa SDK (NPU path) on Snapdragon
- Whisper GGML (CPU fallback) on other Android
- Apple Speech (SFSpeechRecognizer) on iOS
- ARKit on iOS
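The fallback chain above amounts to a simple priority check at startup. A sketch of that selection logic, with illustrative platform and backend labels rather than the app's actual identifiers:

```python
def pick_asr_backend(platform: str, has_hexagon_npu: bool) -> str:
    """Choose a speech-recognition backend in the priority order above."""
    if platform == "ios":
        return "apple_speech"        # SFSpeechRecognizer, paired with ARKit
    if platform == "android" and has_hexagon_npu:
        return "nexa_npu"            # Nexa SDK on the Hexagon NPU
    if platform == "android":
        return "whisper_ggml_cpu"    # CPU fallback on other Android devices
    raise ValueError(f"unsupported platform: {platform}")
```

Because every branch runs on-device, the privacy guarantee holds regardless of which backend is selected.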
Technical Stack
Mobile & XR
- Flutter (cross-platform app)
- ARCore (Android) / ARKit (iOS) spatial anchoring
On-device AI
- Nexa SDK with Parakeet TDT 0.6B on the Hexagon NPU
- Whisper GGML (CPU fallback)
- LFM2-1.2B text enhancement
- Apple Speech (SFSpeechRecognizer) on iOS
Use Cases & Applications
Virtual Education
Enable deaf/HoH students to participate fully in virtual classrooms, workshops, and training sessions with spatial captioning.
Remote Work
Facilitate inclusive virtual meetings and collaboration sessions for teams with deaf/HoH members.
Gaming & Entertainment
Make VR games and social platforms accessible with real-time spatial captioning for voice chat and audio content.
Performance (Hexagon NPU, Snapdragon 8 Elite)
- Latency: down from ~800ms on CPU, validated on QDC Snapdragon 8 Elite
- Comparison: Nexa SDK NPU path vs. CPU-only Whisper GGML
- Power efficiency: critical for all-day XR headset use
Development Process
Problem Definition
Motivated by personal experience — Craig was born mostly deaf and wears hearing aids. Existing caption apps produce flat transcripts with no spatial context, making it impossible to tell who is speaking or where. That gap drove the initial build.
Production App Development
Built as a production Flutter app. Migrated from Whisper GGML (CPU) to Nexa SDK with Parakeet TDT 0.6B on the Qualcomm Hexagon NPU — halving latency, eliminating cloud dependency, and resolving thermal throttling during extended XR sessions.
Beta Testing & Iteration
Beta-tested with Deaf friends — the key finding: users instinctively turned toward wherever a caption appeared, regaining spatial awareness they described as transformative. Iterated on caption positioning, size, and contrast from that feedback.
Hardware Validation
Validated the full pipeline on Qualcomm Developer Cloud (QDC) using a Snapdragon 8 Elite reference device — confirmed Hexagon NPU initialization, sub-500ms end-to-end latency, ARCore caption placement, and LFM2-1.2B text enhancement in production conditions.
What's Next
AI Capabilities
- OmniNeural-4B (VLM) — visual context awareness and speaker identification by face
- Multi-language support via Nexa's translation models
- Open-source contributions to the Nexa SDK Flutter plugin
Platform Targets
- Samsung Galaxy XR — optimization for the upcoming XR headset
- Apple XR glasses — spatial captioning in a glasses form factor
Interested in XR Accessibility Solutions?
Let's discuss how to make your virtual and augmented reality applications accessible to all users.