
LiveCaptionsXR

Spatial captioning for deaf/HoH users in XR environments

Project Overview

The Challenge

Virtual and augmented reality environments present unique accessibility challenges for deaf and hard-of-hearing users. Traditional captioning systems don't work effectively in 3D spaces where audio sources can be positioned anywhere.

Users need captions that appear in the correct spatial location, follow speakers as they move, and maintain readability in dynamic XR environments.

The Solution

LiveCaptionsXR provides real-time spatial captioning that positions text in 3D space relative to audio sources. The system uses speech recognition and spatial audio processing to create an accessible XR experience.

Captions appear near speakers, follow their movement, and adapt to user preferences for size, color, and positioning in virtual environments.

Key Features

📍

Spatial Positioning

Captions appear in 3D space relative to audio sources, making it clear which person is speaking in virtual environments.

  • ARCore/ARKit spatial anchoring
  • GCC-PHAT stereo-mic direction-of-arrival (sketched below)
  • Kalman filter fusing audio, visual, and IMU estimates
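
To make the last two bullets concrete, here is an illustrative NumPy sketch (not the app's Flutter code): GCC-PHAT cross-correlates the two microphone channels with phase-transform weighting to estimate a time difference of arrival, a far-field model converts that to an angle, and a scalar Kalman filter smooths the angle stream. Mic spacing, sample rate, and noise parameters are assumptions.

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau):
    """TDOA between two mic channels via GCC with PHAT weighting."""
    n = len(sig) + len(ref)
    R = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    R /= np.abs(R) + 1e-12                # PHAT: keep phase, drop magnitude
    cc = np.fft.irfft(R, n=n)
    max_shift = min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs   # TDOA in seconds

def doa_degrees(tdoa, mic_distance, c=343.0):
    """Far-field model: sin(theta) = c * tdoa / mic_distance."""
    return np.degrees(np.arcsin(np.clip(c * tdoa / mic_distance, -1.0, 1.0)))

class ScalarKalman:
    """1-D random-walk Kalman filter; in the real fusion, visual and IMU
    cues would enter as additional measurement updates."""
    def __init__(self, q=1.0, r=25.0):
        self.x, self.p, self.q, self.r = 0.0, 90.0, q, r
    def update(self, z):
        self.p += self.q                  # predict: uncertainty grows
        k = self.p / (self.p + self.r)    # Kalman gain
        self.x += k * (z - self.x)        # correct toward measurement z
        self.p *= 1.0 - k
        return self.x

# Example with an assumed 14 cm mic baseline at 16 kHz:
# tdoa  = gcc_phat(left, right, fs=16_000, max_tau=0.14 / 343)
# angle = ScalarKalman().update(doa_degrees(tdoa, mic_distance=0.14))
```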
🎤

Real-time Speech Recognition

Low-latency, on-device speech-to-text (~400 ms on the Hexagon NPU) keeps captions in step with conversation in virtual environments; a minimal streaming sketch follows the list below.

  • On-device, privacy-first (no audio leaves the device)
  • Speaker identification via diarization
  • Multi-language support (planned via Nexa)
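
The streaming sketch mentioned above: a loop that reassembles raw microphone chunks into overlapping windows before handing them to a recognizer. `transcribe` stands in for whatever backend is active; it is a hypothetical placeholder, not a Nexa SDK call, and the window/hop sizes are assumptions.

```python
import numpy as np

SAMPLE_RATE = 16_000        # assumption: typical ASR input rate
WINDOW_S, HOP_S = 1.0, 0.5  # assumption: 1 s windows, 0.5 s hop

def stream_windows(pcm_chunks):
    """Yield overlapping fixed-size windows from a stream of PCM chunks."""
    window = int(WINDOW_S * SAMPLE_RATE)
    hop = int(HOP_S * SAMPLE_RATE)
    buf = np.empty(0, dtype=np.float32)
    for chunk in pcm_chunks:
        buf = np.concatenate((buf, chunk))
        while len(buf) >= window:
            yield buf[:window]
            buf = buf[hop:]  # slide forward, keeping overlap for context

# for win in stream_windows(mic_chunks):
#     show_caption(transcribe(win))  # transcribe(): hypothetical backend call
```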
⚙️

Accessibility Customization

User-configurable settings for caption appearance, positioning, and behavior to meet individual accessibility needs.

  • Caption appearance tuning (size, contrast)
  • Spatial positioning preferences
  • Privacy-first: all processing on-device (a settings sketch follows this list)
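
The settings sketch referenced above: a plain data model that could hold the listed preferences. Field names and defaults are illustrative assumptions, not the app's actual schema.

```python
from dataclasses import dataclass

@dataclass
class CaptionPrefs:
    # Appearance tuning
    font_scale: float = 1.0          # relative caption size
    high_contrast: bool = False      # e.g. white text on a solid backdrop
    # Spatial positioning preferences
    anchor_to_speaker: bool = True   # place captions at the estimated source
    vertical_offset_m: float = 0.10  # meters above the spatial anchor
```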
🔄

Cross-Platform Fallback

Optimized for Snapdragon devices with the Hexagon NPU; falls back gracefully on other Android devices and on iOS.

  • Nexa SDK (NPU path) on Snapdragon (selection sketched below)
  • Whisper GGML (CPU fallback) on other Android devices
  • Apple Speech (SFSpeechRecognizer) on iOS
  • ARKit on iOS
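
A minimal sketch of how that fallback chain could be expressed; the platform string and NPU flag are hypothetical inputs, since the source does not describe the actual detection code.

```python
from enum import Enum, auto

class AsrBackend(Enum):
    NEXA_NPU = auto()      # Nexa SDK on the Hexagon NPU (Snapdragon)
    WHISPER_GGML = auto()  # Whisper GGML CPU fallback on other Android
    APPLE_SPEECH = auto()  # SFSpeechRecognizer on iOS

def select_backend(platform: str, has_hexagon_npu: bool) -> AsrBackend:
    """Mirror of the fallback chain above; how the platform and NPU are
    detected is out of scope (the two arguments are hypothetical)."""
    if platform == "ios":
        return AsrBackend.APPLE_SPEECH
    if has_hexagon_npu:
        return AsrBackend.NEXA_NPU
    return AsrBackend.WHISPER_GGML
```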

Technical Stack

Mobile & XR

Flutter (production Android app)
ARCore (Android) / ARKit (iOS) spatial anchoring
CI/CD pipeline — automated APK builds
Cloudflare Pages hosting
Spatial audio processing

On-device AI

Nexa SDK — Qualcomm Hexagon NPU-accelerated ASR
Parakeet TDT 0.6B (0.6 GB) — ASR, no cloud dependency
LFM2-1.2B (0.75 GB) — punctuation & caption enhancement on NPU
Speaker diarization + GCC-PHAT direction-of-arrival
Real-time text rendering
Whisper GGML (141 MB) — CPU fallback on non-Snapdragon Android
Apple Speech (SFSpeechRecognizer) — on-device ASR backend on iOS
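
Putting the stack together, the on-device flow reduces to two stages: the active ASR backend emits a raw transcript, then LFM2-1.2B restores punctuation and casing before the caption is rendered. A minimal sketch; `transcribe` and `punctuate` are hypothetical placeholders, not real SDK calls.

```python
def caption_pipeline(window, asr, enhancer):
    """Two-stage flow: ASR (Parakeet TDT on NPU, Whisper GGML on CPU, or
    Apple Speech on iOS) followed by an LFM2-1.2B punctuation pass."""
    raw = asr.transcribe(window)    # raw, unpunctuated transcript
    if not raw:
        return None                 # silence: no caption for this window
    return enhancer.punctuate(raw)  # caption-ready text for rendering
```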

Use Cases & Applications

🎓

Virtual Education

Enable deaf/HoH students to participate fully in virtual classrooms, workshops, and training sessions with spatial captioning.

💼

Remote Work

Facilitate inclusive virtual meetings and collaboration sessions for teams with deaf/HoH members.

🎮

Gaming & Entertainment

Make VR games and social platforms accessible with real-time spatial captioning for voice chat and audio content.

Performance (Hexagon NPU, Snapdragon 8 Elite)

~400ms
ASR Latency on NPU

Down from ~800ms on CPU — validated on QDC Snapdragon 8 Elite

~2×
Faster Inference

Nexa SDK NPU path vs CPU-only Whisper GGML

Better Energy Efficiency

Critical for all-day XR headset use

Development Process

1

Problem Definition

Motivated by personal experience — Craig was born mostly deaf and wears hearing aids. Existing caption apps produce flat transcripts with no spatial context, making it impossible to tell who is speaking or where. That gap drove the initial build.

2

Production App Development

Built as a production Flutter app. Migrated from Whisper GGML (CPU) to Nexa SDK with Parakeet TDT 0.6B on the Qualcomm Hexagon NPU — halving latency, eliminating cloud dependency, and resolving thermal throttling during extended XR sessions.

3

Beta Testing & Iteration

Beta-tested with Deaf friends — the key finding: users instinctively turned toward wherever a caption appeared, regaining spatial awareness they described as transformative. Iterated on caption positioning, size, and contrast from that feedback.

4

Hardware Validation

Validated the full pipeline on Qualcomm Developer Cloud (QDC) using a Snapdragon 8 Elite reference device — confirmed Hexagon NPU initialization, sub-500ms end-to-end latency, ARCore caption placement, and LFM2-1.2B text enhancement in production conditions.

What's Next

AI Capabilities

  • OmniNeural-4B (VLM) — visual context awareness and speaker identification by face
  • Multi-language support via Nexa's translation models
  • Open-source contributions to the Nexa SDK Flutter plugin

Platform Targets

  • Samsung Galaxy XR — optimization for upcoming XR headset
  • Apple XR glasses — spatial captioning in a glasses form factor

Interested in XR Accessibility Solutions?

Let's discuss how to make your virtual and augmented reality applications accessible to all users.