The problem
Flat transcripts don’t work in 3D. In a headset, audio can come from anywhere, and a single scrolling caption tells you what was said but never who said it or where they are. I was born mostly deaf — this is the gap I live with.
The solution
LiveCaptionsXR positions captions in 3D space, near whoever is speaking, and lets them follow people as they move. It’s a production Flutter app with on-device speech recognition — no audio ever leaves the device.
- ARCore / ARKit spatial anchoring
- GCC-PHAT stereo direction-of-arrival, fused with vision and IMU via a Kalman filter
- Speaker diarization so captions attach to the right person
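To make the direction-of-arrival bullet concrete, here is a minimal sketch of GCC-PHAT on one stereo frame, followed by a scalar Kalman measurement update that could blend the audio angle with another cue. It is not the app's actual code: the function names, the 0.15 m microphone spacing, the 16 kHz sample rate, and the variances are all assumptions chosen for illustration. The naive DFT only keeps the example dependency-free; a real pipeline would use an FFT.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def dft(x, inverse=False):
    """Naive O(n^2) DFT, adequate for one short illustrative frame."""
    n = len(x)
    sign = 1 if inverse else -1
    scale = 1.0 / n if inverse else 1.0
    return [
        scale * sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def gcc_phat_lag(left, right, max_lag):
    """Lag in samples by which `right` trails `left` (negative if it leads)."""
    n = len(left) + len(right)  # zero-pad so the correlation is not circular
    spec_l = dft(list(left) + [0.0] * (n - len(left)))
    spec_r = dft(list(right) + [0.0] * (n - len(right)))
    weighted = []
    for a, b in zip(spec_l, spec_r):
        cross = b * a.conjugate()
        mag = abs(cross)
        # PHAT weighting: keep only the phase of each frequency bin.
        weighted.append(cross / mag if mag > 1e-12 else 0j)
    corr = dft(weighted, inverse=True)
    # Negative lags live at the end of the circular correlation buffer.
    return max(range(-max_lag, max_lag + 1), key=lambda lag: corr[lag % n].real)


def lag_to_azimuth(lag, sample_rate, mic_spacing):
    """Far-field angle in radians; positive means the source is nearer the left mic."""
    ratio = SPEED_OF_SOUND * (lag / sample_rate) / mic_spacing
    return math.asin(max(-1.0, min(1.0, ratio)))


def kalman_update(estimate, variance, measurement, measurement_variance):
    """One scalar Kalman measurement update; returns (estimate, variance)."""
    gain = variance / (variance + measurement_variance)
    return estimate + gain * (measurement - estimate), (1.0 - gain) * variance


# A short click that reaches the right microphone 3 samples after the left one.
click = [1.0, -0.6, 0.3]
left = [0.0] * 4 + click + [0.0] * 25
right = [0.0] * 7 + click + [0.0] * 22
lag = gcc_phat_lag(left, right, max_lag=7)
audio_azimuth = lag_to_azimuth(lag, sample_rate=16000, mic_spacing=0.15)

# Blend the audio angle into a prior estimate (hypothetical variances).
fused_azimuth, fused_variance = kalman_update(0.0, 1.0, audio_azimuth, 0.05)
```

With these inputs the recovered lag is 3 samples, which maps to an angle of roughly 25 degrees toward the left microphone. The same `kalman_update` call can be repeated for each additional measurement (for example a vision-based bearing), with a smaller measurement variance giving that cue more weight.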
On-device AI
Speech recognition runs on the Qualcomm Hexagon NPU via the Nexa SDK (Parakeet TDT), falling back to Whisper GGML on the CPU for other Android devices and to Apple Speech on iOS. Moving to the NPU roughly halved latency and made all-day headset use thermally viable.
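The backend choice described here can be sketched as a small selection function. Everything in it is hypothetical: the `Device` type, the `has_hexagon_npu` flag, and the backend labels are illustrative names, not identifiers from the app or from the Nexa SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    platform: str           # "android" or "ios"
    has_hexagon_npu: bool   # hypothetical capability flag


def pick_asr_backend(device: Device) -> str:
    """Choose an ASR backend in the order the text describes."""
    if device.platform == "ios":
        return "apple_speech"
    if device.platform == "android":
        # Prefer the NPU path when available; otherwise fall back to the CPU.
        return "parakeet_tdt_npu" if device.has_hexagon_npu else "whisper_ggml_cpu"
    raise ValueError(f"unsupported platform: {device.platform}")
```

Keeping the decision in one pure function like this makes the fallback order easy to test without real hardware.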
What I learned
Beta testers — Deaf friends — instinctively turned toward wherever a caption appeared, regaining spatial awareness they described as transformative. That single observation drove every iteration on positioning, size, and contrast.