LiveCaptionsXR · Craig Merry

The problem

Flat transcripts don’t work in 3D. In a headset, audio can come from anywhere, and a single scrolling caption tells you what was said but never who said it or where they are. I was born mostly deaf. This is the gap I live with.

The solution

LiveCaptionsXR positions captions in 3D space, near whoever is speaking, and lets them follow people as they move. It’s a production Flutter app with on-device speech recognition: no audio ever leaves the device.

ARCore / ARKit spatial anchoring
GCC-PHAT stereo direction-of-arrival, fused with vision and IMU via a Kalman filter
Speaker diarization so captions attach to the right person
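
To make the audio side of that list concrete, here is a minimal Python sketch of GCC-PHAT delay estimation between two microphones, followed by a one-dimensional Kalman measurement update of the kind that could blend an audio bearing with a vision bearing. It is an illustration under stated assumptions, not the app's code: the function names, the 343 m/s speed of sound, the sign convention, and the scalar state are all hypothetical, and the naive DFT is there only to keep the example dependency-free.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def dft(samples, inverse=False):
    """Naive O(n^2) DFT, dependency-free for the demo; use an FFT in practice."""
    n = len(samples)
    sign = 1 if inverse else -1
    scale = 1.0 / n if inverse else 1.0
    return [
        scale * sum(s * cmath.exp(sign * 2j * math.pi * k * t / n)
                    for t, s in enumerate(samples))
        for k in range(n)
    ]


def gcc_phat_delay(left, right, sample_rate):
    """Delay of `right` relative to `left`, in seconds.

    Positive means the sound reached the left microphone first.
    """
    n = len(left) + len(right)  # zero-pad so the correlation does not wrap
    spec_l = dft(list(left) + [0.0] * (n - len(left)))
    spec_r = dft(list(right) + [0.0] * (n - len(right)))
    cross = [r * l.conjugate() for l, r in zip(spec_l, spec_r)]
    # PHAT weighting: keep only the phase of each frequency bin.
    whitened = [c / abs(c) if abs(c) > 1e-12 else 0j for c in cross]
    corr = [v.real for v in dft(whitened, inverse=True)]
    peak = max(range(n), key=corr.__getitem__)
    lag = peak if peak < n // 2 else peak - n  # upper half holds negative lags
    return lag / sample_rate


def bearing_from_delay(delay_s, mic_spacing_m):
    """Far-field bearing in radians; positive is toward the left microphone."""
    ratio = SPEED_OF_SOUND * delay_s / mic_spacing_m
    return math.asin(max(-1.0, min(1.0, ratio)))  # clamp against noisy delays


def kalman_update(estimate, variance, measurement, measurement_variance):
    """Scalar Kalman measurement update; returns (new_estimate, new_variance)."""
    gain = variance / (variance + measurement_variance)
    new_estimate = estimate + gain * (measurement - estimate)
    return new_estimate, (1.0 - gain) * variance
```

As a usage example, with a 48 kHz stream and microphones 15 cm apart (both hypothetical figures), a three-sample delay works out to a bearing of roughly 8 degrees off-center. Feeding that bearing and a vision-derived bearing through `kalman_update` in turn weights each source by its variance, so the noisier sensor moves the estimate less.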

On-device AI

LiveCaptionsXR runs ASR on the Qualcomm Hexagon NPU via the Nexa SDK (Parakeet TDT), falling back to Whisper GGML on the CPU on other Android devices and to Apple Speech on iOS. Moving to the NPU roughly halved latency and made all-day headset use thermally viable.

What I learned

Beta testers, who were Deaf friends, instinctively turned toward wherever a caption appeared, regaining a spatial awareness they described as transformative. That single observation drove every iteration on positioning, size, and contrast.

Want something like this, built to ship?