Why we built Live Captions XR

4 min read
By Craig Merry
Accessibility AR AI LiveCaptionsXR

For as long as I can remember, I’ve wanted to put on a pair of glasses and see who’s speaking—live captions, in the air, exactly where voices come from.

The problem: captions without context

I was born mostly deaf and have worn hearing aids most of my life. Traditional solutions never quite fulfilled the vision above. In a noisy coffee shop, I might try to follow a lively group discussion with hearing aids or a live‑caption app on my phone. But those apps give me a flat transcript—a wall of text detached from the environment. I can read what was said, but I can’t tell who said it or where they are. It’s disorienting and far from the rich, spatial experience hearing people have.

Hearing is not just about volume; it’s about context, directionality, and clarity amidst noise. Hearing aids help tremendously, but they don’t provide the spatial cues that are crucial for understanding conversations in dynamic settings.

From tinkering to XR

I’ve built various apps over the years, from simple mobile tools to computer‑vision projects. My machine‑learning journey started with TensorFlow Lite experiments and pre‑trained models for object detection and speech recognition. Progress in VR hardware, along with the near‑future promise of Augmented Reality (AR) and Mixed Reality (MR) glasses under the Extended Reality (XR) umbrella, made me increasingly curious about how AR could enhance real‑world experiences. Building Bicycle Radar, an app that uses computer vision to detect approaching vehicles while cycling, deepened that curiosity.

When Google released their Gemma on‑device multimodal models alongside a Kaggle competition, I immediately saw a chance to combine accessibility, machine learning, and AR. I envisioned an app that could provide real‑time captions anchored in 3D space, letting users like me see who is speaking and where they’re located in the environment.

I decided to build Live Captions XR, an app that uses AR to display real‑time captions in the physical space where speech occurs. Conversations could feel natural and intuitive again, restoring some of the situational awareness hearing people take for granted. During the competition, I also connected with Sasha Denisov, a Flutter engineer in Germany who maintains an open‑source package for embedding AI models in Flutter apps. With his help, I integrated Gemma for efficient on‑device speech recognition and captioning.

Building Live Captions XR: giving sight to sound

Building Live Captions XR was an adventure powered by bleeding‑edge tech arriving just in time. Gemma’s on‑device speech models were fresh, and ARCore had matured enough to do reliable spatial tracking. I spent many late nights tuning stereo‑mic direction‑of‑arrival algorithms (the GCC‑PHAT trick from robotics) and wrestling with Flutter’s cross‑platform quirks to make AR work on both Android and iOS.

How it works (at a glance)

  • On‑device speech recognition with Gemma for low‑latency captions and privacy.
  • Direction‑of‑arrival estimation from stereo microphones using GCC‑PHAT to infer where speech is coming from (see the sketch after this list).
  • AR anchoring with ARCore on Android (and ARKit on iOS) to place captions in 3D space where speech originates.
  • Flutter for the cross‑platform UI glue and performance‑sensitive rendering.
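To make the direction‑of‑arrival step concrete, here is a minimal sketch of the GCC‑PHAT idea. The app itself runs this kind of processing in Flutter/Dart on live microphone buffers; the example below uses Python and NumPy purely for illustration, and the function names (gcc_phat, doa_azimuth, caption_offset), the 16 kHz sample rate, and the ~14 cm microphone spacing are assumptions for the sketch, not values taken from the app.

```python
import numpy as np


def gcc_phat(sig_l, sig_r, fs, max_tau=None, interp=16):
    """Estimate the delay (seconds) between two mic channels with GCC-PHAT.

    A positive result means the left channel lags the right one, i.e. the
    sound reached the right microphone first.
    """
    n = sig_l.shape[0] + sig_r.shape[0]
    # Cross-power spectrum of the two channels.
    spec_l = np.fft.rfft(sig_l, n=n)
    spec_r = np.fft.rfft(sig_r, n=n)
    cross = spec_l * np.conj(spec_r)
    # Phase transform: drop the magnitude so the correlation peak stays
    # sharp even in reverberant rooms.
    cross /= np.abs(cross) + 1e-15
    cc = np.fft.irfft(cross, n=interp * n)
    max_shift = interp * n // 2
    if max_tau is not None:
        max_shift = min(int(interp * fs * max_tau), max_shift)
    # Re-centre the correlation so lag zero sits in the middle.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / float(interp * fs)


def doa_azimuth(tau, mic_spacing_m, speed_of_sound=343.0):
    """Bearing in radians: 0 is straight ahead, positive is towards the right mic."""
    return float(np.arcsin(np.clip(speed_of_sound * tau / mic_spacing_m, -1.0, 1.0)))


def caption_offset(azimuth, distance_m=1.5):
    """(forward, right) offset in metres for placing a caption; an AR session
    would turn this device-frame offset into a world-space anchor."""
    return distance_m * np.cos(azimuth), distance_m * np.sin(azimuth)


if __name__ == "__main__":
    fs, mic_spacing = 16_000, 0.14                 # assumed sample rate and mic spacing
    rng = np.random.default_rng(0)
    frame = rng.standard_normal(fs // 10)          # 100 ms noise burst as stand-in speech
    left, right = frame, np.roll(frame, 5)         # right mic hears it 5 samples later
    tau = gcc_phat(left, right, fs, max_tau=mic_spacing / 343.0)
    azimuth = doa_azimuth(tau, mic_spacing)
    print(f"delay {tau * 1e6:.1f} us, bearing {np.degrees(azimuth):.1f} deg")
    # Expect roughly -312 us and about -50 deg: the speaker is off to the left.
```

In practice an estimate like this runs on short audio frames and gets smoothed over time before anything is anchored, but the core trick is exactly what the phase transform buys you: correlate the two channels, keep only the phase, and read the bearing off the correlation peak even in a noisy, echoey room.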

The moment it clicked

The first time everything came together, I stood in my living room, spoke toward my laptop, and watched my words appear on the emulator screen as a floating caption anchored exactly where I was standing. It felt like giving sight to sound. For the first time, conversation wasn’t a disembodied transcript; it was spatial. As a deaf person, I found that transformative: I could see voices and tell who was talking with a glance at the caption positions.

What’s next?

Live Captions XR is early, but the potential is enormous. The code is open source and I welcome feedback and pull requests. I envision XR glasses providing real‑time captions seamlessly integrated into daily life, enhancing communication for deaf and hard‑of‑hearing people. Beyond accessibility, this could help language learners, professionals in noisy environments, and anyone who wants clearer understanding of spoken content.

I’m eagerly awaiting rumored Apple XR glasses and Samsung’s XR glasses—and I’d love Live Captions XR to be part of that experience. As XR evolves, I’m excited to keep building, testing, and refining what spatial captions can be.

The idea of using XR to enhance accessibility is just the beginning. Augmented reality can change how we interact with information and each other. By embedding digital content into our physical world, we can create intuitive, engaging experiences that benefit everyone—including AI systems learning to navigate our world.

Acknowledgements

I want to express my gratitude to Sasha Denisov for his invaluable assistance in integrating the Gemma model into Live Captions XR. His expertise in Flutter development and open‑source contributions have been instrumental in bringing this project to life. Thank you, Sasha, for your dedication and support!