Gemini 3 Thinking vs Gemini 3 Fast: A Live Captioning XR Single Shot Showdown

4 min read By Craig Merry
Gemini XR LiveCaptionsXR AI Spatial Computing SingleShot

Today, I decided to run an experiment after seeing a social media post showing examples of Gemini 3 being used to generate single-shot Canvas applications.

I haven’t worked on the Live Captions XR app as a Web offering, only as an iOS and Android (with Android XR) app. I didn’t think a web browser would be able to handle the requirements for a minimum viable user experience.

The Setup

The goal of Live Captions XR is to place captions exactly where the sound is coming from. This requires not just transcribing text, but doing it with incredibly low latency and high accuracy so the text anchors to the speaker in real-time.

I used an iPhone 16 Pro Max on the latest iOS to record the video and to enter the initial prompts for both Gemini 3 Thinking and Gemini 3 Fast.

I had the Gemini app installed on my iPhone 16 Pro Max, updated to the latest version.

The Prompt: “Build a XR Web App for Live Captions. Also try to identify humans and where they are so that we can place captions next to the person speaking.”

For this test, I set up a “Canvas” application—a virtual interface floating in my XR space—that pipes audio data to the Gemini API and renders the resulting captions at the estimated coordinates of the sound source.
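
For readers curious about the rough shape of such an app, here is a minimal sketch of the caption pipeline: it uses the browser’s SpeechRecognition API as a stand-in for whatever transcription path the generated apps actually use, and three.js to render a caption at an estimated position. This is my own illustration of the pattern, not the code either model produced; the speaker-position estimate in particular is a hypothetical placeholder.

```typescript
// Minimal sketch: browser speech recognition feeding captions into a three.js scene.
// Assumes Chrome's webkitSpeechRecognition and three.js; not the generated app's actual code.
import * as THREE from "three";

const scene = new THREE.Scene();

// Render a caption as a sprite at an estimated world position.
function placeCaption(text: string, position: THREE.Vector3): void {
  const canvas = document.createElement("canvas");
  canvas.width = 512;
  canvas.height = 128;
  const ctx = canvas.getContext("2d")!;
  ctx.fillStyle = "rgba(0, 0, 0, 0.6)";
  ctx.fillRect(0, 0, canvas.width, canvas.height);
  ctx.fillStyle = "white";
  ctx.font = "32px sans-serif";
  ctx.fillText(text, 16, 72);

  const sprite = new THREE.Sprite(
    new THREE.SpriteMaterial({ map: new THREE.CanvasTexture(canvas) })
  );
  sprite.position.copy(position);
  sprite.scale.set(1.0, 0.25, 1);
  scene.add(sprite);
}

// Hypothetical speaker-position estimate; a real app would derive this from
// audio direction-of-arrival or person detection in the camera frame.
function estimateSpeakerPosition(): THREE.Vector3 {
  return new THREE.Vector3(0, 1.5, -2);
}

// Wire up continuous speech recognition (Chrome exposes it with a webkit prefix).
const SpeechRecognitionCtor =
  (window as any).SpeechRecognition ?? (window as any).webkitSpeechRecognition;
const recognition = new SpeechRecognitionCtor();
recognition.continuous = true;
recognition.interimResults = true;
recognition.onresult = (event: any) => {
  const latest = event.results[event.results.length - 1][0].transcript;
  placeCaption(latest, estimateSpeakerPosition());
};
recognition.start();
```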

Results

Gemini 3 Fast Demo:

In this video, the results are pretty close to what I’d expect from a Canvas-driven application given the current state of Vibe Coding and LLMs. The experience is a boilerplate application; with some more intensive desktop IDE Vibe Coding, it could become a more pleasing demo.

In XR/VR/AR and closed captioning, latency is the enemy of immersion. If a person speaks and the caption appears 500ms later, the illusion breaks. This is the key measure of performance when I test Live Captions XR.
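
A crude way to probe the render half of that budget in a web prototype (again my own sketch, not part of either generated app, and assuming the same SpeechRecognition setup as above) is to timestamp when a transcript arrives and measure how long until the caption is actually painted. It doesn’t capture the recognizer’s own speech-to-transcript delay, which depends entirely on the transcription backend.

```typescript
// Rough latency probe: timestamp when a transcript arrives,
// then measure how long until the caption is actually painted.
// Assumes `recognition`, `placeCaption`, and `estimateSpeakerPosition` from the sketch above.
let transcriptArrivedAt = 0;

recognition.onresult = (event: any) => {
  transcriptArrivedAt = performance.now();
  const latest = event.results[event.results.length - 1][0].transcript;
  placeCaption(latest, estimateSpeakerPosition());

  // Measure after the next paint; anything much past a few hundred ms feels laggy in XR.
  requestAnimationFrame(() => {
    const latencyMs = performance.now() - transcriptArrivedAt;
    console.log(`caption render latency: ${latencyMs.toFixed(1)} ms`);
  });
};
```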

Context and Accuracy: Decent captioning in a quiet environment with minimal background noise. No sense of who is speaking or where they are.

Spatial Feel: The captions are rendered from a generic, non-camera perspective in an animated space, and they are not placed in the correct location.

Gemini 3 Thinking Demo:

This is why I wrote this post: this is the next level of Vibe Coding. It’s not just a boilerplate application; it’s a clear step up in what’s possible with Vibe Coding and LLMs. I’m very impressed with how cohesively and effectively the single shot pulled the requirements together. Yes, the app is rough, but the concept is almost there as a complete boilerplate - on the web!

Context and Accuracy: It actually does an admirable job of transcribing the text.

Spatial Feel: The app does attempt to place a captioning box in the frame where the speaker could be, but that could just be bias toward whatever microphone it’s utilizing (the sketch below shows roughly how little a web app has to work with here).
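
For context on what “microphone bias” could mean in practice: without camera-based person detection, a web app can really only guess left/right from channel energy, something like the sketch below. This is my own illustration, assuming a stereo getUserMedia stream; most phone microphones deliver effectively mono audio, in which case the guess degenerates to “center.”

```typescript
// Crude left/right speaker guess from stereo channel energy via the Web Audio API.
// With a mono phone mic both channels are identical, so the guess collapses to "center".
async function guessSpeakerSide(): Promise<"left" | "right" | "center"> {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const ctx = new AudioContext();
  const source = ctx.createMediaStreamSource(stream);
  const splitter = ctx.createChannelSplitter(2);
  const left = ctx.createAnalyser();
  const right = ctx.createAnalyser();
  source.connect(splitter);
  splitter.connect(left, 0);
  splitter.connect(right, 1);

  // RMS energy of the most recent audio frame for one channel.
  const rms = (analyser: AnalyserNode): number => {
    const buf = new Float32Array(analyser.fftSize);
    analyser.getFloatTimeDomainData(buf);
    return Math.sqrt(buf.reduce((sum, v) => sum + v * v, 0) / buf.length);
  };

  // Sample once after a short settle time; a real app would average over a window.
  await new Promise((resolve) => setTimeout(resolve, 500));
  const l = rms(left);
  const r = rms(right);
  if (Math.abs(l - r) < 0.01) return "center";
  return l > r ? "left" : "right";
}
```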

Conclusion

Google has completely changed their strategic outlook in the last few years. Earlier this year, the progress they had made with Gemini 2.5 and derivatives like Gemma 3n was eye-opening. Putting “AI” everywhere and seeing what the use cases are for consumers was a bit of a head-scratcher two years ago, but now it’s all seemingly starting to percolate into concrete products and new combinations. I’m absolutely excited for Google, its staffers, and those in the ecosystem, and I’m definitely rebalancing my research toward Google services when the choice is available.

Future Work

I plan on using this as an initial development platform for the Web version of Live Captions XR. It doesn’t cost anything to use, and it’s a great way to demonstrate the potential with an XR headset.

Gemini 3 Thinking: https://gemini.google.com/share/868762a092f0

Gemini 3 Fast: https://gemini.google.com/share/3ec06925d2a7