Labs
WebAI: chat with this site
A privacy-first, client-side chat that runs a WebGPU model via WebLLM and builds embeddings on-device to answer questions about my resume and portfolio.
Answers are generated locally by a model running under WebLLM (WebGPU) plus a small retrieval index. No data is sent to a server.
Best on desktop or modern mobile browsers with WebGPU.
Ask about the resume or portfolio highlights.
How it works
- Content chunks live in /ai/webai-chunks.json.
- A local embedder (@xenova/transformers) builds vectors in your browser.
- Retrieval ranks the most relevant chunks and feeds them to WebLLM (a sketch follows this list).
- Nothing leaves your device; inference is GPU-accelerated when available and falls back to WASM.
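A minimal sketch of that loop in TypeScript, not the site's actual code: the shape of the index entries ({ text, vector }), the MiniLM embedding model id, and the WebLLM model id are all assumptions; only the /ai/webai-chunks.json path, @xenova/transformers, and WebLLM come from the description above.

```ts
import { pipeline } from '@xenova/transformers';
import { CreateMLCEngine } from '@mlc-ai/web-llm';

// Assumed index entry shape; vectors are assumed L2-normalized.
interface Chunk {
  text: string;
  vector: number[];
}

// With normalized vectors, cosine similarity reduces to a dot product.
function dot(a: ArrayLike<number>, b: ArrayLike<number>): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

export async function answer(question: string): Promise<string> {
  // Load the precomputed index (built at deploy time; cache this in real code).
  const chunks: Chunk[] = await (await fetch('/ai/webai-chunks.json')).json();

  // Embed the query in-browser with the same model used at build time.
  const embed = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
  const q = await embed(question, { pooling: 'mean', normalize: true });
  const qVec = q.data as Float32Array;

  // Rank every chunk by similarity and keep the top few as context.
  const top = chunks
    .map((c) => ({ c, score: dot(qVec, c.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, 4);

  // Generate locally; the model id stands in for whatever the page loads.
  const engine = await CreateMLCEngine('Llama-3.2-3B-Instruct-q4f16_1-MLC');
  const reply = await engine.chat.completions.create({
    messages: [
      { role: 'system', content: 'Answer using only the provided context.' },
      {
        role: 'user',
        content: `Context:\n${top.map((t) => t.c.text).join('\n---\n')}\n\nQuestion: ${question}`,
      },
    ],
  });
  return reply.choices[0].message.content ?? '';
}
```

For an index this small, a linear scan over every vector is fine; an approximate nearest-neighbor structure only pays off at much larger corpus sizes.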
Performance tradeoffs
- Model size vs speed: 3B/4B models load fast and answer quickly; 7B improves quality but costs bandwidth and startup time.
- WebGPU vs WASM: WebGPU is fastest; the WASM fallback is just as private but slower, especially on first-token latency (see the capability check after this list).
- Embeddings, build-time vs client-time: precomputing vectors at build time keeps first-answer latency low; embedding on the fly adds seconds on a cold start (a build-step sketch follows the list).
- Chunking: 300–500 token chunks balance retrieval recall against vector payload size; fewer chunks means a faster index load.
- Payload split: Ship per-section indexes (e.g., resume-only) to keep initial downloads small.
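A sketch of the capability check behind the WebGPU-vs-WASM item, assuming the page gates backends at startup. hasWebGPU and the backend flag are hypothetical names, and which component actually falls back to WASM (the embedder, the generator, or both) isn't specified above.

```ts
// Hypothetical startup check: prefer WebGPU, otherwise stay on the WASM path.
async function hasWebGPU(): Promise<boolean> {
  // navigator.gpu exists only in WebGPU-capable browsers; requestAdapter()
  // can still resolve to null (e.g., blocklisted drivers), so check both.
  const gpu = (navigator as any).gpu; // typed via @webgpu/types in real code
  if (!gpu) return false;
  try {
    return (await gpu.requestAdapter()) !== null;
  } catch {
    return false;
  }
}

// 'webgpu' enables the fast path; 'wasm' keeps everything on-device but slower.
const backend: 'webgpu' | 'wasm' = (await hasWebGPU()) ? 'webgpu' : 'wasm';
console.log(`running inference on ${backend}`);
```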
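And a sketch of the build-time embedding step from the embeddings tradeoff: it writes the same /ai/webai-chunks.json index the client loads. The source path content/chunks.json, the public/ output directory, and the MiniLM model id are assumptions standing in for whatever the real build uses.

```ts
// Hypothetical Node build step: embed every chunk once at deploy time so the
// browser only has to embed the user's query.
import { pipeline } from '@xenova/transformers';
import { readFile, writeFile } from 'node:fs/promises';

async function buildIndex(): Promise<void> {
  // One chunk per entry; 'content/chunks.json' is an assumed source path.
  const texts: string[] = JSON.parse(await readFile('content/chunks.json', 'utf8'));

  // Must match the model the client uses to embed queries.
  const embed = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

  const index = [];
  for (const text of texts) {
    // normalize: true lets the client rank with a plain dot product.
    const t = await embed(text, { pooling: 'mean', normalize: true });
    index.push({ text, vector: Array.from(t.data as Float32Array) });
  }

  // Served to the browser as /ai/webai-chunks.json.
  await writeFile('public/ai/webai-chunks.json', JSON.stringify(index));
}

buildIndex();
```

Splitting this output per section (e.g., a resume-only index) is what keeps the initial download small, as the last tradeoff above suggests.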