Labs
WebAI: chat with this site
A privacy-first, client-side chat that runs a WebGPU model via WebLLM and builds embeddings on-device to answer questions about my resume and portfolio.
Answers are generated locally by a model running under WebLLM (WebGPU) plus a small retrieval index. No data is sent to a server.
Best on desktop or modern mobile browsers with WebGPU.
Ask about the resume or portfolio highlights.
How it works
- Content chunks live in /ai/webai-chunks.json.
- A local embedder (@xenova/transformers) builds vectors in your browser.
- Retrieval ranks the most relevant chunks and feeds them to WebLLM (a sketch follows this list).
- Nothing leaves your device; inference is GPU-accelerated when available and falls back to WASM.
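A minimal sketch of that loop in TypeScript, not the site's actual code: the shape of the index entries ({ text, vector }), the MiniLM embedding model id, and the WebLLM model id are all assumptions; only the /ai/webai-chunks.json path, @xenova/transformers, and WebLLM come from the description above.

```ts
import { pipeline } from '@xenova/transformers';
import { CreateMLCEngine } from '@mlc-ai/web-llm';

// Assumed index entry shape; vectors are assumed L2-normalized.
interface Chunk {
  text: string;
  vector: number[];
}

// With normalized vectors, cosine similarity reduces to a dot product.
function dot(a: ArrayLike<number>, b: ArrayLike<number>): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

export async function answer(question: string): Promise<string> {
  // Load the precomputed index (built at deploy time; cache this in real code).
  const chunks: Chunk[] = await (await fetch('/ai/webai-chunks.json')).json();

  // Embed the query in-browser with the same model used at build time.
  const embed = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
  const q = await embed(question, { pooling: 'mean', normalize: true });
  const qVec = q.data as Float32Array;

  // Rank every chunk by similarity and keep the top few as context.
  const top = chunks
    .map((c) => ({ c, score: dot(qVec, c.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, 4);

  // Generate locally; the model id stands in for whatever the page loads.
  const engine = await CreateMLCEngine('Llama-3.2-3B-Instruct-q4f16_1-MLC');
  const reply = await engine.chat.completions.create({
    messages: [
      { role: 'system', content: 'Answer using only the provided context.' },
      {
        role: 'user',
        content: `Context:\n${top.map((t) => t.c.text).join('\n---\n')}\n\nQuestion: ${question}`,
      },
    ],
  });
  return reply.choices[0].message.content ?? '';
}
```

For an index this small, a linear scan over every vector is fine; an approximate nearest-neighbor structure only pays off at much larger corpus sizes.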
Performance tradeoffs
- Model size vs speed: 3B/4B models load fast and answer quickly; 7B improves quality but costs bandwidth and startup time.
- WebGPU vs WASM: WebGPU is fastest; the WASM fallback is just as private but slower, especially on first-token latency (see the capability check after this list).
- Embeddings, build-time vs client-time: precomputing vectors at build time keeps first-answer latency low; embedding on the fly adds seconds on a cold start (a build-step sketch follows the list).
- Chunking: 300–500 token chunks balance retrieval recall against vector payload size; fewer chunks means a faster index load.
- Payload split: Ship per-section indexes (e.g., resume-only) to keep initial downloads small.
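A sketch of the capability check behind the WebGPU-vs-WASM item, assuming the page gates backends at startup. hasWebGPU and the backend flag are hypothetical names, and which component actually falls back to WASM (the embedder, the generator, or both) isn't specified above.

```ts
// Hypothetical startup check: prefer WebGPU, otherwise stay on the WASM path.
async function hasWebGPU(): Promise<boolean> {
  // navigator.gpu exists only in WebGPU-capable browsers; requestAdapter()
  // can still resolve to null (e.g., blocklisted drivers), so check both.
  const gpu = (navigator as any).gpu; // typed via @webgpu/types in real code
  if (!gpu) return false;
  try {
    return (await gpu.requestAdapter()) !== null;
  } catch {
    return false;
  }
}

// 'webgpu' enables the fast path; 'wasm' keeps everything on-device but slower.
const backend: 'webgpu' | 'wasm' = (await hasWebGPU()) ? 'webgpu' : 'wasm';
console.log(`running inference on ${backend}`);
```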
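And a sketch of the build-time embedding step from the embeddings tradeoff: it writes the same /ai/webai-chunks.json index the client loads. The source path content/chunks.json, the public/ output directory, and the MiniLM model id are assumptions standing in for whatever the real build uses.

```ts
// Hypothetical Node build step: embed every chunk once at deploy time so the
// browser only has to embed the user's query.
import { pipeline } from '@xenova/transformers';
import { readFile, writeFile } from 'node:fs/promises';

async function buildIndex(): Promise<void> {
  // One chunk per entry; 'content/chunks.json' is an assumed source path.
  const texts: string[] = JSON.parse(await readFile('content/chunks.json', 'utf8'));

  // Must match the model the client uses to embed queries.
  const embed = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

  const index = [];
  for (const text of texts) {
    // normalize: true lets the client rank with a plain dot product.
    const t = await embed(text, { pooling: 'mean', normalize: true });
    index.push({ text, vector: Array.from(t.data as Float32Array) });
  }

  // Served to the browser as /ai/webai-chunks.json.
  await writeFile('public/ai/webai-chunks.json', JSON.stringify(index));
}

buildIndex();
```

Splitting this output per section (e.g., a resume-only index) is what keeps the initial download small, as the last tradeoff above suggests.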