Ranking 50,000 recipes on a CPU that embeds 16 phrases a second

PantryAtlas is a recipe app that runs entirely on a Raspberry Pi 5 — pantry-in, ranked-recipes-out, fully offline. The launch post covers what it is and how to get it. This one is about the constraint that shaped the whole thing, and the three or four decisions that fell out of it.

The constraint that designed the app

The natural way to match a pantry against recipes is semantic: embed the user’s ingredients, embed each recipe’s, compare vectors, rank by similarity. PantryAtlas uses bge-m3 for multilingual embeddings, which is exactly right for the job — tomate and tomato land in the same neighborhood.

The problem is where it runs. On the Pi 5’s CPU, bge-m3 embeds roughly 16 short strings per second. That’s a hard ceiling on throughput, not a tuning opportunity. A pantry of a dozen items against tens of thousands of recipes, each needing its missing ingredients embedded for substitution scoring, adds up to thousands of embeddings. At 16 a second, the honest answer to “what can I make?” would be “ask me again in a few minutes.” Unusable.
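To put a number on "a few minutes", here is the back-of-envelope arithmetic. The 3,000 figure is an assumption for illustration, standing in for "thousands"; only the 16 per second comes from the measurement above.

```python
# Back-of-envelope: how long would the user wait if every embedding blocked the UI?
EMBEDS_PER_SECOND = 16       # bge-m3 throughput on the Pi 5 CPU, as measured above
needed_embeddings = 3_000    # ASSUMPTION: an illustrative stand-in for "thousands"

wait_seconds = needed_embeddings / EMBEDS_PER_SECOND
wait_minutes = wait_seconds / 60

print(f"{wait_seconds} s, roughly {wait_minutes:.1f} minutes")
```

Even a generous cache would only shave that down, not make it interactive.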

So the architecture isn’t “make embedding fast.” It’s “never make the user wait on embedding.”

Instant → refine → swaps

The screen resolves in three waves, each one later and more expensive to compute than the one before:

Instant (~1 second, zero embeddings). When you ask for recipes, the server ranks in fast mode: pure coverage scoring — how many of the recipe’s ingredients you already have — with no semantic work at all. The list paints immediately, best-coverage first, with a small “refining…” chip.
Refine (background, batched). A second call embeds every unique missing ingredient across the candidate set in one batched embedding call, computes the real substitution scores, and re-ranks. The list re-orders under you as the real numbers arrive.
Swaps (lazy, on demand). Per-recipe substitution suggestions (“no buttermilk? use milk + lemon”) are computed only when you expand a card, with a cosine floor so it doesn’t suggest nonsense.
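The three waves can be sketched roughly as below. This is a minimal illustration, not the shipped code: the function names, the dict-of-sets recipe shape, and the 0.6 cosine floor are all assumptions made for the example.

```python
import math
from typing import Dict, List, Optional, Sequence, Set

Vector = Sequence[float]


def coverage(recipe_ingredients: Set[str], pantry: Set[str]) -> float:
    """Fraction of the recipe's ingredients that are already in the pantry."""
    if not recipe_ingredients:
        return 0.0
    return len(recipe_ingredients & pantry) / len(recipe_ingredients)


def instant_rank(recipes: Dict[str, Set[str]], pantry: Set[str]) -> List[str]:
    """Wave 1: order recipes by coverage alone, with zero embeddings."""
    return sorted(recipes, key=lambda name: coverage(recipes[name], pantry), reverse=True)


def unique_missing(recipes: Dict[str, Set[str]], pantry: Set[str]) -> List[str]:
    """Wave 2 input: each missing ingredient exactly once, ready for one batched embed call."""
    missing: Set[str] = set()
    for ingredients in recipes.values():
        missing |= ingredients - pantry
    return sorted(missing)


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_swap(missing_vec: Vector, pantry_vecs: Dict[str, Vector],
              floor: float = 0.6) -> Optional[str]:
    """Wave 3: the closest pantry item, or None if nothing reaches the cosine floor.

    ASSUMPTION: 0.6 is a placeholder; the real floor is not stated in this post.
    """
    best_name, best_sim = None, floor
    for name, vec in pantry_vecs.items():
        sim = cosine(missing_vec, vec)
        if sim >= best_sim:
            best_name, best_sim = name, sim
    return best_name
```

The point of `unique_missing` is that "buttermilk" gets embedded once even if forty candidate recipes are missing it, which is what keeps the refine pass to a single batch.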

The ranking formula the refine pass converges to:

score = 0.50·coverage
      + 0.20·expiration
      + 0.20·(1 − substitution)
      + 0.10·cultural_fit

There’s one detail I’m quietly proud of. In fast mode the substitution penalty is set to zero — optimistic. So the substitution term contributes a flat 0.20 and doesn’t affect the instant ordering. When refine computes the real penalty, it can only ever lower a score, never raise it. Which means the cards settle downward into a stable order — they never jump upward and yank a recipe out from under your cursor as you’re reading it. The motion is always in one direction, and it always converges. A janky “results reshuffling while you read” experience turned into a calm one, for free, by choosing the sign of the optimism deliberately.
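The optimism trick is easy to check in a few lines. This sketch uses the weights from the formula above and assumes every component, including the substitution penalty, is normalized to the range 0 to 1; the function names are mine, not the API's.

```python
def score(coverage: float, expiration: float, substitution: float, cultural_fit: float) -> float:
    """Ranking score with the weights from the formula above.

    ASSUMPTION: all four inputs lie in [0, 1], and substitution is a penalty.
    """
    return (0.50 * coverage
            + 0.20 * expiration
            + 0.20 * (1.0 - substitution)
            + 0.10 * cultural_fit)


def fast_score(coverage: float, expiration: float, cultural_fit: float) -> float:
    """Instant-mode score: the substitution penalty is optimistically zero."""
    return score(coverage, expiration, 0.0, cultural_fit)


# The same recipe before and after refine computes a real penalty.
before = fast_score(coverage=0.8, expiration=0.5, cultural_fit=1.0)
after = score(coverage=0.8, expiration=0.5, substitution=0.35, cultural_fit=1.0)
print(f"instant {before:.2f} -> refined {after:.2f}")
```

Because the penalty is non-negative and enters with a minus sign, the refined score can never exceed the instant one for the same recipe.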

Relaxed JSON beats grammar-constrained — on a Pi

PantryAtlas wraps Gemma 4 (via llama.cpp) for the bits that need a language model: parsing a photographed shelf, generating swap rationales. The textbook way to get structured output is grammar-constrained decoding — force the model to emit tokens that match a JSON schema.

On the Pi, grammar-constrained strict mode was painfully slow — minutes per batch under a tight JSON schema, enough to flip the economics of the whole feature. So the default is the opposite: let Gemma generate freely, then run a relaxed-JSON repair loop (up to two retries) that tolerates trailing commas, code fences, and the usual small-model JSON sins, and only repairs what’s broken. Strict grammar mode is still there as an opt-in — with a latency warning attached — but the fast, forgiving path is what ships. On constrained hardware, parse-and-repair beat constrain-and-wait.
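A parse-and-repair loop along these lines might look like the sketch below. It is an illustration under assumptions, not the shipped code: `generate` is a hypothetical stand-in for the llama.cpp call, and treating a "retry" as a re-prompt with the parse error attached is my reading of the loop, not something the runner is documented to do.

```python
import json
import re
from typing import Any, Callable

# Leading ```lang fence and trailing ``` fence around the payload.
FENCE = re.compile(r"^\s*```[a-zA-Z0-9_-]*\s*|\s*```\s*$")
# A comma directly before a closing brace or bracket.
# Caveat: this would also touch such a sequence inside a string value.
TRAILING_COMMA = re.compile(r",\s*([}\]])")


def relax(text: str) -> str:
    """Strip code fences and surrounding chatter, then drop trailing commas."""
    text = FENCE.sub("", text.strip())
    starts = [i for i in (text.find("{"), text.find("[")) if i != -1]
    end = max(text.rfind("}"), text.rfind("]"))
    if starts and end > min(starts):
        text = text[min(starts):end + 1]
    return TRAILING_COMMA.sub(r"\1", text)


def parse_relaxed(raw: str) -> Any:
    """Try strict JSON first; only repair when the strict parse fails."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return json.loads(relax(raw))


def parse_with_repair(generate: Callable[[str], str], prompt: str,
                      max_retries: int = 2) -> Any:
    """One free generation plus up to max_retries re-prompts.

    `generate` is HYPOTHETICAL: any callable that maps a prompt to model text.
    """
    attempt_prompt = prompt
    last_error = None
    for _ in range(1 + max_retries):
        raw = generate(attempt_prompt)
        try:
            return parse_relaxed(raw)
        except json.JSONDecodeError as err:
            last_error = err
            attempt_prompt = (f"{prompt}\n\nYour previous reply was not valid JSON "
                              f"({err.msg}). Reply with JSON only.")
    raise ValueError("model output could not be repaired into JSON") from last_error
```

The cheap path stays cheap: well-formed output costs one `json.loads`, and the regexes only run when that fails.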

The Gemma runner also auto-selects the model to fit the RAM it finds: E4B when there’s headroom, E2B when there isn’t, so the same code runs on an 8 GB Pi and a smaller box.
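The selection logic itself is tiny. In this sketch the 6 GB threshold and both function names are assumptions for illustration; the real cut-off is not given in this post.

```python
import os
from typing import Optional


def total_ram_gb() -> Optional[float]:
    """Total physical RAM in GB via POSIX sysconf, or None where that is unavailable."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024 ** 3
    except (AttributeError, ValueError, OSError):
        return None


def pick_gemma_variant(ram_gb: float, e4b_min_gb: float = 6.0) -> str:
    """Pick the larger model only when there is headroom for it.

    ASSUMPTION: 6.0 GB is a placeholder threshold, not the runner's actual value.
    """
    return "E4B" if ram_gb >= e4b_min_gb else "E2B"
```

On an 8 GB Pi, `pick_gemma_variant(8.0)` picks E4B; a 4 GB box falls back to E2B.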

Vision that fails gracefully

Photographing your shelf uses Gemma 4’s multimodal path. The honest status: the endpoint and the whole UI around it ship now; full live mmproj wiring is a follow-up. The important engineering choice is what happens when the multimodal model isn’t loaded — the endpoint returns a clean 503 vision_unavailable, and the photo-review sheet in the UI was built from day one to handle that 503 and 404 gracefully, showing “vision not configured” instead of breaking. The feature can land incrementally without the app ever showing the user a stack trace.
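The contract between the endpoint and the review sheet can be sketched in a framework-free way. Everything here except the 503 status, the `vision_unavailable` code, and the "vision not configured" message is hypothetical: the function names, the `{"error": ...}` body shape, and the `recognizer` callable are stand-ins for illustration.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Recognizer = Callable[[bytes], List[str]]


def handle_vision_request(image: bytes,
                          recognizer: Optional[Recognizer]) -> Tuple[int, Dict[str, Any]]:
    """Server side: answer with a clean 503 when no multimodal model is loaded."""
    if recognizer is None:
        return 503, {"error": "vision_unavailable"}
    return 200, {"items": recognizer(image)}


def review_sheet_message(status: int) -> Optional[str]:
    """Client side: map 'not available' statuses to a calm message instead of an error."""
    if status in (503, 404):
        return "vision not configured"
    return None  # nothing to show; render the recognized items instead
```

Handling 404 alongside 503 means the same UI works against a server build that does not have the route at all.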

Offline that actually syncs

It’s a PWA, so “offline” has to mean more than “the page loads.” The service worker uses three different caching strategies on purpose: app-shell cache-first (the UI always loads), recipes stale-while-revalidate (capped at 20 so it doesn’t balloon), pantry network-first (you want the freshest list when you’re online).

Writes are the hard part offline. Every pantry mutation goes into an IndexedDB queue and replays on reconnect, so adding an onion with the Wi-Fi down isn’t lost — it syncs the moment the Pi is reachable again.
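The real queue lives in the browser on IndexedDB, so the following is only a model of the replay logic, written in Python to keep it short. The class name, the `send` callable, and the choice to stop at the first failure so order is preserved are assumptions for illustration.

```python
from collections import deque
from typing import Any, Callable, Deque, Dict

Mutation = Dict[str, Any]


class MutationQueue:
    """FIFO queue of pantry writes that survives failed sends."""

    def __init__(self, send: Callable[[Mutation], None]) -> None:
        # `send` is HYPOTHETICAL: it should raise ConnectionError when the Pi is unreachable.
        self._send = send
        self._pending: Deque[Mutation] = deque()

    def enqueue(self, mutation: Mutation) -> None:
        """Record the write locally first, whether or not we are online."""
        self._pending.append(mutation)

    def replay(self) -> int:
        """Send queued writes in order; stop at the first failure and keep the rest."""
        sent = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                break
            self._pending.popleft()  # drop a write only after it was delivered
            sent += 1
        return sent

    def __len__(self) -> int:
        return len(self._pending)
```

A write is removed only after a successful send, so a connection that drops mid-replay loses nothing; the remaining writes go out on the next reconnect.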

What this adds up to

None of these are exotic techniques. What’s interesting is that they all fall out of one honest measurement, 16 embeddings a second, taken seriously instead of wished away. The instant-paint, the downward-settling ranking, the parse-don’t-constrain JSON, the graceful 503: each is a direct answer to “the slow thing is slow, so don’t put the user behind it.”

The code is at github.com/PantryAtlas/pantryatlas (Apache 2.0); the navigator’s full API reference and ranking writeup are in the docs. If you’re curious how I get the 227 MB recipe database and a 1.9 GB SD image to the public for about zero dollars a month, that’s the distribution post.