The Right Amount of Compute, at the Right Stage
Search, recommendations, and RAG systems are under pressure to evaluate more signals than ever: lexical relevance, vector similarity, freshness, popularity, personalization, business rules, model scores, even LLM-based relevance judgments. Running all of that against every matched document doesn't scale: it's too slow, and too expensive.
Vespa's multi-phase ranking solves this by ranking in stages instead of all at once. Fast ranking first, richer ranking later, and expensive models only on the best candidates. The practical effect: every dollar of compute goes toward results that actually have a shot at the top of the page, not toward scoring documents the user will never see.
That staging is what makes it possible to:
- Cut serving cost without cutting relevance. Expensive models only ever process a few hundred candidates rather than millions, so you get cross-encoder-level quality without cross-encoder-level infrastructure spend.
- Hit sub-second latency at scale. Cheap signals filter the field early; the system never pays the cost of heavy inference on documents that were never going to rank well anyway.
- Improve relevance without a re-architecture. Add a new model or signal at the phase where it belongs without redesigning the retrieval layer underneath it.
- Tune the cost/quality tradeoff yourself. Controls like rerank-count let you decide exactly how many candidates get the expensive treatment, per application, per query, even per experiment.