Efficient Ranking at Any Scale

Multi-phase ranking progressively refines candidate documents, applying the right amount of compute at each stage to improve relevance while minimizing latency and cost.

Role in the AI Retrieval Workflow

Multi-phase ranking is the stage of the AI Retrieval Workflow that determines which candidate documents are most relevant before results are returned to the application.
Vespa's multi-phase ranking lets you use increasingly sophisticated ranking models without paying the cost of running them on every matching document. Fast, lightweight ranking identifies the most promising candidates before richer models rerank only the results that matter. The result is higher relevance, lower latency, and more efficient use of compute, even at large scale.

The Right Amount of Compute, at the Right Stage

Search, recommendation, and RAG systems are under pressure to evaluate more signals than ever: lexical relevance, vector similarity, freshness, popularity, personalization, business rules, model scores, even LLM-based relevance judgments. Running all of that against every matched document doesn't scale because it is too slow and too expensive.
Vespa's multi-phase ranking solves this by ranking in stages instead of all at once: fast ranking first, richer ranking later, and expensive models only on the best candidates. The practical effect is that expensive compute goes toward results that actually have a shot at the top of the page, not toward scoring documents the user will never see.
That staging is what makes it possible to:

Cut serving cost without cutting relevance. Expensive models only ever process a few hundred candidates rather than millions, so you get cross-encoder-level quality without cross-encoder-level infrastructure spend.
Hit sub-second latency at scale. Cheap signals filter the field early; the system never pays the cost of heavy inference on documents that were never going to rank well anyway.
Improve relevance without a re-architecture. Add a new model or signal at the phase where it belongs without redesigning the retrieval layer underneath it.
Tune the cost/quality tradeoff yourself. Controls like rerank-count let you decide exactly how many candidates get the expensive treatment, per application, per query, even per experiment.
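
To make the last point concrete, here is a rough back-of-the-envelope sketch in Python. The match counts and node count are hypothetical, and `expensive_evaluations` is an illustrative helper, not part of Vespa. It assumes that second-phase reranking is applied per content node, so the expensive model runs on at most rerank-count documents on each node.

```python
def expensive_evaluations(matches_per_node: list[int], rerank_count: int) -> int:
    """Upper bound on second-phase model evaluations for one query.

    Each content node reranks at most `rerank_count` of its own matches,
    so a node with fewer matches than that contributes only what it has.
    """
    return sum(min(m, rerank_count) for m in matches_per_node)


# Hypothetical query: 4 content nodes, 2.1 million matches in total.
matches = [600_000, 550_000, 500_000, 450_000]

single_phase = sum(matches)                                   # expensive model on every match
two_phase = expensive_evaluations(matches, rerank_count=100)  # expensive model on top 100 per node

print(single_phase)  # 2100000
print(two_phase)     # 400
```

Under these assumptions, raising or lowering `rerank_count` changes the expensive work linearly with the number of nodes, independent of how many documents matched.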

How Multi-phase Ranking Works

A fast first-phase ranking narrows millions of candidates down to a manageable set. The second phase reranks the strongest candidates on each content node. An optional global phase reranks the final top results after they've been merged across the cluster. The result: you can use cross-encoders, ONNX models, and other expensive ranking logic exactly where they add the most value, without paying for them everywhere else.
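
The flow described above can be sketched as a small, self-contained Python simulation. This is not Vespa code: the function names, the feature names (`bm25`, `freshness`, `quality`), and the scoring formulas are all made up for illustration, and the sketch simplifies by having each node return only its reranked hits.

```python
import heapq


def rank_on_node(docs, first_phase, second_phase, rerank_count):
    """First phase scores every match on the node; the second phase
    rescores only the `rerank_count` best first-phase hits."""
    top = heapq.nlargest(rerank_count, docs, key=first_phase)
    return [(second_phase(d), d) for d in top]


def multi_phase_rank(nodes, first_phase, second_phase, global_phase,
                     rerank_count, global_rerank_count, hits):
    merged = []
    for docs in nodes:  # each entry stands in for one content node
        merged.extend(rank_on_node(docs, first_phase, second_phase, rerank_count))
    # Merge across nodes, ordered by second-phase score.
    merged.sort(key=lambda pair: pair[0], reverse=True)
    # Global phase: rescore only the best merged candidates.
    candidates = [doc for _, doc in merged[:global_rerank_count]]
    return sorted(candidates, key=global_phase, reverse=True)[:hits]


# Hypothetical documents spread over two content nodes.
nodes = [
    [{"id": "a", "bm25": 3.0, "freshness": 0.1, "quality": 0.2},
     {"id": "b", "bm25": 2.5, "freshness": 0.9, "quality": 0.9},
     {"id": "c", "bm25": 0.5, "freshness": 1.0, "quality": 1.0}],
    [{"id": "d", "bm25": 2.8, "freshness": 0.5, "quality": 0.6},
     {"id": "e", "bm25": 2.0, "freshness": 0.2, "quality": 0.8},
     {"id": "f", "bm25": 0.1, "freshness": 0.0, "quality": 0.95}],
]

result = multi_phase_rank(
    nodes,
    first_phase=lambda d: d["bm25"],                         # cheap lexical score
    second_phase=lambda d: d["bm25"] + 2 * d["freshness"],   # richer per-node score
    global_phase=lambda d: d["quality"],                     # stand-in for a costly model
    rerank_count=2, global_rerank_count=3, hits=2,
)
print([d["id"] for d in result])  # ['b', 'd']
```

Note that documents `c` and `f` would score highest in the stand-in global phase but never reach it, because the cheap first phase filtered them out. That is the tradeoff the rerank counts control.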

Without phased ranking, teams are stuck choosing between two bad options: run a cheap model everywhere and leave relevance on the table, or run an expensive model everywhere and pay for it in latency and infrastructure costs. Multi-phase ranking removes most of that tradeoff. You get accuracy close to that of the expensive model at a cost profile close to that of the cheap one, because each runs only where it has earned its keep.

Configuring Multi-Phase Ranking

Multi-phase ranking is defined through rank profiles in the Vespa application schema. A single application can define multiple rank profiles for different use cases, markets, or experiments, and select between them at query time.
This isn't a fixed pipeline. It's a set of controls your team owns:

Define ranking expressions: Define custom ranking logic directly in rank profiles, so relevance logic lives in your schema, not buried in application code.
Combine hybrid signals: Combine text, vector, tensor, metadata, and business signals into a single pipeline, rather than stitching results from separate systems.
Deploy machine learning models: Run ONNX and tree-based models in second-phase or global-phase ranking, where the candidate set is already small enough that inference is fast and affordable.
Control ranking at query time: Select a rank profile and pass query-time inputs, so ranking behavior changes by user, market, or experiment without a redeploy or downtime.
Profile ranking: See exactly where ranking costs are going by phase, so tuning decisions are based on data rather than guesswork.
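
As a minimal sketch of the query-time controls, the Python snippet below assembles a JSON request body that selects a rank profile and passes a query-time input. The profile name `hybrid_rerank`, the tensor name `user_embedding`, and the helper `build_query` are hypothetical. The parameter names `ranking.profile`, `input.query(...)`, and `ranking.rerankCount` are given as I understand the Vespa Query API and should be checked against the Query API reference before use.

```python
import json
from typing import Optional


def build_query(user_query: str, profile: str, user_embedding: list,
                rerank_count: Optional[int] = None) -> dict:
    """Build a request body for a search request (nothing is sent here)."""
    body = {
        "yql": "select * from sources * where userQuery()",
        "query": user_query,
        # Assumed parameter name for selecting the rank profile.
        "ranking.profile": profile,
        # Assumed form for passing a query tensor declared in the profile.
        "input.query(user_embedding)": user_embedding,
    }
    if rerank_count is not None:
        # Assumed per-query override of the profile's second-phase rerank-count.
        body["ranking.rerankCount"] = rerank_count
    return body


body = build_query("wireless headphones", "hybrid_rerank",
                   [0.12, -0.40, 0.88], rerank_count=200)
print(json.dumps(body, indent=2))
```

Because the profile and inputs travel with the request, two queries against the same deployed application can be ranked differently simply by sending different bodies.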

Learn with Vespa

Learn how to build search, recommendation, and RAG applications with Vespa through a free, self-paced course that combines hands-on exercises with links to the documentation.

Phased Ranking: Documentation for first-phase, second-phase, and global-phase ranking.
Ranking Expressions and Features: How Vespa defines ranking logic in rank profiles.
Ranking Basics: Introductory explanation of ranking in Vespa.
Ranking with ONNX Models: Guidance on using ONNX models, especially in second- and global-phase ranking.
Query API: How applications select rank profiles and pass query-time inputs.
RAG Blueprint Tutorial: A practical example of first-, second-, and global-phase ranking in a RAG application.
Query API Reference: Profiling options for ranking performance.

What is multi-phase ranking?

Multi-phase ranking is Vespa's approach to evaluating search and recommendation results in stages: fast and broad first, then progressively more expensive and narrow, so that costly models like cross-encoders run only on a small set of top candidates.

How many ranking phases does Vespa support?

Three: first-phase (runs on all matches, per content node), second-phase (optional, reranks top candidates per content node), and global-phase (optional, reranks the merged top results in the stateless container).

Why not just rank everything using the best available model?

Cost and latency. Running an expensive model, such as a cross-encoder or a large ONNX model, against every matched document at scale is computationally prohibitive. Multi-phase ranking reserves expensive models for the candidates that have already proven relevant in earlier, cheaper phases.

Can I use ONNX models in Vespa's ranking pipeline?

Yes. ONNX models, along with gradient-boosted models like XGBoost and LightGBM, can be used in second-phase or global-phase ranking, where the candidate set is small enough to make inference fast and affordable.

Does ranking behavior have to be the same for every query?

No. Rank profiles can be defined per use case, market, or experiment, and selected dynamically at query time, so the same Vespa application can rank differently for different users or contexts without a redeploy.
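
One way an application might pick the profile per request is sketched below in Python. Everything here is application-side and hypothetical: the profile names, the market mapping, and the `choose_rank_profile` helper are illustrative assumptions, not Vespa features. The chosen name would then be sent as the rank profile for that query.

```python
import hashlib

# Hypothetical mapping from market to a rank profile defined in the schema.
PROFILES_BY_MARKET = {"us": "ranking_us", "de": "ranking_de"}
FALLBACK_PROFILE = "ranking_global"


def choose_rank_profile(user_id: str, market: str, experiment_percent: int = 10) -> str:
    """Pick a rank profile name for this request.

    Users are assigned to a stable bucket in [0, 100) by hashing their id,
    so the same user always sees the same variant.
    """
    base = PROFILES_BY_MARKET.get(market, FALLBACK_PROFILE)
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return f"{base}_experiment" if bucket < experiment_percent else base


print(choose_rank_profile("user-42", "de"))
print(choose_rank_profile("user-42", "fr", experiment_percent=0))  # ranking_global
```

Hashing the user id rather than drawing a random number keeps assignment deterministic across requests, which makes experiment results easier to interpret.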