Build Reliable, Accurate RAG.
The RAG Blueprint
Accelerate your path to production with a best-practice template that prioritizes retrieval quality, inference speed, and operational scale. Learn how cutting-edge RAG applications such as Perplexity are built on Vespa.
A Smarter Starting Point for RAG
The RAG Blueprint is a modular application template for designing, deploying, and testing production-grade RAG systems. Built on the same core architecture that powers Perplexity, it codifies best practices for building accurate and scalable retrieval pipelines using Vespa’s native support for hybrid search, phased ranking, and real-time inference. Designed for developers and architects, the Blueprint serves as a hands-on guide for production-ready implementations, helping teams move faster without compromising on quality or control.
Defining your searchable unit
The searchable unit is the unit of information that is searched and passed to the LLM, often referred to as a chunk. Chunking too finely can leave the LLM without enough context to produce good answers, and it increases system complexity and duplication, since context and metadata must be managed across a multitude of chunks. Overly large units cause performance problems because of the sheer volume of data being processed, and the excess context often reduces retrieval quality. Vespa lets you retrieve and score parts of large documents without managing chunks separately or duplicating metadata, while maintaining high performance even for large documents. This means you don't need to compromise between the retrieval phase and the generation phase: you can have the best of both worlds.
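To make the granularity trade-off concrete, here is a minimal, framework-agnostic chunking sketch. The word-based splitting and the overlap parameter are illustrative choices, not part of Vespa's API:

```python
def chunk_text(text: str, max_words: int, overlap: int = 0) -> list[str]:
    """Split text into word-based chunks of at most max_words words,
    with neighbouring chunks sharing `overlap` words of context."""
    words = text.split()
    step = max(1, max_words - overlap)
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):  # last chunk reached
            break
    return chunks

# Smaller max_words means more chunks and less context per chunk;
# larger max_words means fewer, heavier chunks.
print(chunk_text("one two three four five six seven", max_words=4, overlap=1))
```

Tuning `max_words` and `overlap` is exactly the compromise the text describes; partial retrieval over whole documents sidesteps it.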
Retrieval strategy
To find the best documents and context during retrieval you need to combine multiple methods and signals. Text-only search misses relevant content that uses different wording, whilst semantic (vector) search can miss key search terms. A hybrid strategy that combines traditional text-search scores with vector-search scores is necessary to get the best results. But that is only the beginning: to get truly excellent retrieval that fits your needs, you should utilise all relevant signals in your data. Perhaps you want more recent or more reputable sources to score higher? Or a higher score when a document or item aligns with user preferences? Vespa allows you to combine all these signals in any way you want, and even lets you update the weight of each signal on the fly for truly flexible and adaptable retrieval.
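As an illustration, such a hybrid score can be sketched as a weighted combination of signals. The signal names, values, and weights below are hypothetical; they mirror how a Vespa rank expression can mix text, vector, and metadata features, with weights adjustable per request:

```python
def hybrid_score(signals: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted linear combination of retrieval signals (illustrative)."""
    return sum(weights[name] * value for name, value in signals.items())

# Hypothetical per-document signal values.
docs = {
    "doc-recent": {"bm25": 0.4, "semantic": 0.5, "freshness": 0.9},
    "doc-older":  {"bm25": 0.6, "semantic": 0.6, "freshness": 0.1},
}
# Weights could be changed per query without redeploying.
weights = {"bm25": 1.0, "semantic": 1.0, "freshness": 0.5}
ranked = sorted(docs, key=lambda d: hybrid_score(docs[d], weights), reverse=True)
print(ranked)  # the freshness signal lifts "doc-recent" above "doc-older"
```

Raising the `freshness` weight favours recent sources; setting it to zero falls back to pure hybrid text-plus-vector ranking.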
Ranking and reranking retrieved documents
Good ranking functions are needed for quality retrieval, but the better they are, the more computationally expensive they become, and evaluating them over all your data is infeasible within the latency we have come to expect from modern retrieval systems. To surface the best documents while maintaining acceptable latency, ranking has to be done in phases. The first phase should be a computationally cheap function that quickly finds relevant candidates. These candidates can then be evaluated and reranked in a second phase by a better, more expensive function, so that only the very best documents are returned. With Vespa this phased approach is built right in, and you can make your ranking functions as simple or complex as you like.
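The two phases can be simulated in a few lines of plain Python. The cheap and expensive scoring functions here are stand-ins for real rank expressions, and the toy corpus is invented for the example:

```python
def phased_rank(docs, first_phase, second_phase, rerank_count):
    """Score all candidates with the cheap function, keep the top
    rerank_count, then rerank only those with the expensive function."""
    survivors = sorted(docs, key=first_phase, reverse=True)[:rerank_count]
    return sorted(survivors, key=second_phase, reverse=True)

# Toy documents: (id, cheap_signal, expensive_signal)
corpus = [("a", 0.9, 0.2), ("b", 0.8, 0.9), ("c", 0.7, 0.8), ("d", 0.1, 1.0)]
top = phased_rank(corpus,
                  first_phase=lambda d: d[1],
                  second_phase=lambda d: d[2],
                  rerank_count=3)
print([d[0] for d in top])  # ['b', 'c', 'a']
```

Note the trade-off this exposes: document "d" scores highest on the expensive signal but never reaches the second phase, which is why the first-phase function still needs to be reasonably good.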
Machine learning for optimised document ranking
Modern retrieval solutions need machine-learned models for document ranking. You can get decent results by combining signals manually in an intelligent way, but finding the right weight for each signal by hand is practically impossible. With Vespa you can collect training data from your application and use learned weights in your ranking functions to optimise retrieval. We recommend using a learned-weight linear combination of features in the first phase. For second-phase reranking you should use a more sophisticated method, e.g. a Gradient Boosted Decision Tree (GBDT) model; for this purpose Vespa supports LightGBM, XGBoost and ONNX models.
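For intuition, here is a minimal sketch of learning first-phase weights from collected (feature vector, relevance label) pairs, using ordinary least squares solved in closed form for two features. A real setup would use proper training tooling, more features, and a GBDT for the second phase; the training data below is synthetic:

```python
def fit_weights_2f(X, y):
    """Least-squares fit of two feature weights (no intercept)
    via the normal equations, solved in closed form for the 2x2 case."""
    a11 = sum(x[0] * x[0] for x in X)
    a12 = sum(x[0] * x[1] for x in X)
    a22 = sum(x[1] * x[1] for x in X)
    b1 = sum(x[0] * t for x, t in zip(X, y))
    b2 = sum(x[1] * t for x, t in zip(X, y))
    det = a11 * a22 - a12 * a12
    return (a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det

# Synthetic training data: (bm25, semantic) features with relevance labels
# generated from the "true" weights 0.7 and 0.3.
X = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
y = [0.7, 0.3, 1.0, 1.7]
w_text, w_semantic = fit_weights_2f(X, y)
print(w_text, w_semantic)  # recovers the weights that generated the labels
```

The learned weights then plug directly into a linear first-phase ranking function like the hybrid score above.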
Quantifying and evaluating ranking pipeline
When testing ranking methods and functions it is important to have a robust testing strategy. It can be tempting to evaluate a ranking function by throwing a few favourite queries at the system and manually judging whether the results are satisfactory; however, this is rarely thorough enough. Retrieval of unstructured and mixed data is messy, and your system may perform better on specific tasks or specific wordings of queries. You need a solid base of test queries, using many different wordings and targeting different sets of documents, in order to evaluate your system properly.
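Two standard metrics over such a test set are recall@k and mean reciprocal rank. The sketch below computes both from plain Python data; the Blueprint's own tooling may compute them differently:

```python
def recall_at_k(relevant: set, ranked: list, k: int) -> float:
    """Fraction of the relevant documents found in the top k results."""
    return len(relevant & set(ranked[:k])) / len(relevant)

def mean_reciprocal_rank(test_cases) -> float:
    """Average of 1/rank of the first relevant hit per query (0 if none).
    test_cases: iterable of (relevant_set, ranked_ids)."""
    total = 0.0
    for relevant, ranked in test_cases:
        total += next((1.0 / i for i, doc in enumerate(ranked, 1)
                       if doc in relevant), 0.0)
    return total / len(test_cases)

cases = [({"a"}, ["b", "a", "c"]), ({"c"}, ["c", "a"])]
print(recall_at_k({"a", "b"}, ["a", "c", "d"], 2))  # 0.5
print(mean_reciprocal_rank(cases))                  # (1/2 + 1) / 2 = 0.75
```

Tracking metrics like these over a broad, varied query set is what makes ranking changes comparable rather than anecdotal.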
Facilitating multiple use cases from a single application
RAG applications often need to accommodate several different use cases. Users sometimes want quick answers to simple queries; at other times they use deep-research modes where longer waiting times are accepted. You therefore need a way to adapt the query to the task at hand. You should define a number of query profiles to support the different use cases. Query profiles are essentially presets for querying Vespa, making different query types easy to define and easy to switch between, greatly reducing the need for additional logic.
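Conceptually, a query profile is a named preset merged into the query. The sketch below models this in plain Python with hypothetical profile names and parameter values; in Vespa itself, query profiles are defined in the application package and selected with the queryProfile request parameter:

```python
# Hypothetical presets; names, ranking profiles, and values are illustrative.
QUERY_PROFILES = {
    "quick-answer":  {"hits": 3,  "ranking": "first-phase-only", "timeout": "0.5s"},
    "deep-research": {"hits": 50, "ranking": "gbdt-rerank",      "timeout": "5s"},
}

def build_query(user_query: str, profile: str) -> dict:
    """Merge a user query with a named preset into a Vespa-style query body."""
    body = {"yql": "select * from sources * where userQuery()",
            "query": user_query}
    body.update(QUERY_PROFILES[profile])
    return body

print(build_query("what is phased ranking?", "quick-answer"))
```

Switching use cases then becomes a one-word change at the call site, with no extra branching logic in the application.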
Benefits of The RAG Blueprint
Accuracy You Can Trust
Built on proven architectural patterns, The RAG Blueprint prioritizes precision in retrieval and ranking—helping you deliver more relevant, reliable answers from day one.
Proven Practices
Develop confidently with a proven, production-grade reference application that guides implementation while preserving full control over retrieval quality and system design.
Scalable Performance, Built In
Designed for large-scale applications, the Blueprint leverages Vespa’s native support for low-latency, hybrid retrieval over billions of documents.
Perplexity uses Vespa.ai to power fast, accurate, and trusted answers for millions of users.
With Vespa RAG, Perplexity delivers accurate, near-real-time responses to more than 15 million monthly users and handles more than 100 million queries each week.
Sample Application
Follow the steps and deploy a trained, evaluated, and fully functional RAG application.
Python Notebook
This notebook goes through the code of The RAG Blueprint in more detail. Learn how to set up machine-learned ranking and how to properly test and evaluate a RAG application.