Build Reliable, Accurate RAG.
The RAG Blueprint
Accelerate your path to production with a best-practice template that prioritizes retrieval quality, inference speed, and operational scale. Learn how cutting-edge RAG applications such as Perplexity are built on Vespa.
A Smarter Starting Point for RAG
The RAG Blueprint is a modular application template for designing, deploying, and testing production-grade RAG systems. Built on the same core architecture that powers Perplexity, it codifies best practices for building accurate and scalable retrieval pipelines using Vespa’s native support for hybrid search, phased ranking, and real-time inference. Designed for developers and architects, the Blueprint serves as a hands-on guide for production-ready implementations, helping teams move faster without compromising on quality or control.
Defining your searchable unit
The searchable unit is the unit of information that is searched and passed to the LLM, often referred to as a chunk. Chunking too finely can leave the LLM without enough context to produce good answers, and it increases system complexity and duplication, since context and metadata must be managed across a multitude of chunks. Overly large units cause performance problems because of the sheer volume of data being processed, and the excess context often reduces retrieval quality. Vespa lets you retrieve and score parts of large documents without managing chunks separately or duplicating metadata, while maintaining high performance even for large documents. This means you don't need to compromise between the retrieval phase and the generation phase: you can have the best of both worlds.
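To make the granularity trade-off concrete, here is a minimal, framework-agnostic chunking sketch. The word-based splitting and the overlap parameter are illustrative choices, not part of Vespa's API:

```python
def chunk_text(text: str, max_words: int, overlap: int = 0) -> list[str]:
    """Split text into word-based chunks of at most max_words words,
    with neighbouring chunks sharing `overlap` words of context."""
    words = text.split()
    step = max(1, max_words - overlap)
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):  # last chunk reached
            break
    return chunks

# Smaller max_words means more chunks and less context per chunk;
# larger max_words means fewer, heavier chunks.
print(chunk_text("one two three four five six seven", max_words=4, overlap=1))
```

Tuning `max_words` and `overlap` is exactly the compromise the text describes; partial retrieval over whole documents sidesteps it.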
Retrieval strategy
To find the best documents and context during retrieval you need to combine multiple methods and signals. Text-only search misses relevant content that uses different wording, whilst semantic (vector) search can miss key search terms. A hybrid strategy that combines traditional text-search scores with vector-search scores is necessary to get the best results. But that is only the beginning: to get truly excellent retrieval that fits your needs, you should utilise all relevant signals in your data. Perhaps you want more recent or more reputable sources to score higher? Or a higher score when a document or item aligns with user preferences? Vespa allows you to combine all these signals in any way you want, and even lets you update the weight of each signal on the fly for truly flexible and adaptable retrieval.
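As an illustration, such a hybrid score can be sketched as a weighted combination of signals. The signal names, values, and weights below are hypothetical; they mirror how a Vespa rank expression can mix text, vector, and metadata features, with weights adjustable per request:

```python
def hybrid_score(signals: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted linear combination of retrieval signals (illustrative)."""
    return sum(weights[name] * value for name, value in signals.items())

# Hypothetical per-document signal values.
docs = {
    "doc-recent": {"bm25": 0.4, "semantic": 0.5, "freshness": 0.9},
    "doc-older":  {"bm25": 0.6, "semantic": 0.6, "freshness": 0.1},
}
# Weights could be changed per query without redeploying.
weights = {"bm25": 1.0, "semantic": 1.0, "freshness": 0.5}
ranked = sorted(docs, key=lambda d: hybrid_score(docs[d], weights), reverse=True)
print(ranked)  # the freshness signal lifts "doc-recent" above "doc-older"
```

Raising the `freshness` weight favours recent sources; setting it to zero falls back to pure hybrid text-plus-vector ranking.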
Ranking and reranking retrieved documents
Good ranking functions are needed for quality retrieval, but the better they are, the more computationally expensive they become, and evaluating them over all your data is infeasible within the latency we have come to expect from modern retrieval systems. To surface the best documents while maintaining acceptable latency, ranking has to be done in phases. The first phase should be a computationally cheap function that quickly finds relevant candidates. These candidates can then be evaluated and reranked in a second phase by a better, more expensive function, so that only the very best documents are returned. With Vespa this phased approach is built right in, and you can make your ranking functions as simple or complex as you like.
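The two phases can be simulated in a few lines of plain Python. The cheap and expensive scoring functions here are stand-ins for real rank expressions, and the toy corpus is invented for the example:

```python
def phased_rank(docs, first_phase, second_phase, rerank_count):
    """Score all candidates with the cheap function, keep the top
    rerank_count, then rerank only those with the expensive function."""
    survivors = sorted(docs, key=first_phase, reverse=True)[:rerank_count]
    return sorted(survivors, key=second_phase, reverse=True)

# Toy documents: (id, cheap_signal, expensive_signal)
corpus = [("a", 0.9, 0.2), ("b", 0.8, 0.9), ("c", 0.7, 0.8), ("d", 0.1, 1.0)]
top = phased_rank(corpus,
                  first_phase=lambda d: d[1],
                  second_phase=lambda d: d[2],
                  rerank_count=3)
print([d[0] for d in top])  # ['b', 'c', 'a']
```

Note the trade-off this exposes: document "d" scores highest on the expensive signal but never reaches the second phase, which is why the first-phase function still needs to be reasonably good.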
Machine learning for optimised document ranking
Modern retrieval solutions need machine-learned models for document ranking. You can get decent results by combining signals manually in an intelligent way, but finding the right weight for each signal by hand is practically impossible. With Vespa you can collect training data from your application and use learned weights in your ranking functions to optimise retrieval. We recommend using a learned-weight linear combination of features in the first phase. For second-phase reranking you should use a more sophisticated method, e.g. a Gradient Boosted Decision Tree (GBDT) model; for this purpose Vespa supports LightGBM, XGBoost and ONNX models.
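For intuition, here is a minimal sketch of learning first-phase weights from collected (feature vector, relevance label) pairs, using ordinary least squares solved in closed form for two features. A real setup would use proper training tooling, more features, and a GBDT for the second phase; the training data below is synthetic:

```python
def fit_weights_2f(X, y):
    """Least-squares fit of two feature weights (no intercept)
    via the normal equations, solved in closed form for the 2x2 case."""
    a11 = sum(x[0] * x[0] for x in X)
    a12 = sum(x[0] * x[1] for x in X)
    a22 = sum(x[1] * x[1] for x in X)
    b1 = sum(x[0] * t for x, t in zip(X, y))
    b2 = sum(x[1] * t for x, t in zip(X, y))
    det = a11 * a22 - a12 * a12
    return (a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det

# Synthetic training data: (bm25, semantic) features with relevance labels
# generated from the "true" weights 0.7 and 0.3.
X = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
y = [0.7, 0.3, 1.0, 1.7]
w_text, w_semantic = fit_weights_2f(X, y)
print(w_text, w_semantic)  # recovers the weights that generated the labels
```

The learned weights then plug directly into a linear first-phase ranking function like the hybrid score above.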
Quantifying and evaluating ranking pipeline
When testing ranking methods and functions it is important to have a robust testing strategy. It can be tempting to evaluate a ranking function by throwing a few favourite queries at the system and manually judging whether the results are satisfactory; however, this is rarely thorough enough. Retrieval of unstructured and mixed data is messy, and your system may perform better on specific tasks or specific wordings of queries. You need a solid base of test queries, using many different wordings and targeting different sets of documents, in order to evaluate your system properly.
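Two standard metrics over such a test set are recall@k and mean reciprocal rank. The sketch below computes both from plain Python data; the Blueprint's own tooling may compute them differently:

```python
def recall_at_k(relevant: set, ranked: list, k: int) -> float:
    """Fraction of the relevant documents found in the top k results."""
    return len(relevant & set(ranked[:k])) / len(relevant)

def mean_reciprocal_rank(test_cases) -> float:
    """Average of 1/rank of the first relevant hit per query (0 if none).
    test_cases: iterable of (relevant_set, ranked_ids)."""
    total = 0.0
    for relevant, ranked in test_cases:
        total += next((1.0 / i for i, doc in enumerate(ranked, 1)
                       if doc in relevant), 0.0)
    return total / len(test_cases)

cases = [({"a"}, ["b", "a", "c"]), ({"c"}, ["c", "a"])]
print(recall_at_k({"a", "b"}, ["a", "c", "d"], 2))  # 0.5
print(mean_reciprocal_rank(cases))                  # (1/2 + 1) / 2 = 0.75
```

Tracking metrics like these over a broad, varied query set is what makes ranking changes comparable rather than anecdotal.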
Facilitating multiple use cases from a single application
RAG applications often need to accommodate several different use cases. Users sometimes want quick answers to simple queries; at other times they use deep-research modes where longer waiting times are accepted. You therefore need a way to adapt the query to the task at hand. You should define a number of query profiles to support the different use cases. Query profiles are essentially presets for querying Vespa, making different query types easy to define and easy to switch between, greatly reducing the need for additional logic.
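Conceptually, a query profile is a named preset merged into the query. The sketch below models this in plain Python with hypothetical profile names and parameter values; in Vespa itself, query profiles are defined in the application package and selected with the queryProfile request parameter:

```python
# Hypothetical presets; names, ranking profiles, and values are illustrative.
QUERY_PROFILES = {
    "quick-answer":  {"hits": 3,  "ranking": "first-phase-only", "timeout": "0.5s"},
    "deep-research": {"hits": 50, "ranking": "gbdt-rerank",      "timeout": "5s"},
}

def build_query(user_query: str, profile: str) -> dict:
    """Merge a user query with a named preset into a Vespa-style query body."""
    body = {"yql": "select * from sources * where userQuery()",
            "query": user_query}
    body.update(QUERY_PROFILES[profile])
    return body

print(build_query("what is phased ranking?", "quick-answer"))
```

Switching use cases then becomes a one-word change at the call site, with no extra branching logic in the application.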
Benefits of The RAG Blueprint
Accuracy You Can Trust
Built on proven architectural patterns, The RAG Blueprint prioritizes precision in retrieval and ranking—helping you deliver more relevant, reliable answers from day one.
Proven Practices
Develop confidently with a proven, production-grade reference application that guides implementation while preserving full control over retrieval quality and system design.
Scalable Performance, Built In
Designed for large-scale applications, the Blueprint leverages Vespa’s native support for low-latency, hybrid retrieval over billions of documents.
Perplexity uses Vespa.ai to power fast, accurate, and trusted answers for millions of users.
With Vespa RAG, Perplexity delivers accurate, near-real-time responses to more than 15 million monthly users and handles more than 100 million queries each week.
Sample Application
Follow the steps and deploy a trained, evaluated, and fully functional RAG application.
Python Notebook
This notebook goes through the code of The RAG Blueprint in more detail. Learn how to set up machine-learned ranking and how to properly test and evaluate a RAG application.