Vespa Now: 2025 Year in Review
Vespa Now Webinar Series
Register to get quarterly updates on Vespa product enhancements.
Vespa Now: 2025 Year in Review
Get an inside look at Vespa’s product innovations in 2025, now available to power your next generation of Retrieval-Augmented Generation (RAG), real-time search, and personalization experiences.
Get an overview of enhancements for:
– Query & Indexing Performance
– Ranking, Retrieval & Hybrid Search
– Vector/Tensor/Multimodal Support
– Developer Tooling & APIs
– Cloud Infrastructure & Operability
This transcript was computer generated and may contain errors.
Bonnie Chase:
Hi, everyone.
Thanks for joining us today at Vespa Now: 2025 Year in Review. In this session, we're excited to walk you through the major Vespa product improvements delivered this year, enhancements designed to help teams build faster, more accurate, and more scalable AI search and RAG applications. Now before we get started, I do have a few housekeeping notes. This session is being recorded, and you will get a copy of the recording and the slides within forty-eight hours after the end of the webinar. And we'd love to get your thoughts and questions, so you'll see a few options on your screen in the right-hand toolbar. First, the chat option: you can say hi and drop in some comments. Second, you'll see a Q&A option.
Drop your questions here, and we'll be addressing these throughout the session. We have a lot of information to cover today, so we want to make sure that we get to all of your questions. And if we're not able to answer them at the end of this session, we'll definitely get back to you via email. Finally, you'll see a resources tab where we've dropped a few links for you.
Now with that, we’ll go ahead and get started.
So over the past year, Vespa's engineering team has focused on three core themes: performance, retrieval quality, and operational simplicity. Today, we'll share how these product updates enable you to ship production-grade AI applications with less complexity and more control. I'll start with a quick high-level overview of Vespa, and then we'll dive into the updates within each theme. Along the way, we'll highlight several spotlight updates that Thomas will dive into a little deeper. Now we have a lot to cover, and we want to keep things digestible, so again, we'll share the links at the end, and you can explore each update more deeply at your own pace.
Now as a quick reminder for those of you who may not be as familiar with Vespa, we are the fastest and most scalable AI search platform for building enterprise-grade RAG, search, and recommendation systems. And this is through unifying vector, text, structured data, and ML ranking into one high-performance engine, which really helps create fast, trustworthy, and massively scalable AI applications. With that, let's jump into query and indexing performance.
This year, we delivered major boosts to core retrieval speed. Lexical queries are now up to three times faster, grouping and facet logic is much more efficient, and the new compact tensor format reduces memory footprint for vector-heavy workloads. We've also hardened ingestion with automatic binary detection so bad data doesn't interrupt indexing.
This is, at a high level, some of the key features that we've improved for query and indexing performance. I'll go ahead and quickly walk through some of these, and as you can see, Thomas will highlight the spotlight items for us.
Now for automatic instance migration with Vespa Cloud: we know that cloud providers like AWS, Azure, and GCP are constantly rolling out faster and more efficient instance types. So starting in June 2025, Vespa Cloud automatically moves your nodes to these newer generations as they become available. There is really no action needed from your team, and you get better performance with no downtime. Again, if you're on Vespa Cloud, you'll have started seeing these automatic rollouts. And if you're not on Vespa Cloud yet, this is something to take into consideration, since you do get that better performance with no downtime.
We've also added binary data detection in string fields. This is really about ensuring that if messy or corrupt data gets into your Gen AI pipelines, we can automatically detect and reject it. So this really helps keep your application stable, secure, and reliable.
We also added rank-score-drop-limit for more efficient global ranking. If you're familiar with this capability, Vespa's rank-score-drop-limit lets the system discard candidate hits whose scores fall below a certain threshold, those items that can't possibly make it into the final top-k results. Before, this limit only applied at the node-local ranking stage, but after this update, the drop limit also applies during the global ranking phase.
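To illustrate the idea, here is a tiny plain-Python sketch. This is purely conceptual, not Vespa code or its API; it just shows how hits scoring at or below a drop limit are discarded before they compete for the final top-k:

```python
# Plain-Python illustration of the *idea* behind a rank-score-drop-limit.
# Not Vespa's implementation or API: hits scoring at or below the limit are
# discarded before they compete for the final top-k.
import heapq

def top_k_with_drop_limit(scored_hits, k, drop_limit):
    """scored_hits: iterable of (doc_id, score) pairs collected from content nodes."""
    survivors = ((doc, score) for doc, score in scored_hits if score > drop_limit)
    return heapq.nlargest(k, survivors, key=lambda pair: pair[1])

hits = [("a", 0.91), ("b", 0.12), ("c", 0.47), ("d", -1.0)]
print(top_k_with_drop_limit(hits, k=2, drop_limit=0.0))  # [('a', 0.91), ('c', 0.47)]
```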
We've also added geo bounding box filtering. With this update, you can now filter search results to a specific rectangle on a map using the new geo bounding box operator. This is an easy way to limit results to an exact area like a neighborhood or delivery zone. It works alongside Vespa's existing point-and-radius geo filters and plugs directly into your ranking pipeline, so you get precise geographic filtering without extra complexity.
And then, filtering and predicates inside grouping. We've made grouping much more powerful: you can now add filters directly inside your grouping expressions using simple logic as well as range filters. That means you can run analytics and apply exclusions all at once, without extra queries or post-processing. For example, you can group purchases by customer while filtering out certain sales reps or ignoring prices within a specific range, all in the same query. We've also added better control for time-based analytics, so you can now, for example, set the time zone when using time functions, so that metrics like purchases by hour reflect your actual business hours.
So that's a high level on those first few items. Now I want to bring in Thomas, a senior engineer at Vespa, to dig a little deeper into our faster lexical search queries and our HNSW improvements.
Thomas?
Thomas Thoresen:
Yeah. Thank you, Bonnie.
So lexical search is a cornerstone of most of our customers' applications. And at the core of lexical search is our weakAnd query operator. WeakAnd is designed for less strict AND matching with improved performance. It allows you to combine multiple conditions, terms, across different fields, retrieving documents that may not match all terms, but at least some of them.
And although weakAnd is very efficient by itself, it may retrieve or match many documents, especially when the query contains a lot of common terms.
And since all the matched documents will be exposed to our first-phase ranking stage, it can sometimes make sense to limit or filter which terms are used, so that queries do not get slower than acceptable.
So to mitigate this, we have added some new configuration parameters that allow you to control this in more detail.
The first of these new parameters is the stopword limit. As an example, if you set the value for this to 0.6, it will drop all query terms that occur in at least 60% of the documents. The reasoning is that such terms do not bring much value to the query.
The second one is the adjust target. This works on documents. As an example, if you set this to 0.01, it excludes documents that only contain terms occurring in more than approximately 1% of the document corpus. The actual threshold is query-specific and based on the query term whose document frequency is closest to 1%. And the third one is the filter threshold. This allows you to use more compact posting lists for common terms. So if you set the filter threshold to 0.05, all terms that are estimated to occur in more than 5% of the documents are handled with compact posting lists, represented as bit vectors, instead of the full posting lists. This makes matching a lot faster, at the cost of producing less information for BM25 ranking, as only a Boolean signal is available.
And last, we also added the option allow-drop-all.
The default behavior of the weakAnd operator is to always keep at least one term, even if all of them are considered stopwords. This is to avoid returning no hits.
But when combining with semantic search in a hybrid search setting, it may often be desirable to actually drop all weakAnd terms and rely purely on the semantic matches in these cases.
If you set allow-drop-all, this behavior is enabled.
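To make this a bit more concrete, here is a minimal PyVespa sketch of passing these knobs at query time. The endpoint, the schema name, and the exact ranking.matching.* parameter names below are assumptions to verify against the Vespa query API reference for your version:

```python
# Minimal PyVespa sketch of passing the new weakAnd knobs at query time.
# Assumptions to verify against your Vespa version's query API reference:
# the exact ranking.matching.* parameter names, the local endpoint, and the
# schema name used here.
from vespa.application import Vespa

app = Vespa(url="http://localhost", port=8080)  # assumed local endpoint

response = app.query(body={
    "yql": "select * from doc where userQuery()",   # assumed schema "doc"
    "query": "how to tune lexical search in vespa",
    # Drop query terms that occur in at least 60% of the documents.
    "ranking.matching.weakand.stopwordLimit": 0.6,
    # Skip documents whose only matching terms occur in more than ~1% of the corpus.
    "ranking.matching.weakand.adjustTarget": 0.01,
    # Use compact bit-vector posting lists for terms in more than 5% of documents.
    "ranking.matching.filterThreshold": 0.05,
    # Allow dropping every weakAnd term, e.g. in a hybrid setup.
    "ranking.matching.weakand.allowDropAll": True,
})
print(len(response.hits))
```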
And if we move on to the HNSW improvements, we also have a lot of goodies for you here.
So Vespa has supported approximate nearest neighbor search with HNSW, or the hierarchical navigable small world algorithm, for more than five years.
This year, we added a significant improvement to our HNSW implementation.
The issue that we sought to resolve was with queries that combine HNSW search with a filter, such as, for example, filtering on product price range or category in an e-commerce product search.
Before this recent improvement, these queries were processed as follows: while performing an HNSW search, if a document with a low distance to the query is found, you check whether it passes the filter, and if it does, add it to the candidate queue.
And this works well if most of the documents pass the filter.
But in some cases, if the filter is very restrictive, a large part of the HNSW graph is traversed and a lot of vector distances are computed, most of them for documents that do not even pass the filter.
This can be observed in the plot to the right, where you can see that the response time spikes when the filter gets very restrictive. The solution is an algorithm inspired by the ACORN paper published last year.
And the approach is that instead of computing the distance to the query for all neighbors of a node and then checking which of the relevant neighbors pass the filter, like we did before, we take the neighbors and the two-hop neighbors of the node, and only for those that pass the filter do we compute their distance to the query vector to see if they are relevant. The threshold for when this should apply is controlled by the new filter-first-threshold parameter.
And you should note that for some hit ratios, that is, the strength of the filter, this might produce a dip in recall. This can be mitigated by also checking the three-hop neighbors of a candidate node.
But this, again, can get very costly for some nodes.
Yeah, if you take two times 16 and raise it to the power of three, you get 32,768 candidates to be explored. Thus, we only want to check the three-hop neighbors for nodes where the neighborhood is not too large.
How this is done is controlled by the filter-first-exploration parameter.
And this actually allows one to finely control the trade-off between response time and recall for approximate nearest neighbor search with filters.
All of these parameters can also be passed at query time for easier experimentation.
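Here is a sketch of that kind of query-time experimentation with PyVespa. The ranking.matching.* parameter names, the schema with an "embedding" tensor field, and the "semantic" rank profile declaring the query tensor q are assumptions; verify them against your own application and the Vespa query API reference:

```python
# Sketch of experimenting with the filter-first parameters at query time using
# PyVespa. Assumed: parameter names mirroring the settings described above, a
# schema "product" with an "embedding" tensor field, and a rank profile
# "semantic" declaring query tensor q -- verify before relying on this.
from vespa.application import Vespa

app = Vespa(url="http://localhost", port=8080)  # assumed local endpoint

response = app.query(body={
    "yql": (
        "select * from product where "
        "{targetHits: 100}nearestNeighbor(embedding, q) and price < 50"
    ),
    "input.query(q)": [0.1, 0.2, 0.3],  # toy query vector; use a real embedding
    "ranking": "semantic",
    # Use the filter-first strategy once the filter is at least this restrictive.
    "ranking.matching.filterFirstThreshold": 0.05,
    # How far to explore extra (three-hop) neighbors for filtered candidates.
    "ranking.matching.filterFirstExploration": 0.3,
})
print(len(response.hits))
```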
Yes, that wraps up the HNSW improvements, and you can move on to the next slide, Bonnie.
Bonnie Chase:
Great. Thanks, Thomas.
Now another huge theme for 2025 was better retrieval and ranking for LLM-based applications. We added integrated chunking, chunk-level scoring, layered ranking, and full transparency in the global phase. These make RAG pipelines significantly more accurate and easier to reason about.
I'll go ahead and start with the first one, which is the global-phase relevance score. When Vespa ranks results, it gives each document a score that reflects how relevant it is, and that score is created early in the ranking process.
With the latest update, that relevance score is now also available at the final stage, where Vespa combines results from all nodes and decides the final order.
We also added support for several newer, high-quality ModernBERT embedding models. These models generate richer, more accurate vector representations of text, which directly improves semantic search, RAG, and recommendation quality. On Vespa Cloud, using these models is simple: you just reference them in your services.xml file, and Vespa takes care of hosting, scaling, and serving the models for inference.
We also created a compact tensor representation. As you know, sparse and dense vectors are commonly mixed and used for embeddings and semantic features. These tensors can become large, and the traditional JSON format adds significant overhead. But with our latest update, you can start using a compact text representation that dramatically reduces serialized size and eliminates float-by-float JSON parsing. The result is faster ingestion, lower latency, better efficiency, and improved overall performance. Now I'll hand it over to Thomas to dig in deeper on layered ranking.
Thomas Thoresen:
Thank you. Layered ranking is a really interesting concept which is quite valuable to several of our customers, and I'll try to explain why. If prompt engineering was the new word of 2024, context engineering has become even more important in 2025.
This term is used to describe the process of engineering the right context, in the form of tokens or representations of words, to make the LLMs more effective at fulfilling their task. It has been shown in research that LLMs perform significantly worse if they are provided with irrelevant context.
In addition, LLM inference is commonly priced per token, and the inference time increases quadratically with the number of tokens in the input. This is why it's so important to provide relevant documents and information when retrieving documents to pass to an LLM.
Additionally, when embedding documents for semantic search, chunking has become a common approach to limit the amount of information to embed at once, since embedding texts that are too long makes it harder to capture the inherent information and thus to retrieve the relevant text.
When chunking documents, you are forced to choose one of two approaches when modeling and indexing your documents. Option one is to index at the chunk level, but this forces you to duplicate all the relevant document metadata that may also influence the relevance of each chunk.
Option two is to index at the document level, with the chunks as an array field of the document. But this has the significant drawback that you only rank at the document level, and then you will need to return all the chunks of the document to an LLM, even though maybe only a single chunk or a few chunks of that document are actually relevant.
Vespa has now figured out a way to get you the best of both worlds, and personally, I think this is a really underappreciated feature that deserves even more recognition. Maybe it will get it as more people use it. But certainly, our customers doing RAG at web scale already use and praise it.
So the approach enabled by Vespa is to index documents with chunks as an array field of the document.
This solves the duplicate-metadata issue.
And then, in order to solve the issue of returning all the chunks to the LLM, we have introduced a way to specify how the elements to be returned should be selected.
And the cool thing about this is that this can be a summary feature calculated by a rank expression, allowing our customers to use lexical scores, semantic scores, or any other chunk-level attributes and combine these through a tensor expression however they like.
This is incredibly powerful.
And I think many have yet to discover this.
But those that do are sure to have a real edge, as I have not seen any other vector DB or search engine providing a real solution to this problem.
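As a conceptual illustration only, here is a plain-Python sketch of the kind of selection layered ranking performs server-side via a summary feature and tensor expression. This is not the Vespa API; it just shows the idea of scoring every chunk of a document with a blend of chunk-level signals and returning only the best few:

```python
# Conceptual, plain-Python sketch of the selection that layered ranking does
# server-side via a summary feature / tensor expression. NOT the Vespa API;
# it only illustrates the idea: score each chunk of a document with a blend
# of chunk-level signals, then hand only the best few chunks to the LLM.

def select_chunks(chunk_bm25, chunk_similarity, top_n=2, text_weight=0.3):
    """chunk_bm25 / chunk_similarity: per-chunk scores for a single document."""
    combined = [
        text_weight * bm25 + (1 - text_weight) * sim
        for bm25, sim in zip(chunk_bm25, chunk_similarity)
    ]
    # Indices of the top_n best-scoring chunks, best first.
    return sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)[:top_n]

# Example: a document with four chunks; only chunks 2 and 0 are returned.
print(select_chunks([1.2, 0.1, 2.5, 0.3], [0.62, 0.40, 0.81, 0.55]))  # [2, 0]
```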
Bonnie Chase:
Yeah. Great. Yeah. It sounds very flexible and very powerful. Thanks, Thomas.
Now we also invested heavily in developer productivity. You know, we have a new Python query builder, PyVespa relevance and match evaluators, and summary-features inheritance. These accelerate experimentation and reduce time to market.
And I'll explain more at a high level, starting with the new query builder. With this new query builder, Python developers can now also generate YQL, and this makes it easier to create valid Vespa queries in code.
We also have the Vespa match evaluator, which is a helper tool in PyVespa that lets you measure how well your search configuration matches the documents it should match.
With the summary-features inheritance update, when you inherit from multiple rank profiles, Vespa now also carries over their summary features, which gives you more flexibility and consistency in building complex ranking strategies without redundant definitions. Before this update, if you created a combined rank profile, Vespa didn't inherit the summary features from both parents; you had to redeclare them manually. But now you can simply inherit, and Vespa will include all of the summary features from both parents.
With the CLI multi-get update, you can get a list of documents. And this update was actually submitted by one of our customers. So, again, we'd love to hear your feedback.
Feel free to join our Slack community and contribute your feature requests, things that you'd like to see in the product, and they can end up in the latest product update.
Now, getting data into and out of systems is often time-consuming and challenging, but with Vespa's Logstash plugins, this is much easier. This can be, for example, importing data from a CSV file, Postgres, or Elasticsearch, or from self-hosted Vespa to Vespa Cloud. We have a blog post about this, but yeah, this makes getting data in and out much easier.
And I'll hand it over to Thomas to talk about the RAG Blueprint.
Thomas Thoresen:
So the reason why we made this RAG blueprint was that we had worked with many customers doing RAG, and there was a theme that we noticed of common challenges they were facing and questions they were asking. Some of the challenges were overcome by advising them on how to use existing Vespa features, and for some of them we also had to develop new features, such as, for instance, the elementwise BM25 ranking feature, built-in chunking, and the layered ranking framework that I talked about earlier. So Jon, our CEO, first gathered a series of recommendations into a slide deck.
But we also wanted to have a practical reference implementation of an application that showed how these could be combined.
And this was the reason why the RAG Blueprint was born.
During development, we noticed that some generic evaluation tooling would be very valuable to our users, so we added the Vespa evaluator and match evaluator, which were mentioned earlier, to our Python SDK.
These classes make it easier to do quality evaluation of a Vespa application across a variety of common information retrieval metrics.
So all you need for this is a set of queries, as well as labels for the relevant document IDs for each query.
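For a rough idea of how that looks in practice, here is a sketch using the evaluator classes in the PyVespa SDK. The class, argument, and method names, as well as the rank profile and schema names, are assumptions based on the PyVespa documentation at the time of writing; verify them against your installed pyvespa version:

```python
# Sketch of evaluating retrieval quality with PyVespa's evaluator classes.
# Class/argument/method names are assumptions to verify against your pyvespa
# version; the "doc" schema and "hybrid" rank profile are placeholders.
from vespa.application import Vespa
from vespa.evaluation import VespaEvaluator

app = Vespa(url="http://localhost", port=8080)  # assumed local endpoint

# Queries keyed by id, plus the ids of the documents relevant to each query.
queries = {"q1": "how do I tune weakAnd?", "q2": "filtered ANN search"}
relevant_docs = {"q1": {"doc-17"}, "q2": {"doc-3", "doc-42"}}

def query_fn(query_text: str, top_k: int) -> dict:
    # Turn a query string into a Vespa query body.
    return {
        "yql": "select * from doc where userQuery()",
        "query": query_text,
        "hits": top_k,
        "ranking": "hybrid",
    }

evaluator = VespaEvaluator(
    queries=queries,
    relevant_docs=relevant_docs,
    vespa_query_fn=query_fn,
    app=app,
)
print(evaluator.run())  # dict of IR metrics such as accuracy@k, recall@k, nDCG, MRR
```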
In addition, we often encourage making use of Vespa's unique phased ranking framework.
The RAG blueprint implements this.
We have a cheap first-phase ranking, which is a learned linear combination of semantic and lexical features. For the second phase, we show how one can use a learning-to-rank approach to train a gradient boosted decision tree model for ranking, which is still the go-to approach for those doing learning to rank at scale, due to its efficiency and good quality.
And last but not least, the RAG blueprint exemplifies how one can use a tensor expression to choose which of the document's chunks should be returned for the top-ranking documents, while still taking into account document-level metadata such as recency or popularity for document rank.
The RAG blueprint has since become a resource we often reference, and we firmly believe it is a great resource for anyone looking to build quality RAG applications at scale.
Bonnie Chase:
Great. Yeah. And this RAG blueprint, I think, is really cool, because it's something we also developed alongside some of our customers, like Perplexity, to really ensure that we have the best practices for a high-quality, scalable RAG system, and we'll provide the resources for this as well. We have the blueprint that walks you step by step through the process of building. We have a video that gives you a high-level overview. And then we have those key best practices that Thomas mentioned as a separate asset, if you're not ready to build and you just want to see what those best practices are. So lots of good resources on this. Very excited about this RAG blueprint.
Now, that covers the feature updates for 2025 at a high level. We did move quickly today so that we could cover everything.
If you want to learn more, you can get updates about the latest changes to Vespa, events, and other things we hope you find useful in the Vespa newsletter. The links are provided here, and we can also drop them in the resources button on the right-hand side. And also, feel free to join our Slack community.
And if you are a Vespa user, we invite you to share your experience on G2. We won't be collecting your data or sharing your information. This is really just to leave a quick rating and tell others how Vespa has helped improve your search or AI workflows. We'd love to hear your story.
And this webinar really was our kickoff for having a regular cadence of product updates. So next year, you'll start seeing quarterly updates on what we've been releasing and updating in Vespa. We're excited to start sharing that, so keep an eye out for the invitations to those webinars. With that, once again, thank you for joining us. Thank you, Thomas, for providing details and answering questions, and we hope to see you at the next webinar.
Thanks, everyone.
Questions and answers from the end of the session.
Where can I find guidelines on which AI models to use when I want to stay within Vespa?
You can find this information in our documentation here.
Will there be an equivalent SDK for Java?
You can find that here.
What is the best/fastest way to learn about all the different options/features Vespa offers related to lexical and vector search?
Watch this video and follow along in GitHub. You can also learn more in our documentation.
How do we get started with layered ranking?
The best approach is to start with our RAG blueprint.
On what version were the lexical improvements released?
Version 8.503.27
Can you share benchmarking results for the lexical search improvements?
You can read about the lexical search improvements and benchmarking in this blog.
Are there plans to include k8s native features for the self-hosted version?
We are still in early stages, but there is definitely work being done in this area. We'll share more in Q1 of next year.
Is there a plan for adding support for 2GB+ ONNX models next year?
We still have work to do in terms of supporting ONNX models that are used for ranking (the ONNX models that need to be distributed to every content node), but for embedding models, which typically are a bit bigger, this is actually now supported. You can read more here.