Feature overview

Vespa is a complete platform for applications combining data and AI, online. By building such applications on Vespa you avoid the tedious integration work otherwise needed to assemble all the features you'll need, and you can be sure it will scale to any amount of traffic and data with good performance in production. To deliver that, Vespa provides a uniquely broad range of query capabilities, a powerful computation engine with great support for modern machine-learned models, hands-off operability, data management and application development support, and unbeatable performance and scalability.

Query capabilities in Vespa

Vespa supports querying by:

  • Vectors: Nearest neighbour, approximate (ANN) or exact, with a variety of distance metrics.
  • Structured data: Exact, substring and regexp, numerical ranges, geo-distance, predicates.
  • Text: Full text with tokenization and stemming, as well as positional information.

A document can have any number of fields of these types, all of which can be queried efficiently in the same query, and each field - even a vector field - can hold multiple values in the same document.

Operators querying fields in these ways can be combined freely by AND and OR in queries while still executing efficiently. This makes it possible to express a wide range of behavior, such as retrieving in vector spaces with filters or performing hybrid text and semantic retrieval.
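As a sketch of what such a combined query looks like, the snippet below builds the body of a Vespa search request that mixes approximate nearest-neighbor retrieval, text search and a structured filter. The schema and field names (`embedding`, `category`) are hypothetical; the YQL operators (`nearestNeighbor`, `userQuery`) and request parameters follow Vespa's query API.

```python
def hybrid_query(query_text, query_vector, category):
    """Build the body of a Vespa search request (sent as JSON over HTTP).

    Retrieves by vector similarity OR text match, restricted by a filter.
    Field names and the query tensor name are illustrative assumptions.
    """
    return {
        "yql": (
            "select * from sources * where "
            "(({targetHits: 100}nearestNeighbor(embedding, q_vec)) or userQuery()) "
            "and category contains @category"
        ),
        "query": query_text,                 # feeds the userQuery() operator
        "category": category,                # substituted for @category in the YQL
        "input.query(q_vec)": query_vector,  # query tensor used by nearestNeighbor
        "hits": 10,
    }

body = hybrid_query("running shoes", [0.1, 0.2, 0.3], "sportswear")
```

The AND over the filter term restricts both the ANN and the text branch, so the whole tree still executes as one efficient query.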

Example query tree with ANN and filter terms

In addition, Vespa provides a grouping language which lets queries specify how matches should be grouped and aggregated. This makes it possible to group matches by unique field values or buckets, and aggregate, count or compute values over the groups. Groupings can be arbitrarily deep and multiple groupings can be made in the same query. Even though grouping applies to all matches across all nodes, it executes efficiently as a distributed computation.
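A grouping specification is appended to the query itself. The sketch below shows the shape of such a query; the field names (`brand`, `price`) are hypothetical, while `all`/`group`/`each`/`output` and the aggregators are part of Vespa's grouping language.

```python
# Illustrative sketch: a query that groups matches by the (hypothetical)
# "brand" field, keeps the 5 largest groups, and for each group outputs
# the match count and the average of a "price" field.
yql = (
    "select * from sources * where userQuery() "
    "| all(group(brand) max(5) each(output(count(), avg(price))))"
)
```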

Vespa's computation engine

Once data is selected by a query, most applications need to compute over each data element. In a search application this can be a relevance model deciding which of all the matches should eventually be surfaced; in a recommendation system it is a recommender deciding which candidate items to recommend. More generally, it can be any set of numerical computations using a matched document and the query as input. The result of such computations can be used to prioritize which items to return (ranking), or simply be returned with the matches.

Computation as mathematical expressions

Computation in Vespa is specified as mathematical expressions over scalars and tensors. These expressions may be written by hand or imported into Vespa from a machine-learning tool (TensorFlow, LightGBM, XGBoost, or any ONNX-compatible tool). Computations over data happen locally on the nodes storing the data. This is important because it allows as much parallel computation as there are data nodes, and avoids sending the data over the network for computation, which does not scale.
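Such expressions are declared in rank profiles in the application's schemas. The fragment below is a hypothetical example; the field names, tensor dimension and profile name are illustrative assumptions, while the `rank-profile`, `inputs` and `first-phase` constructs are Vespa schema syntax.

```
# Hypothetical rank profile: a hand-written expression combining a
# text match feature with vector closeness.
rank-profile hybrid inherits default {
    inputs {
        query(q_vec) tensor<float>(d[384])
    }
    first-phase {
        expression: bm25(title) + closeness(field, embedding)
    }
}
```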

Model deployment to content nodes

The features these expressions compute over are either fields of the documents, values passed with the query, data from the application package, or a selection from the wide range of built-in features in Vespa combining data from both queries and documents, such as text match features.

Sparse and dense tensors

Tensors are generalizations of vectors (1-d) and matrices (2-d) to any number of dimensions. Since features in Vespa can be tensors, and the expression language provides operators for computing over tensors, it is possible to express a wide range of computations over collections of numbers - for example, to run inference in modern transformer-based deep learning language models.

Tensor dimensions in Vespa can - uniquely - be either sparse or dense, and tensor computations work the same with both kinds. Sparse dimensions accept any strings as keys. This makes it possible to represent both array-like and map-like data as tensors - for example, a map of vectors - and still compute efficiently, which greatly increases the range of computations that can be done with tensors in Vespa compared to most other tools supporting tensors.
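To make the sparse/dense distinction concrete, here is a plain-Python sketch (not Vespa's engine) of what a tensor with one sparse and one dense dimension represents, and a computation over it. In Vespa the document tensor would have type `tensor<float>(tag{},x[3])` and the computation would be the expression `sum(doc * query, x)`; the field and dimension names are illustrative assumptions.

```python
# A sparse dimension ("tag") maps arbitrary string keys to values;
# here each key holds a dense vector of fixed size 3 (dimension x[3]).
doc = {
    "sports": [1.0, 0.0, 2.0],
    "news":   [0.5, 1.0, 0.0],
}
query = [2.0, 1.0, 0.5]  # dense query vector, dimension x[3]

# Emulates sum(doc * query, x): reduce (sum over) the dense dimension,
# keep the sparse dimension, yielding one score per string key.
scores = {tag: sum(a * b for a, b in zip(vec, query)) for tag, vec in doc.items()}
# scores == {"sports": 3.0, "news": 2.0}
```

This is the "map of vectors" case from the paragraph above: array-like data (the dense vectors) nested inside map-like data (the string keys), computed over in one expression.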

Multi-phase inference

Multi-phase inference can be used to invest more computational power in the best candidate items by doing an initial selection using a simpler model, followed by computing a more expensive model for the best candidates.
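A hypothetical rank profile sketching this pattern: a cheap first-phase expression scores all matches, and the expensive second-phase model is evaluated only for the best candidates. The model and field names are illustrative assumptions; `first-phase`, `second-phase` and `rerank-count` are Vespa schema syntax.

```
rank-profile phased inherits default {
    first-phase {
        expression: nativeRank(title)   # cheap model, evaluated for all matches
    }
    second-phase {
        rerank-count: 100               # only the best 100 candidates per node
        expression: onnx(my_model)      # expensive model, e.g. an imported ONNX net
    }
}
```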

Operability and data management in Vespa

Vespa is engineered to be safe and easy to operate at scale. Operating distributed stateful systems that deliver consistent low latency and high availability in all scenarios is never easy - consider using the Vespa Cloud to have that done for you - but Vespa automates all routine tasks so that they can be performed safely in production with no service disruption and little human involvement. This includes tasks involved in data and node management, system configuration and application development.

Data management

Data in Vespa is automatically distributed over the available nodes in the cluster, with the configured redundancy factor, and with no manual decision-making needed. Nodes can be added to (or removed from) clusters at any time: Vespa will redistribute content in the background without impacting query or write traffic.

Growing a cluster in two dimensions

This functionality is based on the CRUSH algorithm and ensures that content will be near uniformly distributed over nodes while also ensuring minimal data redistribution when there are changes to the set of cluster nodes.
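The key property can be illustrated with a short sketch. This is not Vespa's CRUSH-based algorithm, but rendezvous (highest-random-weight) hashing, which shares the property described above: content spreads near uniformly over nodes, and adding a node moves only the documents that land on the new node.

```python
import hashlib

def place(doc_id, nodes):
    """Assign a document to the node with the highest hash weight for it."""
    def weight(node):
        return hashlib.sha256(f"{node}:{doc_id}".encode()).hexdigest()
    return max(nodes, key=weight)

nodes = ["node0", "node1", "node2"]
docs = [f"doc{i}" for i in range(1000)]
before = {d: place(d, nodes) for d in docs}
after = {d: place(d, nodes + ["node3"]) for d in docs}

# Minimal redistribution: the only documents that move are the ones
# the new node now owns - no reshuffling among the existing nodes.
moved = [d for d in docs if before[d] != after[d]]
assert all(after[d] == "node3" for d in moved)
```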

Node management

Nodes in Vespa are automatically monitored and routed around should they fail. In stateful clusters this includes redistributing data in the background to rebuild the configured redundancy level when a node fails.

Document redistribution after a node failure

The upshot is that Vespa keeps working when nodes fail, with no need for manual intervention beyond occasionally adding capacity to replace failed nodes.

System configuration

A Vespa system instance is a collection of clusters which together realize the functionality of an application. Vespa instances are always configured entirely by a high-level specification of the system - the application package - while the detailed configuration of the nodes and processes involved is derived automatically by Vespa itself. This makes it easy to create Vespa systems of any size, and hard to create an incorrect configuration.
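The heart of an application package is its services.xml, which declares the clusters of the system. A hypothetical minimal example, with one stateless container cluster and one content cluster (the ids, document type and node counts are illustrative assumptions):

```xml
<services version="1.0">
    <container id="default" version="1.0">
        <search/>
        <document-api/>
        <nodes count="2"/>
    </container>
    <content id="mydata" version="1.0">
        <redundancy>2</redundancy>
        <documents>
            <document type="mydoc" mode="index"/>
        </documents>
        <nodes count="3"/>
    </content>
</services>
```

Everything below this level - process configuration, data distribution, routing - is derived from such a specification by Vespa itself.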

Application development

Any change to a Vespa system is made simply by changing the application package and deploying it to Vespa:

Application package deployment to multiple nodes

Vespa will apply most changes immediately, without impacting queries or write traffic. If a change is potentially dangerous, or requires further actions such as restarting nodes, Vespa will report this so the operator can decide what to do and carry out those actions. On the Vespa Cloud these operations are also automated, providing full, safe continuous deployment of all changes.

Performance and scalability

Vespa is engineered to perform computations over data in less than one hundred milliseconds, and to scale to any data size and query load. Due to the large scale of some of the systems running Vespa, it has been cost-effective to invest many person-years into optimizing Vespa at all levels, from fundamental architectural choices ensuring parallelization, down to low-level optimization of core algorithms and data structures to execute without cache misses.

Distributed and scalable computation

Most computation happens on the nodes storing the content. This ensures that the amount of computation that can be done in fixed time scales with the number of content nodes, and that network bandwidth never becomes a bottleneck.

Query fanout to multiple nodes

If you can express a computation in Vespa, you can be sure that it can be scaled nearly indefinitely to higher data volumes, lower latency, and higher query load by adding more nodes.

Efficient query execution

Vespa uses a variety of index structures to find the data specified by a query quickly: dictionaries and posting lists for text, B-trees for structured data, and HNSW indexes for vectors. The HNSW algorithm in Vespa has been modified to support true real-time updates and efficient searching with query filters.

While the stateless container nodes in Vespa are written in Java, the content nodes that store and compute over data are implemented in C++ to efficiently manage any amount of memory and make use of hardware-specific optimizations. Queries are executed in parallel over content nodes, but parallelization does not stop there: Vespa will also parallelize a query over a configurable number of cores on each node to bring latency down further.

Real-time and high throughput writes

In addition to queries, write performance is important in most applications. Vespa can deliver a write throughput of tens of thousands of operations per second per content node in most cases. This write load can be sustained indefinitely, and does not impact query execution beyond consuming a fixed fraction of each node's resources. All writes are applied in real time. Vespa achieves this by distributing writes to nodes over a distributed, asynchronous message bus, and applying them using only lock-free in-memory data structures, backed by a persisted transaction log for replay. Disk structures are maintained in the background, with the goal of spreading the maintenance load evenly over time to keep resource usage stable. In addition to adding, changing and removing entire documents, Vespa supports updating just selected fields, which can be extremely cheap when these fields hold structured values.
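As a sketch of such a partial update, the snippet below builds the JSON body of an update operation that changes two fields of an existing document without rewriting the rest of it. The document type and field names are hypothetical; the `assign` and `increment` operations are part of Vespa's document JSON format.

```python
# Illustrative partial update in Vespa's document JSON format
# (as used when feeding updates; field names are assumptions).
update = {
    "update": "id:mynamespace:product::123",
    "fields": {
        "price":      {"assign": 79},    # overwrite a single field value
        "sales_rank": {"increment": 1},  # arithmetic update on a numeric field
    },
}
```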