Migrating from Elasticsearch

This guide shows how to move data from Elasticsearch (ES) to Vespa. By the end, you will have exported documents from Elasticsearch, generated a deployable Vespa application package, and tested the application with documents and queries.

Take a look at the Elasticsearch / Solr / Vespa comparison, and review next steps for how to optimize for Vespa's features.

ES_Vespa_parser.py is provided for converting Elasticsearch data and index mappings to Vespa data and configuration. It is a basic script with minimal error checking, designed for a simple export. Modify it as needed for your application.

Feed a sample ES index

Set up an index with 1,000 sample documents using getting-started-index, or skip this part if you already have an index.

Run Elasticsearch and wait for Elasticsearch to start:

$ docker network create --driver bridge esnet

$ docker run -d --rm --name esnode --network esnet -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" \
  docker.elastic.co/elasticsearch/elasticsearch:7.10.2

$ while [[ "$(curl -s -o /dev/null -w '%{http_code}' localhost:9200)" != "200" ]]; do sleep 5; done  # wait for ES to start
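
The wait loop above can also be written as a small script, which is handy when the migration is automated. This is a minimal sketch and not part of the Vespa or Elasticsearch tooling: the names `http_status` and `wait_for_ok` are hypothetical, and the 5 second interval simply mirrors the shell loop.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code for url, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (for example 503).
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout and similar.
        return None


def wait_for_ok(url, probe=http_status, interval=5.0, max_attempts=60, sleep=time.sleep):
    """Poll url until probe returns 200. Return True on success, False after max_attempts."""
    for _ in range(max_attempts):
        if probe(url) == 200:
            return True
        sleep(interval)
    return False


# Example (assumes Elasticsearch is published on localhost:9200 as above):
# wait_for_ok("http://localhost:9200")
```

Unlike the shell loop, this version gives up after `max_attempts` polls instead of waiting forever.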

Download the test data and feed it to the running Elasticsearch instance:

$ curl 'https://raw.githubusercontent.com/elastic/elasticsearch/7.10/docs/src/test/resources/accounts.json' \
  > accounts.json

$ curl -H "Content-Type:application/json" --data-binary @accounts.json 'localhost:9200/bank/_bulk?pretty&refresh'

Verify that the index has 1,000 documents:

$ curl 'localhost:9200/_cat/indices?v'
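
To check the count in a script rather than by eye, the cat API can return JSON when `format=json` is added to the query string. The sketch below assumes that output shape (a list of rows with `index` and `docs.count` columns, the count being a string); the function name `doc_count` is hypothetical.

```python
import json


def doc_count(cat_indices_json, index):
    """Return the document count for index from `_cat/indices?format=json` output."""
    for row in json.loads(cat_indices_json):
        if row.get("index") == index:
            # The cat API reports numbers as strings.
            return int(row["docs.count"])
    raise KeyError(f"index not found: {index}")


# Example (illustrative row, trimmed to the columns used here):
# doc_count('[{"index": "bank", "docs.count": "1000"}]', "bank")
```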

Dump documents from Elasticsearch

This guide uses ElasticDump to dump the index contents and the index mapping.

Dump the documents and mappings, then delete the running Elasticsearch container and the Docker network:

$ docker run --rm --name esdump --network esnet -v "$PWD":/dump -w /dump elasticdump/elasticsearch-dump \
--input=http://esnode:9200/bank --output=bank_data.json    --type=data
$ docker run --rm --name esdump --network esnet -v "$PWD":/dump -w /dump elasticdump/elasticsearch-dump \
--input=http://esnode:9200/bank --output=bank_mapping.json --type=mapping
$ docker rm -f esnode
$ docker network remove esnet
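
Before converting, it can be useful to inspect the dump. The sketch below assumes that bank_data.json holds one JSON object per line, which is how the data dump is expected to look here; verify this against your own file. The name `read_dump` is hypothetical.

```python
import json


def read_dump(lines):
    """Yield one parsed record per non-empty line of a line-delimited JSON dump."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Example (assumes bank_data.json is in the current directory):
# with open("bank_data.json", encoding="utf-8") as f:
#     print(sum(1 for _ in read_dump(f)))
```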

Generate Vespa documents and Application Package

Use ES_Vespa_parser.py to generate Vespa documents and configuration:

$ curl 'https://raw.githubusercontent.com/vespa-engine/vespa/master/config-model/src/main/python/ES_Vespa_parser.py' \
  > ES_Vespa_parser.py

$ python3 ./ES_Vespa_parser.py --application_name bank bank_data.json bank_mapping.json

This generates documents in documents.json (see JSON format), where each document has an ID like id:bank:_doc::1. It also generates a bank folder with a Vespa application package.

/bank
      │
      ├── documents.json
      ├── hosts.xml
      ├── services.xml
      └── /schemas
            ├── _doc.sd
            └── ...

Deploy Vespa

Install the Vespa CLI. This example uses Homebrew; you can also download a vespa-cli release from GitHub.

$ brew install vespa-cli

Set the CLI target environment. The target can be local or cloud (Vespa Cloud). For a local deployment using the Docker image, use:

$ vespa config set target local

For cloud deployment using Vespa Cloud, use:

$ vespa config set target cloud
$ vespa config set application tenant-name.myapp.default
$ vespa auth login
$ vespa auth cert

See also the Vespa Cloud getting started guide. You can switch between local and cloud deployment by changing the config target.

Run the Vespa container, forwarding the deploy API port (19071) and the data plane serving port (8080):

$ docker run -m 4G --detach --name vespa --hostname vespa-es-tutorial \
  --publish 8080:8080 --publish 19071:19071 vespaengine/vespa

Verify that the configuration service (deploy API) is ready:

$ vespa status deploy --wait 300

Deploy the application package:

$ vespa deploy --wait 300 bank

Wait for the application endpoint to become available:

$ vespa status --wait 300

Index the documents dumped from Elasticsearch. First, download the Vespa feed client:

$ FEED_CLI_REPO="https://repo1.maven.org/maven2/com/yahoo/vespa/vespa-feed-client-cli" \
    && FEED_CLI_VER=$(curl -Ss "${FEED_CLI_REPO}/maven-metadata.xml" | sed -n 's/.*<release>\(.*\)<.*>/\1/p') \
    && curl -SsLo vespa-feed-client-cli.zip ${FEED_CLI_REPO}/${FEED_CLI_VER}/vespa-feed-client-cli-${FEED_CLI_VER}-zip.zip \
    && unzip -o vespa-feed-client-cli.zip

Index documents:

$ ./vespa-feed-client-cli/vespa-feed-client \
  --file bank/documents.json --endpoint http://localhost:8080

Querying Vespa

Get a document using the Document API:

$ curl -s http://localhost:8080/document/v1/bank/_doc/docid/1
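
The path in the request above follows the pattern /document/v1/<namespace>/<document type>/docid/<id>. A minimal sketch of building that URL for other documents is shown here; the function name `document_v1_url` is hypothetical, and percent-encoding each path segment is a precaution added for IDs that contain reserved characters.

```python
from urllib.parse import quote


def document_v1_url(endpoint, namespace, doctype, user_id):
    """Build a Document API URL of the form used in the curl example above."""
    # Percent-encode each segment so that IDs containing '/' or spaces stay valid.
    ns, dt, uid = (quote(part, safe="") for part in (namespace, doctype, user_id))
    return f"{endpoint.rstrip('/')}/document/v1/{ns}/{dt}/docid/{uid}"


print(document_v1_url("http://localhost:8080", "bank", "_doc", "1"))
# prints http://localhost:8080/document/v1/bank/_doc/docid/1
```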

Use the Query API to count documents; look for "totalCount":1000 in the output:

$ curl -H "Content-Type: application/json" \
  --data '{"yql" : "select * from sources * where true"}' \
  http://localhost:8080/search/

Run a simple text query against the firstname field:

$ curl -H "Content-Type: application/json" \
  --data '{"yql" : "select firstname,lastname from sources * where firstname contains \"amber\""}' \
  http://localhost:8080/search/
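
The same request can be issued from a script. The sketch below builds the POST request shown in the curl example without sending it; `build_query_request` and `firstname_query` are hypothetical helper names, and the escaping of backslashes and double quotes in the search term is an assumption about what a YQL string literal needs.

```python
import json
import urllib.request


def firstname_query(name):
    """Return the YQL used above, with the search term embedded as a quoted string."""
    # Assumption: backslashes and double quotes must be escaped inside the literal.
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')
    return f'select firstname,lastname from sources * where firstname contains "{escaped}"'


def build_query_request(yql, endpoint="http://localhost:8080"):
    """Build (but do not send) a JSON POST request for the /search/ endpoint."""
    body = json.dumps({"yql": yql}).encode("utf-8")
    return urllib.request.Request(
        f"{endpoint}/search/",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Example (assumes the Vespa container from this guide is running):
# with urllib.request.urlopen(build_query_request(firstname_query("amber"))) as resp:
#     print(json.load(resp)["root"]["fields"]["totalCount"])
```

Using `json.dumps` for the body avoids the manual backslash escaping needed in the shell version.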

It's also possible to query using the Vespa CLI:

$ vespa query \
    'yql=select firstname,lastname from sources * where true'

Next steps

Review the differences in document records. The Elasticsearch record is shown first, followed by the Vespa equivalent:

{
    "_index": "bank",
    "_type": "_doc",
    "_id": "1",
    "_score": 1,
    "_source": {
      "account_number": 1,
      "balance": 39225,
      "firstname": "Amber",
      "lastname": "Duke",
      "age": 32,
      "gender": "M",
      "address": "880 Holmes Lane",
      "employer": "Pyrami",
      "email": "amberduke@pyrami.com",
      "city": "Brogan",
      "state": "IL"
    }
}

{
    "put": "id:bank:_doc::1",
    "fields": {
      "account_number": 1,
      "balance": 39225,
      "firstname": "Amber",
      "lastname": "Duke",
      "age": 32,
      "gender": "M",
      "address": "880 Holmes Lane",
      "employer": "Pyrami",
      "email": "amberduke@pyrami.com",
      "city": "Brogan",
      "state": "IL"
    }
}
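
The relationship between the two records can be sketched in a few lines. This mirrors the example above only and is not the logic of ES_Vespa_parser.py; the function name `to_vespa_put` is hypothetical, and using `_index` as the namespace and `_type` as the document type is inferred from the sample ID.

```python
def to_vespa_put(hit):
    """Convert one Elasticsearch record (as shown above) to a Vespa put operation."""
    # Assumption: namespace = _index, document type = _type, user ID = _id.
    doc_id = f"id:{hit['_index']}:{hit['_type']}::{hit['_id']}"
    # _score is search metadata and is dropped; _source becomes the fields.
    return {"put": doc_id, "fields": dict(hit["_source"])}


hit = {
    "_index": "bank",
    "_type": "_doc",
    "_id": "1",
    "_score": 1,
    "_source": {"account_number": 1, "firstname": "Amber"},
}
print(to_vespa_put(hit)["put"])
# prints id:bank:_doc::1
```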

The document ID id:bank:_doc::1 is composed of:

  • namespace: bank
  • schema: _doc
  • id: 1
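
The parts listed above can be pulled out of an ID string programmatically. In this minimal sketch, `parse_document_id` and `DocumentId` are hypothetical names; the empty field between the two colons is reserved for optional key/value modifiers, which this guide does not use.

```python
from collections import namedtuple

DocumentId = namedtuple("DocumentId", "namespace doctype modifiers user_id")


def parse_document_id(doc_id):
    """Split an ID of the form id:<namespace>:<type>:<modifiers>:<user id>."""
    # Limit the split so that colons inside the user-specified part are kept.
    parts = doc_id.split(":", 4)
    if len(parts) != 5 or parts[0] != "id":
        raise ValueError(f"not a document ID: {doc_id!r}")
    return DocumentId(*parts[1:])


print(parse_document_id("id:bank:_doc::1"))
# prints DocumentId(namespace='bank', doctype='_doc', modifiers='', user_id='1')
```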

Read more in Documents and Schemas.

The schema is the key Vespa configuration file where field types and ranking are configured.

The schema (found in _doc.sd) also has indexing settings, for example:

search _doc {
    document _doc {
        field account_number type long {
            indexing: summary | attribute
        }
        field address type string {
            indexing: summary | index
        }
        ...
    }
}

These settings impact both performance and how fields are matched. For example, account_number above uses the attribute keyword, which makes the field available for sorting, ranking, and grouping, but which by default does not build data structures for fast search. Read more in attributes and the practical search performance guide.
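
To illustrate how a mapping type might drive these indexing settings, here is a small sketch. It is not the logic of ES_Vespa_parser.py: the table `INDEXING_BY_ES_TYPE` and the function `field_definition` are hypothetical, and the two entries only reproduce the schema excerpt above under the assumption that account_number is mapped as long and address as text in Elasticsearch.

```python
# Hypothetical lookup: Elasticsearch mapping type -> (Vespa field type, indexing statement).
INDEXING_BY_ES_TYPE = {
    "long": ("long", "summary | attribute"),  # numeric: sorting, ranking, grouping
    "text": ("string", "summary | index"),    # free text: tokenized text matching
}


def field_definition(name, es_type):
    """Render one schema field block in the style of the excerpt above."""
    vespa_type, indexing = INDEXING_BY_ES_TYPE[es_type]
    return (
        f"field {name} type {vespa_type} {{\n"
        f"    indexing: {indexing}\n"
        f"}}"
    )


print(field_definition("account_number", "long"))
```

A table like this is a natural place to adjust the generated schema, for example by choosing attribute instead of index for short string fields that are only used for filtering or grouping.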

Notes

The Vespa Docker image used in this guide is the same as the one used in production. Read more about Vespa components in the Vespa overview.

Tear down the Vespa container:

$ docker rm -f vespa