Migrating from Elasticsearch

This guide shows how to move data from Elasticsearch (ES) to Vespa. By the end, you will have exported documents from Elasticsearch, generated a deployable Vespa application package, and tested it with documents and queries.

Take a look at the Elasticsearch / Solr / Vespa comparison, and review next steps for how to optimize for Vespa's features.

ES_Vespa_parser.py is provided for basic conversion of ES data and mappings to Vespa data and configuration. It is a simple script with minimal error checking, designed for a straightforward export. Modify it as needed for your application.

Feed a sample ES index

Set up an index with 1,000 sample documents using getting-started-index, or skip this part if you already have an index:

$ docker network create --driver bridge esnet

$ docker run --rm --name esnode --network esnet -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" \
  docker.elastic.co/elasticsearch/elasticsearch:7.10.2

# Let Elasticsearch run while loading and dumping, open a new window for the below:

$ curl 'https://raw.githubusercontent.com/elastic/elasticsearch/master/docs/src/test/resources/accounts.json' \
  > accounts.json

$ curl -H "Content-Type:application/json" --data-binary @accounts.json 'localhost:9200/bank/_bulk?pretty&refresh'

$ curl 'localhost:9200/_cat/indices?v'

The last command should indicate 1,000 documents in the index.
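To cross-check that number against the source file, the documents in the bulk file can be counted before feeding. The following is a minimal sketch, assuming the file contains only index actions, each followed by one source line (as in accounts.json). The function name count_bulk_documents is hypothetical.

```python
import json


def count_bulk_documents(lines):
    """Count documents in a bulk file made of (action, source) line pairs."""
    entries = [json.loads(line) for line in lines if line.strip()]
    if len(entries) % 2 != 0:
        raise ValueError("expected an even number of non-empty lines")
    actions = entries[0::2]  # every other line is an action line
    if not all("index" in action for action in actions):
        raise ValueError("this sketch only handles 'index' actions")
    return len(actions)


# Two-document sample in the same layout as accounts.json
sample = [
    '{"index": {"_id": "1"}}',
    '{"account_number": 1, "firstname": "Amber"}',
    '{"index": {"_id": "6"}}',
    '{"account_number": 6, "firstname": "Hattie"}',
]
sample_count = count_bulk_documents(sample)
print(sample_count)
```

For the real file, pass an open file object, for example count_bulk_documents(open("accounts.json")).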

Dump documents from ES

Refer to ElasticDump for details.

$ cat > dumpit.sh << EOF
npm install elasticdump
/dump/node_modules/.bin/elasticdump --input=http://esnode:9200/bank --output=bank_data.json    --type=data
/dump/node_modules/.bin/elasticdump --input=http://esnode:9200/bank --output=bank_mapping.json --type=mapping
EOF

$ docker run --rm --name esdump --network esnet -v "$PWD":/dump -w /dump node:alpine sh dumpit.sh

$ docker network remove esnet   # Stop Elasticsearch in the other container before removing the network

Generate Vespa documents and Application Package

Use ES_Vespa_parser.py to generate Vespa documents and configuration:

$ curl 'https://raw.githubusercontent.com/vespa-engine/vespa/master/config-model/src/main/python/ES_Vespa_parser.py' \
  > ES_Vespa_parser.py

$ python3 ./ES_Vespa_parser.py --application_name bank bank_data.json bank_mapping.json

This generates documents in documents.json (see JSON format), where each document has an ID of the form id:bank:_doc::1. It also generates a bank folder containing a Vespa application package:
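To illustrate the kind of transformation involved, here is a minimal sketch that maps one exported ES record to a Vespa put operation. It is not the logic of ES_Vespa_parser.py. The function name es_hit_to_vespa_put is hypothetical, and passing the namespace explicitly is an assumption made for this example.

```python
import json


def es_hit_to_vespa_put(hit: dict, namespace: str) -> dict:
    """Map one exported ES record to a Vespa put operation (illustrative only)."""
    document_type = hit["_type"]
    user_id = hit["_id"]
    return {
        # id:<namespace>:<document type>::<id>
        "put": f"id:{namespace}:{document_type}::{user_id}",
        # The ES _source object becomes the Vespa fields object
        "fields": dict(hit["_source"]),
    }


sample_line = (
    '{"_index": "bank", "_type": "_doc", "_id": "1", "_score": 1, '
    '"_source": {"account_number": 1, "firstname": "Amber"}}'
)
operation = es_hit_to_vespa_put(json.loads(sample_line), namespace="bank")
print(json.dumps(operation, indent=2))
```

The ES metadata fields _index and _score are dropped in this sketch. Only _type, _id and _source carry over.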

/bank
  ├── documents.json
  ├── hosts.xml
  ├── services.xml
  └── /schemas
        ├── _doc.sd
        └── ...

Deploy and test

This tutorial has been tested with a Docker container with 6GB RAM. Start the Vespa container:

$ docker run -m 6G --detach --name vespa --hostname vespa-es-tutorial \
  --privileged --volume `pwd`:/app \
  --publish 8080:8080 --publish 19112:19112 vespaengine/vespa

Wait for the configuration server to start, indicated by a 200 OK response:

$ docker exec vespa bash -c 'curl -s --head http://localhost:19071/ApplicationStatus'

Deploy the application package:

$ docker exec vespa bash -c '/opt/vespa/bin/vespa-deploy prepare /app/bank && \
  /opt/vespa/bin/vespa-deploy activate'

Ensure the application is active - wait for a 200 OK response:

$ curl -s --head http://localhost:8080/ApplicationStatus

The Vespa node is now configured and ready for use.
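The two status checks above can also be automated instead of re-running curl by hand. The following is a minimal sketch of a polling loop using only the Python standard library. The function names, the retry count and the delay are hypothetical choices for this example, not part of Vespa.

```python
import time
import urllib.error
import urllib.request


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the status of a HEAD request, or 0 if the server is unreachable."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # the server answered, but not with a success status
    except OSError:
        return 0  # connection refused, timeout and similar failures


def wait_for_ok(url, probe=http_status, attempts=60, delay_seconds=5.0, sleep=time.sleep):
    """Poll url until probe returns 200; give up after the given number of attempts."""
    for _ in range(attempts):
        if probe(url) == 200:
            return True
        sleep(delay_seconds)
    return False


# Example call against the container started above (run on the host):
#   wait_for_ok("http://localhost:8080/ApplicationStatus")
```

The probe and sleep parameters exist so the loop can be exercised without a running server. Note that port 19071 is not published by the docker run command above, so the configuration server check would have to run inside the container.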

Feed documents using the vespa-http-client:

$ docker exec vespa bash -c 'java -jar /opt/vespa/lib/jars/vespa-http-client-jar-with-dependencies.jar \
  --verbose --file /app/bank/documents.json --host localhost --port 8080'

Get a document using the Document API:

$ curl -s http://localhost:8080/document/v1/bank/_doc/docid/1
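The path in this request follows from the parts of the document ID. A minimal sketch of building the URL, assuming the endpoint used in this guide (the function name document_v1_url is hypothetical):

```python
from urllib.parse import quote


def document_v1_url(namespace: str, document_type: str, user_id: str,
                    endpoint: str = "http://localhost:8080") -> str:
    """Build a Document API URL of the form used in the curl command above."""
    # Percent-encode each part so that characters like "/" cannot alter the path
    namespace, document_type, user_id = (
        quote(part, safe="") for part in (namespace, document_type, user_id)
    )
    return f"{endpoint}/document/v1/{namespace}/{document_type}/docid/{user_id}"


url = document_v1_url("bank", "_doc", "1")
print(url)
# prints http://localhost:8080/document/v1/bank/_doc/docid/1
```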

Use the Query API to count documents and look for "totalCount": 1000 in the output. Then run a text query:

$ curl -H "Content-Type: application/json" \
  --data '{"yql" : "select * from sources * where sddocname contains \"_doc\";"}' \
  http://localhost:8080/search/

$ curl -H "Content-Type: application/json" \
  --data '{"yql" : "select * from sources * where firstname contains \"amber\";"}' \
  http://localhost:8080/search/
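When such queries are issued from a program, building the JSON body with a JSON library avoids the manual quote escaping needed in the shell. The following is a minimal sketch with hypothetical helper names. It refuses terms containing quotes or backslashes rather than guessing at escaping rules.

```python
import json
import urllib.request


def contains_query_body(field: str, term: str) -> str:
    """Build the JSON body for a 'contains' query like the ones above."""
    if '"' in term or "\\" in term:
        raise ValueError("quotes and backslashes are not handled by this sketch")
    yql = f'select * from sources * where {field} contains "{term}";'
    return json.dumps({"yql": yql})


def build_search_request(body: str, endpoint: str = "http://localhost:8080"):
    """Prepare (but do not send) a POST request to the search endpoint."""
    return urllib.request.Request(
        f"{endpoint}/search/",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


body = contains_query_body("firstname", "amber")
request = build_search_request(body)
print(body)

# To send it against the running Vespa container:
#   with urllib.request.urlopen(request) as response:
#       print(response.read().decode("utf-8"))
```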

Next steps

Review the differences in document records. The Elasticsearch record is shown first, followed by the corresponding Vespa record:

{
    "_index": "bank",
    "_type": "_doc",
    "_id": "1",
    "_score": 1,
    "_source": {
      "account_number": 1,
      "balance": 39225,
      "firstname": "Amber",
      "lastname": "Duke",
      "age": 32,
      "gender": "M",
      "address": "880 Holmes Lane",
      "employer": "Pyrami",
      "email": "amberduke@pyrami.com",
      "city": "Brogan",
      "state": "IL"
    }
}

{
    "put": "id:bank:_doc::1",
    "fields": {
      "account_number": 1,
      "balance": 39225,
      "firstname": "Amber",
      "lastname": "Duke",
      "age": 32,
      "gender": "M",
      "address": "880 Holmes Lane",
      "employer": "Pyrami",
      "email": "amberduke@pyrami.com",
      "city": "Brogan",
      "state": "IL"
    }
}

The ID field id:bank:_doc::1 is composed of:

  • namespace: bank
  • document type: _doc
  • id: 1
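As an illustration of this structure, the following minimal sketch splits a document ID into its parts. It assumes the general form id:<namespace>:<document type>:<key/value pair>:<user-specified id>, where the key/value part is empty in this guide (hence the double colon). The names DocumentId and parse_document_id are hypothetical.

```python
from typing import NamedTuple


class DocumentId(NamedTuple):
    namespace: str
    document_type: str
    key_values: str  # empty for the IDs generated in this guide
    user_id: str


def parse_document_id(document_id: str) -> DocumentId:
    """Split an ID such as id:bank:_doc::1 into its components."""
    # Split at most 4 times so colons inside the user-specified part are kept
    parts = document_id.split(":", 4)
    if len(parts) != 5 or parts[0] != "id":
        raise ValueError(f"not a valid document ID: {document_id!r}")
    return DocumentId(*parts[1:])


parsed = parse_document_id("id:bank:_doc::1")
print(parsed)
# prints DocumentId(namespace='bank', document_type='_doc', key_values='', user_id='1')
```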

Namespaces are for data management. They are normally not needed, so hard-code one for now. The namespace is not used in the Query API, because query routing is per document type. Each document type has a schema, and Vespa queries all document types by default. Read more in Documents and Schemas.

All document types are stored in the same content cluster by default. Many applications have only one document type. Read more in Multiple Document Types.

The schema is the key Vespa configuration file. This is where field types are configured and ranking expressions are defined. A ranking expression is the computation performed over the corpus when running a query. The default is text search (for example, ...where firstname contains "amber" above), but a ranking expression can be any mathematical expression. Read more in Ranking.

The schema (found in _doc.sd) also has indexing settings, for example:

search _doc {
    document _doc {
        field account_number type long {
            indexing: summary | attribute
        }
        field address type string {
            indexing: summary | index
        }
        ...
    }
}

Read more in Indexing.
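To make the relationship between ES mappings and the schema fields above concrete, here is a minimal sketch that renders a field definition from an ES field type. The lookup table is hypothetical and covers only the two cases visible in the excerpt (assuming account_number is mapped as long and address as text in ES). It is not the mapping logic of ES_Vespa_parser.py.

```python
# Hypothetical lookup: ES mapping type -> (Vespa field type, indexing statement)
TYPE_TABLE = {
    "long": ("long", "summary | attribute"),
    "text": ("string", "summary | index"),
}


def schema_field(name: str, es_type: str) -> str:
    """Render one schema field definition for a supported ES type."""
    if es_type not in TYPE_TABLE:
        raise ValueError(f"no rule for ES type {es_type!r} in this sketch")
    vespa_type, indexing = TYPE_TABLE[es_type]
    return (
        f"field {name} type {vespa_type} {{\n"
        f"    indexing: {indexing}\n"
        f"}}"
    )


account_field = schema_field("account_number", "long")
address_field = schema_field("address", "text")
print(account_field)
print(address_field)
```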

Notes

The Vespa Docker image used in this guide is the same one used in production serving instances. It runs the full Vespa application with all features. Read more about Vespa components in the Overview.