Indexing bulk documents to Elasticsearch using jq

Hey, all! I recently started using Elasticsearch and I have to tell you, I love it already! So here’s a blog post focusing on importing/indexing a JSON file into Elasticsearch.

If you have a lot of documents to index, you can use Elasticsearch's Bulk API to send them in batches. However, you need to follow the bulk format for the request to succeed; otherwise you might come across a "Malformed content, found extra data after parsing: START_OBJECT" error. The Bulk API expects the following newline-delimited JSON (NDJSON) structure:

{ "index" : { "_index" : "test", "_id" : "1" } } 
{ "field1" : "value1" }
{ "index" : { "_index" : "test", "_id" : "2" } }
{ "field2" : "value2" }

Check out the official Bulk API documentation for further reference. Note in particular that the request body must end with a newline character.

Now, you can use the jq tool to convert your JSON file into the bulk format on the command line. jq is a lightweight and flexible command-line JSON processor. To use it, first make sure you have jq installed.

  • On Debian systems you can install it via sudo apt-get install jq
  • On macOS, you can install it via brew install jq
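Once installed, you can quickly check that jq is available on your PATH (the exact version string will vary depending on your system):

jq --version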

Then, execute the following command to get a new JSON file in a bulk format.

cat info.json | jq -c '{"index": {"_index": "students", "_type": "doc"}}, .' > students.json

Here, we pipe the contents of the info.json file to jq with the -c option, which produces compact rather than pretty-printed JSON. The filter emits an index action line followed by the original document for each input object. So if your original info.json file looks something like this:

{"enrol_number": 1, "firstname": "Drake", "lastname": "Wilson", "age": 16, "gender": "M"}
{"enrol_number": 2, "firstname": "Scarlet", "lastname": "Rose", "age": 14, "gender": "F"}

The output gets written into a new file students.json:

{"index": {"_index": "students", "_type": "doc"}}
{"enrol_number": 1, "firstname": "Drake", "lastname": "Wilson", "age": 16, "gender": "M"}
{"index": {"_index": "students", "_type": "doc"}}
{"enrol_number": 2, "firstname": "Scarlet", "lastname": "Rose", "age": 14, "gender": "F"}

Note: Since we are not specifying any id in the action lines, a document id is automatically generated for each document.
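If you would rather use a field from your own data as the document id, you can build it into the jq filter. Here is a quick sketch that uses the enrol_number field as the id, assuming the same info.json as above:

cat info.json | jq -c '{"index": {"_index": "students", "_type": "doc", "_id": (.enrol_number | tostring)}}, .' > students.json

With students.json in place, the next step is to index its data into the students index using a _bulk request: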

curl -XPOST "localhost:9200/students/_bulk?pretty&refresh" -H "Content-Type: application/json" --data-binary "@students.json"

This is how we format our file the way Elasticsearch’s Bulk API expects it and post it to Elasticsearch. Using the cat indices API, we can get information about the index, such as its shard count, document count, deleted document count, and primary store size:

curl "localhost:9200/_cat/indices?v"

Thanks for reading! 🙂