ElasticSearch is an open-source, distributed, RESTful search engine. ES (ElasticSearch) runs on the JVM and is built on top of Apache Lucene. ES is great for indexing large amounts of data, sifting through a large result set, and analyzing data.
A single ES shard (which is a Lucene index under the hood) can store up to 2.1 billion documents or 274 billion distinct terms. This is awesome; however, there are some important things to be aware of before you start importing records (known as “documents” in ES). One of those things is that the number of primary shards must be set when the index is created.
Unfortunately, I learned this lesson the hard way! One of my ES indexes contained 1.6 billion documents and was starting to cause issues. So, if you’re importing billions of records, please plan accordingly and add more primary shards than the default.
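To make the shard planning concrete, here is a minimal sketch of the settings body you could send when creating an index. The helper name `index_settings`, the index name `books`, and the shard counts are illustrative assumptions, not values from my own cluster.

```python
import json


def index_settings(primary_shards, replicas=1):
    """Build the JSON body for creating an index with a fixed shard count."""
    if primary_shards < 1:
        raise ValueError("an index needs at least one primary shard")
    return {
        "settings": {
            "index": {
                "number_of_shards": primary_shards,
                "number_of_replicas": replicas,
            }
        }
    }


# Hypothetical choice: 10 primary shards for an index expected to grow very large.
body = json.dumps(index_settings(10))
print(body)
```

This body would go in a PUT request to the index URL (for example http://localhost:9200/books, an assumed local address) before any documents are imported.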
Note: The great thing about ElasticSearch is that it is JSON over HTTP. This has the advantage of allowing multiple programming languages to easily talk to an ElasticSearch datastore.
In this blog post I will demonstrate how to import documents via the _bulk API module. I will also show you how to communicate with your ElasticSearch datastore in Python using the elasticsearch library. Afterward, you should be able to create a valid ES JSON file, import a large set of documents, and manipulate them in Python.
EXAMPLE 1: IMPORTING RECORDS VIA THE _BULK API MODULE
Document importing is relatively fast with the _bulk API module. Let’s say you have security books that you want to index. In this example, I will demonstrate how to import these documents via the bulk API.
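Parse your data into the _bulk format: each document takes two lines, an action line followed by the document itself, and the file must end with a newline. ES is most efficient when all documents in an index share the same fields.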
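As a rough sketch of producing that format from Python, the snippet below turns a list of dictionaries into action/document line pairs. The book records are placeholders, the helper name `to_bulk_lines` is made up for this example, and the `books`/`security` names follow the index and type used later in this post.

```python
import json


def to_bulk_lines(docs, index, doc_type=None):
    """Yield action/source line pairs in the newline-delimited _bulk format."""
    for doc_id, doc in enumerate(docs, start=1):
        meta = {"_index": index, "_id": str(doc_id)}
        if doc_type is not None:
            # Older ES versions expect a type; newer versions have removed types.
            meta["_type"] = doc_type
        yield json.dumps({"index": meta})
        yield json.dumps(doc)


# Placeholder records; every document carries the same fields.
books = [
    {"title": "Example Book A", "author": "Author One", "year": 2016},
    {"title": "Example Book B", "author": "Author Two", "year": 2017},
]

# The _bulk body must end with a newline.
payload = "\n".join(to_bulk_lines(books, "books", "security")) + "\n"
print(payload, end="")
```

Writing `payload` to a file gives you something you can hand to the _bulk endpoint.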
Then use the curl command to import the file into ES:
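If you would rather stay in Python than shell out to curl, a roughly equivalent request can be sketched with the standard library. The address http://localhost:9200 and the helper names are assumptions for illustration; the snippet only prepares the request, and `send` is what would actually contact ES.

```python
import urllib.request


def build_bulk_request(payload, base_url="http://localhost:9200"):
    """Prepare (but do not send) a POST to the _bulk endpoint."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/_bulk",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/x-ndjson"},
        method="POST",
    )


def send(request, timeout=30):
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# A one-document payload in the action/document format (placeholder data).
sample = '{"index": {"_index": "books", "_id": "1"}}\n{"title": "Example Book A"}\n'
request = build_bulk_request(sample)
print(request.get_method(), request.full_url)
```

Calling `send(request)` against a running node returns the JSON response as a string.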
If the import is successful, you will get “result”: “created” for each document. Each JSON file should have between 2 and 80,000 lines. Do not go over 10MB per file; a larger file is too big and will be rejected.
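Scanning a long response by eye gets old quickly, so here is a small sketch that counts the created documents and collects any per-item errors. The sample response is hand-written to match the general shape of a _bulk response; it is not output from a real cluster, and `summarize_bulk_response` is a name I made up.

```python
import json


def summarize_bulk_response(response_text):
    """Count created documents and collect per-item errors from a _bulk response."""
    response = json.loads(response_text)
    created, failures = 0, []
    for item in response.get("items", []):
        # Each item has a single key naming the action (e.g. "index").
        result = next(iter(item.values()))
        if "error" in result:
            failures.append((result.get("_id"), result["error"]))
        elif result.get("result") == "created":
            created += 1
    return created, failures


# Hand-written sample shaped like a _bulk response.
sample = json.dumps({
    "took": 5,
    "errors": True,
    "items": [
        {"index": {"_index": "books", "_id": "1", "result": "created", "status": 201}},
        {"index": {"_index": "books", "_id": "2", "status": 400,
                   "error": {"type": "mapper_parsing_exception",
                             "reason": "failed to parse"}}},
    ],
})

created, failures = summarize_bulk_response(sample)
print(created, "created;", len(failures), "failed")
```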
Bash is an excellent tool to import multiple files into ES. If you have one massive JSON file, split it up into parts. Then import it with this bash code:
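The same split-then-import idea can also be sketched in Python. The helper below is an illustration (the name `split_bulk_lines` and the demo data are made up). It assumes every action line is followed by a document line, which holds for index actions, and it uses the 10MB and 80,000-line limits mentioned above as defaults.

```python
def split_bulk_lines(lines, max_bytes=10 * 1024 * 1024, max_lines=80000):
    """Group bulk lines into chunks without separating an action line from its document."""
    if len(lines) % 2 != 0:
        raise ValueError("expected an even number of lines (action + document pairs)")
    chunks, current, current_bytes = [], [], 0
    # Walk the lines two at a time: action line, then document line.
    for i in range(0, len(lines), 2):
        pair = [lines[i], lines[i + 1]]
        # +1 per line for the trailing newline.
        pair_bytes = sum(len(line.encode("utf-8")) + 1 for line in pair)
        too_big = current_bytes + pair_bytes > max_bytes
        too_long = len(current) + 2 > max_lines
        if current and (too_big or too_long):
            chunks.append(current)
            current, current_bytes = [], 0
        current.extend(pair)
        current_bytes += pair_bytes
    if current:
        chunks.append(current)
    return chunks


# Five placeholder documents, split into parts of at most four lines each.
lines = []
for n in range(1, 6):
    lines.append('{"index": {"_index": "books", "_id": "%d"}}' % n)
    lines.append('{"title": "Example Book %d"}' % n)

parts = split_bulk_lines(lines, max_lines=4)
print([len(part) for part in parts])
```

Each part can then be joined with newlines (plus a final newline) and sent to _bulk as its own request.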
Redirecting to /dev/null takes the output and throws it away rather than printing it on the screen. This makes importing much faster. Be careful with this approach, as you may miss errors!
Visit the index in the browser to verify everything has been imported.
The _count API will show how many documents are currently in the books/security index.
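If you want to check the count from a script instead of the browser, something along these lines should work. The localhost address and helper names are assumptions, and the typed URL (books/security) only applies to older ES versions that still support types.

```python
import json
import urllib.request


def count_url(index, doc_type=None, base_url="http://localhost:9200"):
    """Build the _count URL for an index (and, on older ES versions, a type)."""
    parts = [base_url.rstrip("/"), index]
    if doc_type:
        parts.append(doc_type)
    parts.append("_count")
    return "/".join(parts)


def fetch_count(url, timeout=30):
    """Call the _count endpoint and return the number of documents it reports."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["count"]


# Only builds the URL; fetch_count(url) would contact ES.
url = count_url("books", "security")
print(url)
```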
EXAMPLE 2: WORKING WITH ES IN PYTHON
In this example, I will demonstrate how to use Python to search our ES datastore.
Use pip install elasticsearch to install the Python ES library.
If you run this code it will return:
The search query “Justin Seitz” returned one document. The search query “2017” returned two documents. Since I didn’t specify a doc_type (ours is security), it searched the entire ES datastore. Anything is searchable in ElasticSearch.
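A search along these lines can be sketched as follows. The `count_hits` and `connect` helpers, the localhost address, and the `FakeClient` stand-in with its placeholder records are all assumptions for illustration; the fake client exists only so the sketch runs without a cluster, and its substring matching is far simpler than what ES really does.

```python
def count_hits(client, query):
    """Run a query-string search across all indexes and count the hits returned."""
    # No index is passed, so every index is searched.
    # Note: ES returns only the first page of hits (10 by default).
    response = client.search(q=query)
    return len(response["hits"]["hits"])


def connect(url="http://localhost:9200"):
    """Create a real client; needs the elasticsearch package and a reachable node."""
    from elasticsearch import Elasticsearch  # imported lazily so the demo runs without it
    return Elasticsearch(url)


class FakeClient:
    """Stand-in for a real client so the sketch can run without a cluster."""

    def __init__(self, docs):
        self.docs = docs

    def search(self, q):
        # Naive case-insensitive substring match over every field value.
        hits = [{"_source": doc} for doc in self.docs
                if any(q.lower() in str(value).lower() for value in doc.values())]
        return {"hits": {"hits": hits}}


# Placeholder records, not real book data.
fake = FakeClient([
    {"title": "Example Book A", "author": "Justin Seitz", "year": 2017},
    {"title": "Example Book B", "author": "Author Two", "year": 2017},
    {"title": "Example Book C", "author": "Author Three", "year": 2015},
])

print(count_hits(fake, "Justin Seitz"), count_hits(fake, "2017"))
```

Swapping `fake` for `connect()` would run the same two queries against a live datastore.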
If you’re new to ES (especially Amazon Web Services ES) you may run into a few errors.
If you see an error like this:
It could be because the ES disk space is full. Expand the shard space and/or add more nodes.
If you’re using Amazon AWS ES and you see this error:
Your IP is not allowed to communicate with ES. Whitelist your IP in Amazon.
Now you know how to create valid ES JSON files, import large numbers of documents into ES using the _bulk API, and use Python to manipulate your ES datastore. This blog post will help get you on your way to managing your very own ElasticSearch datastore.
ElasticSearch is the perfect free product to store and analyze big data. It has a vibrant community and ES is still being improved daily. You can also update ES to a newer version without losing all your data. It is highly scalable and manageable.