Big Data: ElasticSearch Datastore in Action

Published on July 31st, 2017

ElasticSearch (ES) is an open-source, distributed, RESTful search engine. It runs on the JVM and is built on top of Apache Lucene. ES is great for indexing large amounts of data, sifting through large result sets, and analyzing data.

A single ES shard can store up to about 2.1 billion documents or 274 billion distinct terms, because each shard is a Lucene index and inherits Lucene's limits. This is awesome; however, there are some important things to be aware of before you start importing records (known as “documents” in ES). One of those things is that the number of primary shards must be chosen when the index is created; it cannot be changed later with a simple settings update.

Unfortunately, I learned this lesson the hard way! One of my ES indexes contained 1.6 billion documents and was starting to cause issues. So, if you’re importing billions of records, please plan accordingly and add more primary shards than the default.
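
As a rough illustration of planning shards up front, here is a minimal Python sketch that estimates a shard count and builds the request that creates an index with that many primary shards. The helper names, the index name books, the localhost URL, and the target of 200 million documents per shard are all assumptions made for the example, not recommendations from this post.

```python
import json
import math
import urllib.request


def shards_needed(expected_docs, docs_per_shard):
    """Return how many primary shards keep each shard under docs_per_shard."""
    return max(1, math.ceil(expected_docs / docs_per_shard))


def create_index_request(base_url, index, primary_shards, replicas=1):
    """Build (but do not send) the PUT request that creates the index."""
    body = {
        "settings": {
            "number_of_shards": primary_shards,
            "number_of_replicas": replicas,
        }
    }
    return urllib.request.Request(
        url=f"{base_url}/{index}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


if __name__ == "__main__":
    # Hypothetical sizing: 1.6 billion documents, 200 million per shard.
    shards = shards_needed(expected_docs=1_600_000_000, docs_per_shard=200_000_000)
    request = create_index_request("http://localhost:9200", "books", shards)
    # Sending needs a running ES node: urllib.request.urlopen(request)
    print(request.get_method(), request.full_url, request.data.decode("utf-8"))
```

The request is only built here, so you can inspect the settings before anything touches the cluster.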

Note: The great thing about ElasticSearch is that it speaks JSON over HTTP. This lets virtually any programming language talk to an ElasticSearch datastore easily.

In this blog post I will demonstrate how to import documents via the _bulk API module. I will also show you how to communicate with your ElasticSearch datastore in Python using the ElasticSearch library. Afterward, you should be able to create a valid ES JSON file, import a large set of documents, and manipulate them in Python.

EXAMPLE 1: IMPORTING RECORDS VIA THE _BULK API MODULE

Document importing is relatively fast with the _bulk API module. Let’s say you have security books that you want to index. In this example, I will demonstrate how to import these documents via the bulk API.

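Parse your data into the bulk format: each document takes two lines, an action line (for example, an index action naming the target index and type) followed by a line holding the document's JSON source, and every line ends with a newline. ES is most efficient when all documents in an index share the same fields.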
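
A minimal sketch of producing that format in Python follows. The book records are made-up placeholders, to_bulk_lines is a hypothetical helper, and the books index with a security type mirrors the layout used later in this post. Newer ES releases have dropped mapping types, so the _type key would be omitted there.

```python
import json


def to_bulk_lines(books, index="books", doc_type="security"):
    """Yield the action line and the source line for each document."""
    for doc_id, book in enumerate(books, start=1):
        action = {"index": {"_index": index, "_type": doc_type, "_id": str(doc_id)}}
        yield json.dumps(action)
        yield json.dumps(book)


# Hypothetical records; every document has the same fields.
BOOKS = [
    {"title": "Example Security Book One", "author": "Author A", "year": 2016},
    {"title": "Example Security Book Two", "author": "Author B", "year": 2017},
]

if __name__ == "__main__":
    # Each line, including the last one, must end with a newline.
    print("\n".join(to_bulk_lines(BOOKS)) + "\n", end="")
```

Redirecting the printed output to a file gives you something you can hand to the _bulk endpoint.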

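Then use the curl command to POST the file to the _bulk endpoint in ES. Pass the file with --data-binary so that curl preserves the newlines the bulk format depends on.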
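
If you would rather stay in Python, the same upload can be sketched with the standard library. This is only an illustration: bulk_request and import_file are hypothetical helpers, and the URL you pass in (for example http://localhost:9200/books/security/_bulk) assumes a local node and the index layout used in this post.

```python
import urllib.request


def bulk_request(bulk_url, path):
    """Build the POST request that uploads one bulk file."""
    with open(path, "rb") as handle:
        payload = handle.read()  # raw bytes, so newlines are preserved
    return urllib.request.Request(
        url=bulk_url,
        data=payload,
        headers={"Content-Type": "application/x-ndjson"},
        method="POST",
    )


def import_file(bulk_url, path):
    """Send the file to ES and return the raw JSON response text."""
    with urllib.request.urlopen(bulk_request(bulk_url, path)) as response:
        return response.read().decode("utf-8")


# Example call (needs a running ES node):
# print(import_file("http://localhost:9200/books/security/_bulk", "books.json"))
```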

If the import is successful, you will get “result”: “created” for each document. Each JSON file should have between 2 and 80,000 lines. Do not go over 10MB per file; a file that is too big will be rejected.
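
One way to stay inside these limits is to split a large bulk file on document boundaries. The sketch below is a minimal illustration: split_bulk_lines is a hypothetical helper, its defaults mirror the 80,000-line and 10MB figures above, and it assumes every action line is followed by exactly one source line (true for index and create actions, not for delete).

```python
def split_bulk_lines(lines, max_lines=80_000, max_bytes=10 * 1024 * 1024):
    """Yield lists of bulk lines that respect both limits.

    Lines are consumed two at a time so an action line is never
    separated from the source line that follows it.
    """
    part, part_bytes = [], 0
    iterator = iter(lines)
    for action in iterator:
        source = next(iterator, None)
        if source is None:
            raise ValueError("action line without a source line")
        # +2 accounts for the newline that ends each of the two lines.
        pair_bytes = len(action.encode("utf-8")) + len(source.encode("utf-8")) + 2
        if part and (len(part) + 2 > max_lines or part_bytes + pair_bytes > max_bytes):
            yield part
            part, part_bytes = [], 0
        part += [action, source]
        part_bytes += pair_bytes
    if part:
        yield part
```

To use it on a file, pass in the lines with their trailing newlines stripped, for example (line.rstrip("\n") for line in handle), and write each yielded part to its own file with a newline after every line.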

Bash is an excellent tool for importing multiple files into ES. If you have one massive JSON file, split it into parts, then import them with a bash loop that runs curl on each part and redirects the output to /dev/null.

Redirecting to /dev/null throws the output away rather than printing it on the screen. This makes importing much faster. Be careful with this approach, as you may miss errors!
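
A middle ground is to keep the response but only look at the failures. A bulk response carries a top-level errors flag plus one entry per action, so a small script can stay quiet on success and still report rejected documents. The sketch below is a minimal illustration; failed_items is a hypothetical helper, and it expects the JSON text returned by a _bulk request.

```python
import json


def failed_items(response_text):
    """Return (id, status, error) for every item the bulk request rejected."""
    response = json.loads(response_text)
    if not response.get("errors"):
        return []  # fast path: ES reported no failures at all
    failures = []
    for item in response.get("items", []):
        # Each item is keyed by its action name: index, create, update or delete.
        for result in item.values():
            if "error" in result:
                failures.append((result.get("_id"), result.get("status"), result["error"]))
    return failures
```

Printing only what failed_items returns keeps the import loop nearly as quiet as /dev/null while still surfacing problems.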

Visit the index in the browser to verify everything has been imported.

http://localhost:9200/books/security/_count

The _count API will show how many documents are currently in the books/security index.
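
The same check can be scripted. Here is a minimal sketch, assuming the node from the URL above is reachable; parse_count and fetch_count are hypothetical helper names.

```python
import json
import urllib.request


def parse_count(response_text):
    """Extract the document count from a _count response."""
    return json.loads(response_text)["count"]


def fetch_count(count_url):
    """Ask ES how many documents the index currently holds."""
    with urllib.request.urlopen(count_url) as response:
        return parse_count(response.read().decode("utf-8"))


# Example call (needs a running ES node):
# print(fetch_count("http://localhost:9200/books/security/_count"))
```

Comparing the returned number with the number of documents you sent is a quick way to spot a partial import.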

EXAMPLE 2: WORKING WITH ES IN PYTHON

In this example, I will demonstrate how to use Python to search our ES datastore.

Use pip install elasticsearch to install the Python ES library.
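
Here is a minimal sketch of a search with that library. It assumes a node at http://localhost:9200 and a client version that matches your server; search_books and total_hits are hypothetical helper names. Leaving index unset searches every index.

```python
def total_hits(response):
    """Return the hit count from a search response.

    Older ES versions report hits.total as a number, newer ones as
    {"value": ..., "relation": ...}; both shapes are handled.
    """
    total = response["hits"]["total"]
    return total["value"] if isinstance(total, dict) else total


def search_books(query, host="http://localhost:9200", index=None):
    """Run a query-string search and return the raw response."""
    # Imported here so total_hits() can be used without the client installed.
    from elasticsearch import Elasticsearch

    es = Elasticsearch(host)
    return es.search(index=index, q=query)


# Example calls (need a running ES node and the elasticsearch package):
# for query in ("Justin Seitz", "2017"):
#     print(query, total_hits(search_books(query)))
```

Passing index="books" would restrict the search to that index instead of the whole datastore.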

In my datastore, the search query “Justin Seitz” returned one document and the search query “2017” returned two. Since I didn’t specify a doc_type (ours is security), the search covered the entire ES datastore. By default, every field in a document is searchable in ElasticSearch.

Troubleshooting

If you’re new to ES (especially Amazon Web Services ES) you may run into a few errors.

One error you may see can be caused by the ES disk space being full. To fix it, expand the storage and/or add more nodes.

Another error is specific to AWS ES and means that your IP is not allowed to communicate with ES. To fix it, whitelist your IP in Amazon.

IN CONCLUSION

Now you know how to create valid ES JSON files, import a large number of documents into ES using the _bulk API, and use Python to manipulate your ES datastore. This blog post will help get you on your way to managing your very own ElasticSearch datastore.

ElasticSearch is a great free product for storing and analyzing big data. It has a vibrant community and is still being actively improved. You can also upgrade ES to a newer version without losing your data. It is highly scalable and manageable.

About Hurricane Labs

Hurricane Labs is a dynamic Managed Services Provider that unlocks the potential of Splunk and security for diverse enterprises across the United States. With a dedicated, Splunk-focused team and an emphasis on humanity and collaboration, we provide the skills, resources, and results to help make our customers’ lives easier.

For more information, visit www.hurricanelabs.com and follow us on Twitter @hurricanelabs.