
Elasticsearch and Near Real Time Analytics

October 22, 2019 (updated October 24, 2019)

Recently, we got a request to create an analytics dashboard for our application’s data. The data was stored in PostgreSQL, and our initial idea was to build queries that would drive these dashboards.

Soon after we started working on this, we realized that this approach might not be the most ideal one. We ended up creating special tables to drive analytics, installing plugins to support spatial queries, and writing really complex queries that were still not fast enough. In some cases, we even had to write multiple queries in order to support a single metric.

Our second approach was to build the analytics periodically, but that isn’t real time or near real time, so we didn’t go with it.

Finally, after some research, we realized that Elasticsearch could help us achieve near real time analytics.

Elasticsearch, Search and Aggregations

What is Elasticsearch?

Elasticsearch is an open-source, RESTful, distributed search and analytics engine built on Apache Lucene.

Integration with Elasticsearch is done through an easy-to-use REST API.

Elasticsearch stores data as JSON documents, and documents are stored within an index. In other words, we can define an index as a collection of documents.

If we compare this to the SQL world, an index is to Elasticsearch what a table is to a SQL database, and a document is to an index what a record/row is to a SQL table.

Elasticsearch is schemaless, meaning we can create an index without defining the fields that its documents will have. Behind the scenes, Elasticsearch will create a schema/mapping based on the data in the index.

However, providing the mapping manually is usually preferable, as it lets us specify what type of analysis we want on each field.

Create index and insert data

As already mentioned, Elasticsearch is exposed via a REST API. We can use any HTTP client to communicate with Elasticsearch, or the Kibana Console, which has some nice features such as query autocomplete.

In order to create a new index we’ll execute `PUT /<index-name>`, optionally providing `settings` and/or `mappings` data in the request body.

Within the settings object, we define index-specific settings. Examples would be the number of shards and replicas (`number_of_shards`, `number_of_replicas`).

Within mappings, we define the schema of our documents.

Note that we don’t need to provide anything in settings/mappings; defaults will be used.

For example:
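A minimal sketch of such a request in the Kibana Console, using a hypothetical `books` index (the index name and fields are illustrative):

```json
PUT /books
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  },
  "mappings": {
    "properties": {
      "title":  { "type": "text" },
      "author": { "type": "keyword" },
      "price":  { "type": "double" }
    }
  }
}
```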

In order to insert a document into an index we’ll execute `POST /<index-name>/_doc`, with the document itself as the JSON request body.

For example:
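A sketch, again assuming a hypothetical `books` index:

```json
POST /books/_doc
{
  "title": "The Pragmatic Programmer",
  "author": "Andrew Hunt",
  "price": 29.99
}
```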

A response in this case would look like:
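Something along these lines (the exact fields vary slightly by Elasticsearch version; the values here are illustrative):

```json
{
  "_index": "books",
  "_type": "_doc",
  "_id": "kNWp0W4BxyzAbCdE",
  "_version": 1,
  "result": "created",
  "_shards": { "total": 2, "successful": 1, "failed": 0 },
  "_seq_no": 0,
  "_primary_term": 1
}
```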

By default, Elasticsearch will generate a random ID for each document, but we can also provide our own by adding the ID to the request path:
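For example, assuming the hypothetical `books` index, this indexes the document under ID `1`:

```json
PUT /books/_doc/1
{
  "title": "The Pragmatic Programmer",
  "author": "Andrew Hunt",
  "price": 29.99
}
```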

So, what is the key to getting analytics data out of Elasticsearch? Well, it’s all about search and aggregations. First, we’ll use a query to filter out only the data we care about, and then we’ll use aggregations to turn that data into meaningful analytics.

Querying Elasticsearch

Elasticsearch provides a JSON-based DSL for defining queries.

To compare with SQL, this SQL query:

in Elasticsearch will look like:
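As an illustration, assuming a hypothetical `books` index with `author` and `price` fields, the SQL query `SELECT author, COUNT(*) FROM books WHERE price > 10 GROUP BY author` could be expressed roughly as:

```json
GET /books/_search
{
  "size": 0,
  "query": {
    "range": { "price": { "gt": 10 } }
  },
  "aggs": {
    "by_author": {
      "terms": { "field": "author" }
    }
  }
}
```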

The key parts of the Elasticsearch DSL are query and aggs.

Query

Within the query part of an Elasticsearch search request, we define the data we want to fetch. Compared with SQL, this is where we put all our conditions, i.e. everything that would go into a WHERE clause.

Aggs

This is the place where we define our aggregations.

Elasticsearch has a large spectrum of aggregation functions which make it easy to get all kinds of analytics from the data set. Full documentation on aggregations can be found in the official Elasticsearch reference.

The best part is that due to the powerful search and flexibility of aggregations, we can use Elasticsearch to build an awesome analytics engine.

Hands On

Data preparation

Let’s imagine we work for a bookstore and our manager, Jane, asked us to provide some analytics.

Let’s say our book data is stored in a relational database with tables such as customer, book, and audit. The audit table holds information about the book, the customer, the date the book was ordered, and so on.

For example, it could look like this:
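A sketch of such a schema (all table and column names are illustrative):

```
book      (id, name, category, price)
customer  (id, name)
audit     (id, book_id, customer_id, ordered_at)

audit:
 id | book_id | customer_id | ordered_at
  1 |      12 |           7 | 2019-09-03
  2 |      15 |           7 | 2019-09-10
```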

Our first step would be to transform/denormalize the data into something more Elasticsearch-friendly.

We will create an audit index with the following mapping:
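A possible mapping, with illustrative field names (a nested `book` and `customer` object plus an `ordered_at` date field):

```json
PUT /audit
{
  "mappings": {
    "properties": {
      "book": {
        "properties": {
          "id":       { "type": "keyword" },
          "name":     { "type": "text" },
          "category": { "type": "keyword" },
          "price":    { "type": "double" }
        }
      },
      "customer": {
        "properties": {
          "id":   { "type": "keyword" },
          "name": { "type": "text" }
        }
      },
      "ordered_at": { "type": "date" }
    }
  }
}
```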

so our documents in Elasticsearch would look like:
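An illustrative document, assuming `book` and `customer` objects and an `ordered_at` date field:

```json
POST /audit/_doc
{
  "book": {
    "id": "12",
    "name": "The Pragmatic Programmer",
    "category": "Software",
    "price": 29.99
  },
  "customer": {
    "id": "7",
    "name": "John Doe"
  },
  "ordered_at": "2019-09-03T10:15:00Z"
}
```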

Note that, unlike in SQL, in Elasticsearch we prefer to denormalize data and store all the information we can within a single document. This keeps us flexible and keeps queries fast. It is possible to set up relations between documents in an index, but this affects performance.

However, feel free to use them if you need to avoid data redundancy. In our case, it’s not likely that a book name or customer name will change, so we’ll keep it like this.

Ready to go

OK, now we have our data indexed so let’s see what our clients want to see.

Jane: I want to see which books are the most popular 

This is an easy one: we’ll just get the top N most frequent books (by ID in our case). We can achieve that by using a terms aggregation. For example:
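A sketch, assuming an `audit` index with a `book.id` keyword field (`"size": 0` just suppresses the regular search hits, since we only care about the aggregation):

```json
GET /audit/_search
{
  "size": 0,
  "aggs": {
    "top_books": {
      "terms": { "field": "book.id", "size": 10 }
    }
  }
}
```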

A terms aggregation is used to fetch unique values along with their counts. It is a bucket aggregation, and in its response we’ll get a number of buckets where the key of each bucket is, in our case, a book.id. As we specified `"size": 10`, we’ll get a maximum of 10 buckets in the response. The example response may look like:
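Trimmed to the interesting part, with illustrative numbers:

```json
{
  ...
  "aggregations": {
    "top_books": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 142,
      "buckets": [
        { "key": "12", "doc_count": 38 },
        { "key": "15", "doc_count": 27 },
        ...
      ]
    }
  }
}
```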

Jane: Ok, but I just want those from last month

Sure, let’s just add a filter on the time field.

To filter data by date we use a range query. Note that now we’re writing it as part of the query object, not aggs. The query part of a search request is used to filter the data; the filtered data is then used in aggs. So, in our case, we will first select only the documents within the given range and then feed the results of the query into the aggregations.
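A sketch, assuming an `ordered_at` date field; `now-1M/d` is Elasticsearch date math for "one month ago, rounded down to the day":

```json
GET /audit/_search
{
  "size": 0,
  "query": {
    "range": {
      "ordered_at": { "gte": "now-1M/d", "lte": "now/d" }
    }
  },
  "aggs": {
    "top_books": {
      "terms": { "field": "book.id", "size": 10 }
    }
  }
}
```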

Jane: Can we get them grouped by category?

Well, yes. We can use sub-aggregations for such a case. Sub-aggregations allow us to aggregate within the results of a previous aggregation. In our case, we will first aggregate by category to get the top book categories, and then, as a sub-aggregation of categories, we’ll add an aggregation by book ID. This way we’ll get the top books for each category.
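A sketch, assuming `book.category` and `book.id` keyword fields:

```json
GET /audit/_search
{
  "size": 0,
  "aggs": {
    "categories": {
      "terms": { "field": "book.category", "size": 5 },
      "aggs": {
        "top_books": {
          "terms": { "field": "book.id", "size": 10 }
        }
      }
    }
  }
}
```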

We can add as many sub-aggregations as we want, as long as the aggregation we use supports sub-aggregations.

Jane: Nice, but can we get these results for each week/month/year?

To achieve this we’ll add a date_histogram aggregation as our root aggregation. A date histogram buckets our data based on the interval we set. So, if we decide to split our data into buckets of months, we’ll set the interval to one calendar month and we’ll get a bucket for each month. Now we can sub-aggregate each month bucket with the analytics of interest. In our case, we’ll aggregate the top books for each month.
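A sketch, assuming an `ordered_at` date field; note that `calendar_interval` is the parameter name on Elasticsearch 7.2+, while older versions use `interval`:

```json
GET /audit/_search
{
  "size": 0,
  "aggs": {
    "orders_per_month": {
      "date_histogram": {
        "field": "ordered_at",
        "calendar_interval": "month"
      },
      "aggs": {
        "top_books": {
          "terms": { "field": "book.id", "size": 10 }
        }
      }
    }
  }
}
```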

Example response would look like:
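Trimmed down, with illustrative numbers:

```json
{
  ...
  "aggregations": {
    "orders_per_month": {
      "buckets": [
        {
          "key_as_string": "2019-09-01T00:00:00.000Z",
          "key": 1567296000000,
          "doc_count": 73,
          "top_books": {
            "buckets": [
              { "key": "12", "doc_count": 20 },
              { "key": "15", "doc_count": 11 }
            ]
          }
        },
        ...
      ]
    }
  }
}
```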

Jane: Nice, let’s get some statistics on book prices

In this case we can use the stats or extended_stats aggregations. These aggregations compute multiple statistics over a numeric field, for example: count, min, max, avg, sum…
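A sketch, assuming a numeric `book.price` field:

```json
GET /audit/_search
{
  "size": 0,
  "aggs": {
    "price_stats": {
      "stats": { "field": "book.price" }
    }
  }
}
```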

where the response looks like:
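With illustrative numbers:

```json
{
  ...
  "aggregations": {
    "price_stats": {
      "count": 215,
      "min": 9.99,
      "max": 59.99,
      "avg": 24.37,
      "sum": 5239.55
    }
  }
}
```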

Of course, each of these statistics is also available as a separate aggregation (min, max, avg, sum…).

A lot more

These are just a few simple examples of how to get analytics out of Elasticsearch, but I think you get the idea. There is much more that can be done, and a lot more aggregations to explore. Just to mention a few:

  • Filter aggregation: you can write your own filter to act as an aggregation. Very useful when you want to bucket data in a particular way but cannot achieve it with the other aggregations.

  • Range aggregation: you can specify ranges of interest to aggregate data over. A good example would be aggregating users by age into a few buckets.

  • Geo aggregations: there are a number of geo aggregations you might find interesting. For example, you could use the geo distance aggregation to aggregate stores by their distance from the city center.

  • A lot of other aggregations that could fit your case.

  • Elasticsearch also has ways of managing relationships between documents, along with ways to aggregate data across those relations.

Conclusion

As you can see, Elasticsearch is a very powerful tool for producing fast and flexible analytics. We used the output of aggregations to visualize our data and produce reports for our business team.

Use the search part to limit the data you’ll aggregate. Use histogram aggregations to display trends. Use Elasticsearch to build awesome analytics dashboards!