Recently, we got a request to create an analytics dashboard for the application’s data we have. The data was stored within PostgreSQL and our initial idea was to build queries that would drive these dashboards.
Soon after we started working on this, we realized that this approach might not be the ideal one. We ended up creating special tables just to drive analytics, installing plugins to support spatial queries, and writing really complex queries that were still not fast enough. In some cases we even had to write multiple queries to support a single metric.
Our second approach was to precompute analytics periodically, but that isn't real-time or even near real-time, so we didn't go with it.
Finally, after some research, we realized that Elasticsearch could help us achieve real-time analytics.
Elasticsearch, Search and Aggregations
What is Elasticsearch?
Elasticsearch is an open-source, RESTful, distributed search and analytics engine built on Apache Lucene.
Integration with Elasticsearch is done through an easy-to-use REST API.
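As a quick illustration, hitting the root endpoint (from the Kibana Console or any HTTP client) returns basic cluster and version information, which is a handy sanity check that the cluster is reachable:

GET /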
Elasticsearch stores data as JSON documents, and documents are stored within an index. In other words, an index is a collection of documents.
If we compare this to the SQL world, we can say that an index is to Elasticsearch what a table is to a SQL database, and a document is to an index what a record/row is to a SQL table.
Elasticsearch is schemaless, meaning we can create an index without defining the fields its documents will have. Behind the scenes, Elasticsearch creates a schema/mapping based on the data in the index.
However, providing the mapping manually is usually preferable, as it also lets us specify what type of analysis we want on each field.
Create index and insert data
As already mentioned, Elasticsearch is exposed via a REST API. We can use any HTTP client to communicate with it, or the Kibana Console, which has some nice features such as query auto-completion.
In order to create a new index, we execute HTTP PUT {es_host}/{name_of_index}, providing settings and/or mapping data in the request body.
Within the settings object, we define index-specific settings. Examples of these settings are number_of_shards and number_of_replicas.
Within mapping, we define the schema of our documents.
Note that we don't need to provide any settings or mapping at all; defaults will be used if we don't.
For example:
PUT /users
{
  "settings" : {
    // we define settings for our index here
    "number_of_shards" : 1
  },
  "mappings" : {
    // schema: we define the fields of our documents here
    "properties" : {
      "name" : { "type" : "text" },
      "date" : { "type" : "date" },
      "salary": { "type": "integer" }
    }
  }
}
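And since settings and mappings are optional, an index that only needs defaults can be created with a bare request; the index name books below is just a hypothetical example:

PUT /books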
In order to insert a document into an index we’ll execute HTTP POST {host}/{name_of_index}/_doc/
For example:
POST users/_doc/
{
  "name" : "John Doe",
  "date" : "2009-11-15T14:12:12",
  "salary": 3000
}
A response, in this case, would look like:
{ "_index" : "audit", "_type" : "_doc", "_id" : "mNMGS20B_e9lX4DjI5XZ", //This is id of record we created "_version" : 1, "result" : "created", }
By default, Elasticsearch will create a random ID for each document, but we can also provide our own ID by adding {id} to the request path: HTTP POST {host}/{name_of_index}/_doc/{id}
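For instance, a sketch that indexes a user under an explicit ID of our choosing; the ID 1 and the field values below are made-up examples:

POST users/_doc/1
{
  "name" : "Jane Roe",
  "date" : "2010-01-05T09:30:00",
  "salary": 4200
}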
So, what is the key to getting analytics data out of Elasticsearch? It's all about search and aggregations. First, we use a query to filter out only the data we care about, and then we use aggregations to turn that data into meaningful analytics.
Querying Elasticsearch
Elasticsearch provides a JSON-based DSL for defining queries.
For comparison, this SQL query:
SELECT sum(salary) FROM users WHERE name = 'Jon' LIMIT 10;
in Elasticsearch will look like:
GET users/_search
{
  "size": 10,
  "query" : {
    "bool": {
      "must": [
        { "match": { "name": "Jon" } }
      ]
    }
  },
  "aggs": {
    "SalarySum": {
      "sum": { "field": "salary" }
    }
  }
}
The key parts of the Elasticsearch DSL are query and aggs.
Query
Within the query part of a search request, we define which data we want to fetch. Compared with SQL, this is where we put all our conditions, like name = 'Jon' or age BETWEEN 10 AND 30.
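As a rough sketch against the users index from earlier, combining a condition on name with a range on salary (the Elasticsearch counterpart of a BETWEEN clause) could look like this; the salary bounds are made-up values:

GET users/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "name": "Jon" } },
        { "range": { "salary": { "gte": 1000, "lte": 3000 } } }
      ]
    }
  }
}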
Aggs
This is the place where we define our aggregations.
Elasticsearch has a large spectrum of aggregations which make it easy to get all kinds of analytics from a data set. The full documentation on aggregations can be found in the official Elasticsearch reference.
The best part is that, thanks to the powerful search and the flexibility of aggregations, we can use Elasticsearch to build an awesome analytics engine.
Hands-On
Data preparation
Let's imagine we work for a bookstore and our manager Jane has asked us to provide analytics.
Let's say our book data is stored in a relational database with tables such as customer, book, and audit. The audit table holds information about the book, the customer, the date the book was ordered, and so on.
For example, it could look like this:
INSERT INTO audit(book_id, customer_id, order_date, price) VALUES (1, 2, '2019-10-27', 1.99)
Our first step would be to transform/denormalize the data into something more Elasticsearch friendly.
We will create an index audit with mapping:
PUT audit
{
  "mappings": {
    "properties": {
      "book": {
        "properties": {
          "category": {
            "properties": {
              "id": { "type": "long" },
              "name": { "type": "text", "fields": { "keyword": { "type": "keyword" } } }
            }
          },
          "id": { "type": "long" },
          "name": { "type": "text", "fields": { "keyword": { "type": "keyword" } } }
        }
      },
      "customer": {
        "properties": {
          "firstName": { "type": "text", "fields": { "keyword": { "type": "keyword" } } },
          "id": { "type": "long" },
          "lastName": { "type": "text", "fields": { "keyword": { "type": "keyword" } } }
        }
      },
      "price": { "type": "long" },
      "time": { "type": "date" }
    }
  }
}
so our documents in Elasticsearch would look like:
POST audit/_doc/
{
  "book": {
    "id": 15,
    "name": "JavaScript: The Good Parts",
    "category": { "id": 5, "name": "Software" }
  },
  "customer": { "id": 12, "firstName": "John", "lastName": "Doe" },
  "price": 50,
  "time": 1510777312961
}
Note that, unlike in SQL, in Elasticsearch we prefer to denormalize data and store all the information we can within a single document. This keeps us flexible and our queries fast. It is possible to set up relations between documents in an index, but that will affect performance.
However, feel free to use relations if avoiding data redundancy matters to you. In our case, it's not likely that a book name or customer name will change, so we'll keep it like this.
Ready to go
OK, now we have our data indexed so let’s see what our clients want to see.
Jane: I want to see which books are the most popular
This is an easy one: we just need the top N most frequent books (by book.id in our case). We can achieve that with a terms aggregation. For example:
GET audit/_search
{
  "aggs": {
    "TopBooks": {
      "terms": {
        "field": "book.id",
        "size": 10
      }
    }
  }
}
A terms aggregation fetches unique values along with their counts. It is a bucket aggregation: in the response we get a number of buckets, where the key of each bucket is, in our case, a book.id. Since we specified "size": 10, we'll get at most 10 buckets back. An example response may look like:
{ ... "aggregations" : { "TopBooks" : { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 0, "buckets" : [ { "key" : "15", //This is book.id property from our index "doc_count" : 650 //This tells us there was 650 books with book.id = 15 }, { "key" : "642", "doc_count" : 369 }, { "key" : "1, "doc_count" : 222 } ... more buckets ] } } }
Jane: Ok, but I just want those from last month
Sure, let’s just add a filter for the time field.
GET audit/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "time": {
              "gte": "now-1M"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    ...
  }
}
To filter data by date we use a range query (in date math, M stands for months; a lowercase m would mean minutes). Note that this time we're writing the condition inside the query object, not aggs. The query part of a search request filters out the data we want, and the filtered data is then fed into aggs. So, in our case, we first keep only the documents within the given range and then run the aggregations on them.
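If Jane later asks for a specific calendar month rather than a rolling one, the same range query can take explicit dates; for example, assuming we want October 2019 (the aggs part stays the same):

GET audit/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "time": {
              "gte": "2019-10-01",
              "lt": "2019-11-01"
            }
          }
        }
      ]
    }
  }
}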
Jane: Can we get them grouped by category?
Well, yes. We can use sub-aggregations for that. Sub-aggregations let us aggregate within the results of a parent aggregation. In our case, we first aggregate by book.category.id to get the top book categories, and then, as a sub-aggregation of categories, we add an aggregation by book.id. This way we get the top books for each category.
{ "aggs": { "CategoryBreakDown": { "terms": { "field": "book.category.id", "size": 10 }, "aggs": { "TopBooks": { "terms": { "field": "book.id", "size": 10 } } } } } }
We can add as many sub-aggregations as we want, as long as the aggregation we use supports sub-aggregations.
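To illustrate, here is a sketch that goes one level deeper and also computes the average price of each top book within its category; the aggregation names are arbitrary:

GET audit/_search
{
  "aggs": {
    "CategoryBreakDown": {
      "terms": { "field": "book.category.id", "size": 10 },
      "aggs": {
        "TopBooks": {
          "terms": { "field": "book.id", "size": 10 },
          "aggs": {
            "AveragePrice": { "avg": { "field": "price" } }
          }
        }
      }
    }
  }
}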
Jane: Nice, but can we get these results for each week/month/year?
To achieve this, we'll add a date_histogram aggregation as our root aggregation. A date histogram buckets our data based on the interval we set. So, if we decide to split our data into monthly buckets, we set "interval": "month" and get a bucket for each month. Then we can sub-aggregate each month's bucket with the analytics of interest; in our case, the top books for each month.
{ "aggs": { "PerMonth": { "date_histogram": { "field": "time", "interval": "month" }, "aggs": { ... } } } }
Example response would look like:
{ "aggregations" : { "PerMonth" : { "buckets" : [ { "key_as_string" : "2019-05-01T00:00:00.000Z", "key" : 1556668800000, "doc_count" : 167, "TopBooks" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ { "key" : 5, "doc_count" : 41 // this tells us that we sold 41 book in May 2019 with ID 5 }, { "key" : 2, "doc_count" : 38 }, { "key" : 3, "doc_count" : 34 }, { "key" : 4, "doc_count" : 29 }, { "key" : 1, "doc_count" : 25 } ] } }, { "key_as_string" : "2019-06-01T00:00:00.000Z", "key" : 1559347200000, "doc_count" : 279, "TopBooks" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ ...
Jane: Nice, let's get some statistics on book prices
In this case we can use the stats or extended_stats aggregations. These compute multiple statistics over a numeric field, such as count, min, max, avg, and sum.
{ "aggs": { "PriceStatistics": { "stats": { "field": "price" } } } }
where the response looks like:
{ "aggregations" : { "PriceStatistics" : { "count" : 999, "min" : 5.0, "max" : 100.0, "avg" : 50.72472472472472, "sum" : 50674.0 } }
Of course, it is possible to execute any of the aggregations separately.
A lot more
These are just a few simple examples of how to get analytics out of Elasticsearch, but I think you get the idea. There is much more that can be done, and a lot more aggregations to explore. Just to mention a few:
- Filter aggregation: you can write your own filter to act as an aggregation. Very useful when you want to split data into a few buckets but cannot achieve it with the other aggregations.
- Range aggregation: you can specify ranges of interest to aggregate data by. A good example would be aggregating users into a few age buckets (a similar sketch over our price field appears after this list).
- Geo aggregations: there are a number of geo aggregations you might find interesting. For example, you could use the geo-distance aggregation to bucket stores by their distance from the city center.
- Many other aggregations that could fit your case.
- Elasticsearch also has a way of managing relationships between documents, and there are ways to aggregate data using those relations.
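To make the range aggregation a bit more concrete, here is a sketch that buckets our audit documents into a few price bands; the boundaries are made-up values:

GET audit/_search
{
  "aggs": {
    "PriceRanges": {
      "range": {
        "field": "price",
        "ranges": [
          { "to": 20 },
          { "from": 20, "to": 50 },
          { "from": 50 }
        ]
      }
    }
  }
}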
Conclusion
As you can see, Elasticsearch is a very powerful tool for producing fast and flexible analytics. We used the output of aggregations to visualize our data and produce reports for our business team.
Use the search part to limit the data you'll aggregate. Use histogram aggregations to display trends. Use Elasticsearch to build awesome analytics dashboards!