Utilizing TimescaleDB for Handling Time Series Data
Time is a concept that has always been both important and intriguing to analyze. Everything, including humans, produces data as it moves through time and interacts with different objects and scenarios. However, now that we live in the age of smart homes, virtual assistants and self-driving cars, among other advancements, it has become more important than ever to analyze vast amounts of data, and to do it fast. It is both a matter of convenience, as we expect our virtual assistants to always be on our beck-and-call, and safety, as we need self-driving cars to recognize and respond to changes in their environment very quickly. The mentioned applications rely on a form of data that measures changes over time.
Where there is a need, there is also likely a way, and this is no exception. Time-series databases, specialized for working with time-series data, have pretty steadily kept their place as the fastest growing category of databases, according to a database popularity ranking website, DB-engines:
Image source: DB-engines
A deeper look at time series data
In a nutshell, time series data represents the same thing measured over time and ordered by time. To better understand why this is necessary and how different it is compared to other types of data, let’s take a look at an example.
As an owner of a taxi company, you are interested in always having up-to-date information about your vehicles’ occupancy and whereabouts. In order to successfully do this, all you need is for all of your vehicles to be entered in a database, and then updated consistently according to what is currently happening. While you may still have a timestamp field in such a database (to track idle vehicles, for example), there are two main things to notice here – new data enters your database through updates, time is not the main focus, and at any given point we can examine the current state of the system.
On the other hand, if there is a need to have access to not only where our vehicles are, but where they have been, how long they traveled from and to each location, how many passengers they transported in every ride etc., we would need to monitor data continuously. Whenever new information was received, it would need to be recorded as a separate entry. So, contrary to our previous example, new data would enter the database through inserts, time would be the primary-axis and the data would arrive in time order in most cases.
While this difference may not sound like that much, it is what makes time series data so useful – the ability to see how systems behaved in the past and how they are behaving at the moments, giving us the ability to preserve valuable information on how data changes over time and predict how it may behave in the future.
While a traditional database technically can be used to process this kind of data, there is a reason why time series databases are the fastest growing category of databases – they can do it better. Most importantly, they fulfill two main objectives:
- Scaling – since all new data enters the database as a whole new entry, time series data accumulates very quickly and is much better handled by databases fine-tuned for time series data
- Usability – time series databases are often equipped with more specialized tools to enable time series analytics, data lifecycle management etc.
Choosing a time series database
After deciding on using a time series database for your application, there are still several factors to consider before deciding on which time series database. As with most things, there is no choice that is one size fits all, so it’s important to know your use case well enough to make an adequate decision. Some factors to consider are:
- Data model
TimescaleDB is the largest open-source relational database for time series data. It is essentially an extension built upon PostgreSQL, and as such has full SQL compatibility and a relational data model. The relational data model has its advantages and disadvantages and needs to be considered for each individual use case. Generally, it requires more work to get started, but it is quite flexible and good for validation. NoSQL data models are much easier to get started with, but offer other limitations. SQL compatibility might be one of the biggest strengths of TimescaleDB, as SQL is one of the most commonly known and used programming languages, it has a huge community of millions of developers and many tutorials and training are available for those who are yet to become familiar with it.
An essential metric to consider with databases is reliability – ease of use makes no difference if data gets lost or corrupted. Considering the fact that TimescaleDB was built on PostgreSQL, it relies on a very good foundation of decades of hard work the PostgreSQL community has done. This work is widely recognized, as many tech giants worldwide also rely on PostgreSQL for critical operations. Another perk of this is being able to inherit from all the different tools and plugins this huge community develops over time.
The reason we chose TimescaleDB in our project is because it has been shown to perform significantly better in most use-cases, it supports SQL natively and has high reliability due to being backed by PostgreSQL. Due to these characteristics it is very easy to start using quickly as a developer because of the minimal learning curve while also being able to fulfill business requirements without sacrificing on quality, which is hugely important.
Getting started with TimescaleDB
Now let’s get to the fun part – using TimescaleDB in projects! Before we do that, we need to install it, of course, as well as create or select an existing database we want to use it with. Once those bases are covered, the process is pretty straightforward and very easy. The next step is to add TimescaleDB to our database:
CREATE EXTENSION IF NOT EXISTS timescaledb CASCADE;
The primary point of interaction with data when using TimescaleDB are hypertables. They are an abstraction layer designed to be intuitive for use to users familiar with standard PostgreSQL tables. Internally, hypertables are automatically split into chunks that correspond to certain time intervals. In spite of this, most of the time users don’t need to concern themselves with these chunks, as all commands made to hypertables propagate the changes to all of the chunks that belong to the hypertable in question. In order to create a hypertable, first we need a standard table which has a time column of one of the following types:
Referring back to the Taxi company example that was given earlier, such a table might look like this:
CREATE TABLE ride ( pickup_time TIMESTAMP NOT NULL, location TEXT NOT NULL, num_passengers INTEGER NOT NULL, ride_distance FLOAT, fare_amount FLOAT );
The next step is to execute the TimescaleDB command that will make this newly created table into a hypertable. The command used is simply create_hypertable and it has two required arguments – main_table and time_column_name. An optional argument worth mentioning is migrate_data, which is a boolean argument that should be set to true if you are creating a hypertable for a non-empty PostgreSQL table, because otherwise an error will occur. An important note to make is that migrating data might take a long time. If your use case surpasses the mentioned arguments, you might want to check out some other options listed in the documentation. For the example given here, partitioning on just time is enough:
SELECT create_hypertable('ride', 'pickup_time');
And you’re done! From here on out, you can use this table via “normal” SQL commands, such as INSERT, ALTER, DROP, SELECT etc in the way we’re already used to doing. An example of an insert command could be:
INSERT INTO ride(pickup_time, location, num_passengers, ride_distance, fare_amount) VALUES (NOW(), 'Milana Preloga 12A', 2, 2.46, 4.16);
As you can see, using TimescaleDB is no different than using a standard PostgreSQL database which is one of its’ many advantages. With just a little bit of prep work, it’s possible to get started with time series data in no time.
Considering that TimescaleDB is based on PostgreSQL and retains all of its functionalities, integrating it into a project is very easy using frameworks such as Hibernate if using Java.
Why we use TimescaleDB
The current situation in the world with the Covid-19 pandemic has shaken up everyone and everything, from people to whole countries and industries. One of the industries that has found itself thriving in the midst of all the chaos is the delivery industry. From businesses migrating overwhelmingly to online shopping, to essential goods and services needing to stay operational, the demand is greater than ever, and the supply needs to be caught up. Unprecedented times are tricky to navigate and making mistakes is expensive which is why finding ways to minimize them is crucial.
One necessary contribution to the solution for this problem through working is continually tracking the goods that are being transported through several different metrics to ensure that the optimal conditions are met and no spoilage occurs. In order to make this possible, it is necessary to check the state of all the different vehicles and products in regular small time intervals.
Each state is accompanied by the corresponding timestamp. The timestamps are important because time is of the essence when dealing with issues, and it’s an important part of the full picture when reviewing issues after the fact. The changes in different states over time are what makes it possible to analyze what was and is happening, what has been dealt with and what still needs to be addressed. Not only is this information necessary in order to avoid any expensive failures, it is also valuable when addressing problems that do happen, because it is possible to analyze each chunk of time within a certain interval to gain a better understanding of what needs more attention in the future.
This is where time series data, and TimescaleDB come into play – the data accumulates fast, and still needs to remain efficiently retrievable when resolving queries despite the massive amounts of it. The hypertables we’ve mentioned before are the perfect tool to make this happen due to the non-overlapping chunks they create by default which concern a specific time interval and minimize the amount of data that needs to be processed in order to resolve a query.