So you’ve decided to use Redshift’s Materialized Views for your next project? Great, you just need to define your queries, create the views, and you’re good to go, right? Not so fast! Materialized views can be powerful, but you need to prepare yourself for many limitations and troubleshooting that come with them. In this blog, I will summarize my experience and try to save you from possible headaches I got while trying to make them work as I wanted.

What are Materialized Views?

Simply said, Materialized views (short MVs) are precomputed result sets that are used to store data of a frequently used query. They are mostly used in data warehousing, where performing complex queries on large tables is a regular need. Instead of running the same query for every computation, you can create a materialized view and query it like a regular database table. In fact, they are actual database tables that have metadata associated with them. They are beneficial if you have a repeatable or predictable complex query with joins and aggregations that is querying a lot of data in the base tables. In this case, materialized view can improve the performance drastically. One of the main features regarding materialized views is the fact that they can be refreshed (automatically or incrementally) which means that they don’t need to be recreated in case there is new data in the base tables. Sadly, even though they sound awesome, they have their limitations which will be covered in this blog.

It would be best if you also kept in mind that these concepts are based on PostgreSQL but, like Redshift, have many limitations. You can always visit Amazon’s official documentation for Materialized views for a detailed description.

Creating Materialized Views

Creating materialized views is as easy as creating regular ones, even as simple as running the query itself. The most basic case would use default configuration (which is sufficient in most cases) and have your view ready in a couple of seconds. However, learning the configuration and parameters can be crucial for efficiency. Luckily, there are only so many parameters to learn, and a detailed official description can always be helpful. Firstly, let’s look at the syntax.

CREATE MATERIALIZED VIEW mv_name
[ BACKUP { YES | NO } ]
[
    DISTSTYLE { EVEN | ALL | KEY } 
    DISTKEY ( distkey_identifier )
    SORTKEY ( column_name, ... )
]   
[ AUTO REFRESH { YES | NO } ]
AS query 

As you can see, it’s as simple as creating a regular view. 

Parameters

  • BACKUP – parameter is used to define if the MV will be backed up with your redshift instance, the default value is YES, but you might want to consider setting it to NO in development (or even production) if you have very large tables and you plan to recreate the view multiple times
  • DISTSTYLE – stands for Distribution style, and simply said, it defines how your data is stored across the nodes in the cluster. You can read more about distribution styles in Amazon’s official documentation
  • DISTKEY – stands for Distribution key and represents the value that will be used for distribution (if the distribution style is set to KEY). You can read more about distribution styles here
  • SORTKEY – stands for Sort key and represents the column that will be used for sorting the Material View. When data is initially loaded into the empty table, the rows are stored on the disk in sorted order. This key should be something that makes sense to be sorted, for example, dates. You can read more about sort keys here
  • AUTO_REFRESH – Defines if the MV’s should be automatically refreshed or incrementally. This is one of the most powerful properties of Materialized Views. The default value is NO
  • AS – represents a valid query that should be used to create the MV. This query must be simple, and the specific SQL constructs can determine if the MV can be refreshed incrementally or fully. This is the part that has the most limitations.

Refreshing Materialized Views

One of the main concepts that differentiate Materialized Views from regular views is the fact that Materialized Views are actual tables with metadata about their state (refresh status, dates, etc.). Refreshing the MV can be completely automatic, which means that Redshift will monitor the base table(s) and refresh the MVs when the database has the least load. This can be particularly useful when you don’t need real-time data and when you don’t want to worry about refreshing. On the other hand, if you choose manual refresh, you are entirely in charge of the refresh schedule. This approach is recommended if you have some fixed schedule or if creating the materialized views is a part of a pipeline/process. The syntax for refreshing materialized views is as follows:

REFRESH MATERIALIZED VIEW mv_name;

How to bypass the problem of non-incrementally refreshable MVs?

If your end goal is to have incrementally refreshable MV, but your query is too complex, you will probably have many problems while trying to make it work. The easiest solution for this problem is to break the complex query into at least two smaller queries, which (hopefully) will meet the criteria mentioned above. Then, you can use/reference MVs instead of the subqueries, which should solve the problem. The only thing that you need to be aware of is the fact that MVs that are dependent on each other must be refreshed in the reverse order that they were used. For example, if MV A is referencing MV B, which referenced MV C, then you should refresh the MVs in this order C => B => A. Trying to refresh them in any other order will produce an error.

Understanding the importance of the STV_MV_INFO table

If your query editor (like the redshift query editor) doesn’t return any information about the query or if you are interested in the current state of the MV, you can look it up in this table. Besides the regular information about database name, schema, MVs name, and owner’s user name, there are other, more interesting columns to review:

  • updated_upto_xid – represents the last xid that was updated in the MV. XIDs are IDs of rows that redshift internally maintains
  • is_stale – represents the state of the data inside of the MV. If the value is ‘t’, the data is stale and should be refreshed (or refreshed in case of auto-restart). Otherwise, if the value is set to ‘f’, the MV’s data is up to date with the base table data
  • state – represents the current refresh state of the column. Value ‘0’ represents that the MV can’t be incrementally refreshed because the main query violates some limitations for the query. Value ‘1’ means that the MV can be incrementally refreshed, but there are special states that represent an MV that can be incrementally refreshed after some issues are fixed
  • 101 – can’t access one or more columns from the base table. In most cases, this means that there were drop actions on the base table
  • 102 – can’t be refreshed because of the column type change in the base table. This constraint applies even if the column isn’t used in the MV
  • 103 – can’t be refreshed because the base table changed the name
  • 104 – can’t be refreshed because the column from the base table changed its name. This constraint applies even if the column isn’t used in the MV
  • 105 – can’t be refreshed because the schema name changed
  • autorewrite – value‘ t’ represents that the MV’s query can be automatically rewritten. Value ‘f’ means that the query will be updated if there are no other problems
  • autorefresh – value’ t’ represents that the MV will be automatically refreshed, while’ f’ means it won’t. This option is configured with configuration.

Every error code requires a full refresh. After the full refresh MV gets analyzed and updated with a new value for the state column.

General Tips and Good Practices

Tips and notes that are worth mentioning regarding Materialized Views:

  • INSERT, UPDATE, DELETE, and other data manipulation SQL structures will mark the MV as stale, which means that the automatical refresh would get triggered
  • You can use a materialized view as part of another materialized view. Keep in mind that, in this case, you need to update the MVs one by one sequentially. Also, in this case, updating the MV can mark every table that depends on it as stale, even though it wouldn’t change anything
  • It isn’t recommended that you manually update values in the STV_MV_INFO table
  • You can use any service (Lambda, EventBrdige, your service) that can connect to the Redshift cluster to refresh the materialized view. This can be particularly useful when the creation of MVs is part of a pipeline or one step in a stream
  • You can only alter the autorefresh property of a material view; every other change requires recreation
  • Try to find a good working SORTKEY and DISTKEY because they can be crucial for performance
  • If you reference any date functions inside a Materialized View, it will most likely be marked as stale even if nothing changes in the base table
  • Keep in mind that MVs represent the current state of the base table/s and won’t keep any historical data. If you update a column value in the base table, it will get recomputed for every row that references it, even if you use incrementally refreshable materialized views
  • Prefix or suffix the materialized view name with “mv_” or “_mv” based on your accepted naming convention.

Interested in more?

Click here for Atlantbh Blogs and Success Stories about Software Development.

Protractor parallel execution
QA/Test AutomationTech Bites
May 12, 2023

Protractor parallel execution

Why Parallel Testing? When designing automation test suites, the ability to run tests in parallel is a key feature because of the following benefits: Execution time - Increasing tests efficiency and, finally, faster releases Device compatibility - There are many devices with various versions and specifications that need to be…
Introduction to GraphQL
QA/Test AutomationTech Bites
May 11, 2023

Introduction to GraphQL

In today's Tech Bite topic, we will get familiar with GraphQL and explain its growth in popularity in recent years as the new standard for working with APIs. What is GraphQL? GraphQL stands for Graph Query Language. It is an API specification standard and, as its name says - a…
IQR in Automation Testing
QA/Test AutomationTech Bites
April 25, 2023

IQR in Automation Testing: Unleashing the Data Analytics Potential

Usually, when people talk about statistics, they think only about numbers, but it is much more.  Statistics is an irreplaceable tool in different fields like data analysis, predictive modeling, quality control, scientific research, decision-making models, etc. Overall, statistics provide a valuable toolset for understanding, describing, and making predictions about data,…

Leave a Reply