Data Quality Analysis at Its Simplest
Maybe you have heard the expression ‘Data is the new oil’, coined by Clive Humby, one of the creators of Tesco’s Clubcard. But what does it mean? Does it imply that today data is as valuable to the economy as oil? Certainly. However, there is another important aspect. When we think of oil as something of value, we think of its refined forms, such as petroleum, naphtha, etc. Just as with oil, ‘unrefined’ data does not bring significant value to business and the economy.
In order for data to be usable and useful, it needs to undergo several stages of data analysis that ultimately lead to better business decisions. Consider millions of customer records containing information about jobs, social activities, and education. Data analysis requires inspecting the data from different angles; for example, identifying patterns and trends in customer data can help predict future purchase behaviour.
Not all data is suitable for decision-making. The first step after collecting data is inspecting its quality, in order to find out whether the data we own is useful raw material for business decisions. Quality comes before quantity, which is not to diminish the importance of quantity: a sufficiently large volume of data is still vital for accurate decisions. In general, the more high-quality data you have, the more accurate your decisions can be.
Data quality can be observed at different levels. We are going to concentrate on the following ones:
- Data completeness
- Data uniqueness
- Data validity
- Data consistency
Data completeness refers to inspecting missing (null) values in the dataset. However, some values function as nulls even though they are not literally null. For example, in a Name column, the value ‘No Name’ is equivalent to a null value. Such values are called dummy values and will be inspected in the data validity section.
Missing data can be observed at the column and row level, depending on what we are aiming to find out. If one column holds great importance for us, then the percentage of missing data per column is more significant than missing data per row. However, if most columns are of similar importance, then observing missing data at the row level gives a clearer picture of how much relevant data is missing.
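As a minimal sketch of both views of completeness (the column names and sample records below are made up for illustration, with `None` marking a truly missing value):

```python
# Hypothetical customer records; None marks a missing value.
rows = [
    {"name": "Ana",   "job": "engineer", "education": None},
    {"name": None,    "job": "teacher",  "education": "BSc"},
    {"name": "Marko", "job": None,       "education": None},
]

columns = ["name", "job", "education"]

# Column-level completeness: share of missing values in each column.
missing_per_column = {
    col: sum(r[col] is None for r in rows) / len(rows)
    for col in columns
}

# Row-level completeness: share of missing values in each record.
missing_per_row = [
    sum(r[col] is None for col in columns) / len(columns)
    for r in rows
]
```

Depending on which view matters more for the analysis, either dictionary of ratios can be reported or used as a filter threshold.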
The first association when we mention uniqueness in a dataset is duplicated rows. In addition to that, finding out how many distinct values there are per column gives us a picture of data diversity. Distinct values may also offer insight into data validity: a greater or lower number of distinct values than expected can indicate issues in the provided values.
For example, let’s say we have two columns: Country and Capital City. A discrepancy between the number of distinct values in those two columns might indicate a quality issue, because we do not expect two countries to share the same capital city, or vice versa.
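The Country/Capital check above can be sketched like this; the records are hypothetical, with one deliberately suspicious value:

```python
from collections import Counter

records = [
    {"country": "France",  "capital": "Paris"},
    {"country": "Germany", "capital": "Berlin"},
    {"country": "Austria", "capital": "Berlin"},  # suspicious: shared capital
]

distinct_countries = {r["country"] for r in records}
distinct_capitals = {r["capital"] for r in records}

# A mismatch between the two distinct counts flags a potential issue.
counts_match = len(distinct_countries) == len(distinct_capitals)

# Pinpoint the capitals that appear for more than one country.
capital_counts = Counter(r["capital"] for r in records)
shared_capitals = [cap for cap, n in capital_counts.items() if n > 1]
```

Here `counts_match` ends up `False`, and `shared_capitals` names the offending value so it can be checked against the source.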
Another aspect of data uniqueness is having a column, usually an ID column, that uniquely identifies every single row in the dataset. Its importance becomes obvious when we deal with two deliveries of the same dataset and need to find out what differs between them. A unique ID column whose values are stable across both datasets (IDs were not changed between deliveries) will make our job significantly easier and faster.
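With a stable ID, diffing two deliveries reduces to set operations on the keys. A sketch under that assumption (the IDs and records are invented for illustration):

```python
# Two hypothetical deliveries of the same dataset, keyed by a stable ID.
delivery_1 = {101: {"name": "Ana"}, 102: {"name": "Marko"},     103: {"name": "Ivan"}}
delivery_2 = {101: {"name": "Ana"}, 102: {"name": "Marko Jr."}, 104: {"name": "Lea"}}

added   = delivery_2.keys() - delivery_1.keys()   # IDs new in delivery 2
removed = delivery_1.keys() - delivery_2.keys()   # IDs dropped from delivery 1
changed = {
    key for key in delivery_1.keys() & delivery_2.keys()
    if delivery_1[key] != delivery_2[key]          # same ID, different content
}
```

Without a stable ID, the same comparison would require fuzzy matching on the record contents, which is far slower and more error-prone.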
Inspecting data validity depends on the type of data. For example, if we deal with personal names, we do not expect digits in the values. Besides that, we might check name lengths to find extremely short or long names, which are likely suspicious and worth sample checks. If we deal with numeric data, then having a validity range is useful for detecting potentially invalid values in the dataset.
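The three validity checks just mentioned can be sketched as follows; the sample values and the length and age thresholds are assumptions chosen for illustration, not universal rules:

```python
import re

names = ["Ana", "J0hn", "X", "Bartholomew-Archibald Fitzgerald III"]

# Names containing digits are almost certainly invalid.
with_digits = [n for n in names if re.search(r"\d", n)]

# Extremely short or long names deserve a sample check;
# the 2-40 character range is an assumed threshold.
suspicious_length = [n for n in names if not (2 <= len(n) <= 40)]

# For numeric data, a validity range catches impossible values.
ages = [34, -2, 27, 180]
out_of_range = [a for a in ages if not (0 <= a <= 120)]
```

Each list is a candidate set for manual sample checks rather than values to delete automatically.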
In the case of phone numbers, we will probably have a format we expect the data to follow. Also, these kinds of columns are very prone to dummy values, for example ‘Not given’ and ‘Unknown’.
Now that we have mentioned dummy values, here is a list of some potential dummy values in English that could be searched for in a dataset:
- No name
- Not given
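A simple way to hunt for such values is a case-insensitive blocklist search; the blocklist below is a hypothetical starting point that would be extended per dataset:

```python
# Hypothetical blocklist of dummy values; matching is case-insensitive
# so that 'NO NAME' and 'no name' are both caught.
DUMMY_VALUES = {"no name", "not given", "unknown", "n/a"}

values = ["Ana", "No Name", "NOT GIVEN", "Marko"]

dummies_found = [v for v in values if v.strip().lower() in DUMMY_VALUES]
```

The matches can then be counted per column and treated as nulls in the completeness analysis.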
Another thing to keep in mind when inspecting data validity is special characters. In some columns they are acceptable, such as the plus and dash signs in a Phone column. In other columns we do not expect them, and such values might be equivalent to null (e.g. a slash (/) used instead of leaving the field blank) or indicate another data issue (e.g. ‘Mega Supermarket (in New York)’).
The second example might be unclear at first, so let’s dwell on it a little. If we want to derive useful information from data, we expect every column to hold what its header says. For example, the Market Name column should contain only names of markets, in our case ‘Mega Supermarket’. Additional data in that column, such as the market location, indicates a validity issue; the extra information itself would be useful, but only if it were placed in the proper column, for example Location.
One more thing to keep in mind is checking for characters that do not belong to the expected language or script, which might indicate encoding issues that can seriously degrade data quality. For example, if we are dealing with data in the English language, the value ‘Auštin’ (where ‘Austin’ was expected) indicates that the dataset potentially contains invalid or partially corrupted values.
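For English-language data specifically, flagging any value with non-ASCII characters is a cheap first pass; the sample values are made up, and for other languages the check would need an allowed-character set instead:

```python
# Flag values containing characters outside ASCII; for English-language
# data these are worth a manual look (they may still be legitimate).
values = ["Austin", "Auštin", "New York"]

flagged = [v for v in values if not v.isascii()]
```

Flagged values are candidates for review, not automatic rejects, since legitimate non-ASCII names do occur even in predominantly English datasets.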
Inconsistent data can distort statistical information about the dataset. For example, if we want to count the distinct values mentioned in the uniqueness section, inconsistent data may produce false results. Likewise, having phone numbers with a leading ‘00’, a leading ‘+’, or neither in the same Phone column makes it harder to judge validity, because it becomes difficult to determine whether a number is missing digits or has extra ones.
Another example frequently seen in datasets is the use of abbreviations versus full forms. In a City column we might have values such as ‘San Francisco’ and ‘SF’, which have the same meaning but will distort the analysis results if not standardized properly.
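Both consistency issues, prefix conventions and abbreviations, are usually handled by normalizing to a single form before analysis. A sketch, where the alias table is a hypothetical example that would be built per dataset:

```python
# Hypothetical mapping of known abbreviations to full city names.
CITY_ALIASES = {"SF": "San Francisco", "NYC": "New York"}

def normalize_phone(number: str) -> str:
    """Represent the international prefix uniformly as '+'."""
    if number.startswith("00"):
        return "+" + number[2:]
    return number

def normalize_city(city: str) -> str:
    """Expand known abbreviations; leave unknown values unchanged."""
    return CITY_ALIASES.get(city, city)
```

Once normalized, distinct-value counts and length checks give a truthful picture instead of counting ‘SF’ and ‘San Francisco’ as two different cities.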
That was a brief introduction with short guidelines for the data quality analysis process. There are, of course, many more things that could be mentioned, especially when concentrating on a specific type of data, for example spatial data. An important thing to keep in mind is that every dataset is a story of its own, and following strict guidelines for the data analysis process might lead to missing important points that affect the end result. So, after checking for the ‘standard’ issues in a dataset, one should dig in and search for any specifics that might indicate validity issues.
If you are interested in looking into a practical example of the data quality analysis process, here is a great blog post about it.