Imagine entering the world of analytics
Are you a film enthusiast? Can you name all the thrillers released between 2010 and 2015 together with their directors and leading roles? If yes, good for you! But let’s imagine for a moment that you know as much about movies as the average person that works in an IT company.
Now, let’s imagine the following situation: You were given a task by one of the most famous online movie databases to analyze a dataset they’re considering purchasing to enrich their base with. It’s your task to assess the quality of the dataset and based on your recommendation, further actions regarding the purchase will be taken. Oh, and yes… You have a deadline too.
What does quality mean anyway?
First things first: What does quality mean anyway? Well, I’m sure you want to drive a car of a certain quality: you want it to be safe and robust, as well as to have comfortable seats, AC, maybe even parking sensors or a reverse camera. Even though you love your car, you don’t want to spend an afternoon off driving it – you want to spend quality time with your family and friends as well, so you go to a park or out for a drink.
So, how would you define quality? Obviously, there isn’t a straight-forward way to do that. Probably the most appropriate explanation would be that quality means meeting one’s expectations.
Your client, for example, expects to get data that has a valid format and contains the correct information and it’s your task to assure they get what they want. A different outcome may cause many problems for your client: a decreased number of users, loss of business partners, which could later lead to a drop in the business’ value, etc. Such a situation would, of course, reflect poorly on your company, not only in the loss of that client but also their reputation.
Making friends with your dataset
Dataset consists of 10,000 records where each record represents one movie. Each movie is described with 11 attributes (columns in the provided table), which means you have 110,000 cells to analyze. Take a look at a small portion of your dataset below:
Always spend some time scrolling through the entire dataset, write down whatever appears suspicious (it might later turn out to be or not to be an issue), and make sure you understand what each column represents. Once you’re fully familiar with all the columns and what they should contain, move on to the analysis.
Let the analysis begin!
During your scroll through the dataset, you must have noticed column ID. Almost always this column represents an identifier of a record and is used when referring to that record. Therefore, all IDs should be unique (there shouldn’t be duplicate IDs). The first step of your analysis should be to check if there are non-unique IDs. If you find such cases, make sure to check if double IDs refer to the same or to different movies and try to come to a conclusion about why such duplicates exist. However, make sure to mark them and report them to your client.
The next step is to check if there are missing values in the analyzed dataset. It is obvious even from the sample from the last paragraph that not all fields fulfill the requirement of completeness. Find the percentage of missing data per column and write it down – this is one of the most important factors in quality assessment. Incomplete rows in a dataset will later represent incomplete information in a movie database, which is not acceptable. Imagine that you’ve seen a great movie and you want to find a few others by the same director. You know the movie’s title, so you look it up in this very database and… there is no information about the director – the field where their name should be is empty. I don’t know about you, but I’d probably use another source of information next time.
Now that you know how many values you actually have, it’s time to find out how many valid values are amongst them. What does “valid” mean anyway? Field validity depends, of course, on the nature of the field. If you are analyzing e.g. field “Cast”, you expect it to contain names of actors, which means that numbers or special characters should not be a part of this field.
On the other hand, if e.g. “Year released” is a field of interest, then only four-digit numbers are expected to be present.
Depending on the analyzed field, there are different analyses to be applied in order to assess its validity.
a) Title, Description
Most database users will use a title when looking for information about a movie. Therefore, this field’s validity is crucial. Here are a few tips on how to assess the quality of a movie’s title and description:
– Look for records that do not contain a single alphanumeric. Such titles/descriptions are most certainly invalid.
– Look for records which contain alphanumeric, but special characters as well. Pay attention to those that are less likely to be part of the movie title or description: %, #, @, etc. Even though it’s not expected for these fields to contain such characters, it is not impossible, so don’t just assume they are invalid – do a sample check. Choose 5-20 records (depends on the total number of records that meet the criteria) and double-check them on the Internet. This will give you much better insight.
– Look for dummy values. Are there records with title/description such as “Test”, “Name”, “NA”, “Not available”, “Title”, “Description”? Are there extremely short titles/descriptions? Make sure you double-check them!
– Look for veeery long titles/descriptions. Do they look suspicious? Do a sample check!
b) Year released
– Look for non-four-digit numbers. If there are such records, check what they represent. Are those just dummy values or maybe a shorter way to write a year, full date, or year written with letters rather than digits? Anyhow, make sure you mark such records.
– Look for suspicious values. Are there years older than the 1920s or younger than 2019 (or your current year)? That is definitely a red flag.
c) Director, Cast
– Since these fields contain personal names anything different than letters is potentially suspicious. But keep in mind that non-English names are likely to appear, which means you have to count on letters that are not part of the English alphabet, as well as on e.g. Chinese or Japanese characters.
– Look for dummy values using the same principles as for the fields Title and Description.
d) IMDB Score, Metascore
– These columns are expected to contain only numbers, so anything different than that is most probably an invalid value. If you find such cases, make sure you double-check them. Are they just dummy values or maybe scores provided in a different format, e.g. “seventy” instead of “70”? Even though a different format might seem OK to you, keep in mind that this dataset is about to be used in an existing database. Since existing scores in the database are numerical values, the ingestion of a string is not expected in that column and will cause problems. Even if ingested, such records would cause inconveniences, e.g. when sorting by IMDB score.
– Watch out for scores that are out of range. What is the score range for IMDB? And what about Metascore? Once you have this information, you can easily check if there are invalid score values in your dataset.
– Find a frequency distribution of genres. Are there any dummy values? Are there suspiciously frequent genres? Make sure to do a sample check for such cases.
– Firstly, check if all records have the length in the same format (e.g. 123 min).
– Check if there are dummy values (e.g. 0 minutes or negative values) or suspiciously short/long movies.
– You can also make a frequency distribution and see if there is a length that appears very often and do a quick sample check on that subset.
– After doing basic format checks, you can check if there are movies that earned suspiciously high values. Or are there any who earned (almost) nothing?
Even after all fields have been analyzed, you still can’t claim the quality of the dataset. Since fields are mutually connected, their connections should be examined as well.
Imagine, for instance, a record for which no irregularities were found in any of the columns, but no movie with the provided title was released in the provided year. Another scenario could be an actor who actually didn’t act in a movie that the dataset claims he was in. Therefore, it is necessary to do a so-called field-to-field check. This, in most cases, means doing sample checks. You can for example google a certain number of movie titles and see if the provided actors, directors, years of release, etc. are corresponding.
Another sort of field-to-field check could be performed on the fields IMDB Score and Metascore. Are there movies with one extremely high score and yet another extremely low? Isn’t that suspicious?
Since your client’s goal is the enrichment of their movie database, they will most certainly be interested in knowing the number of distinct movies in the analyzed dataset. For that reason, don’t forget to find out how many of them are there. You’ll first need to determine what a duplicate is. Are two movies with the same title duplicate records? Not necessarily. And what if they were released the same year? Or directed by the same person? A right combination of fields will assure you find and count distinct, as well as, duplicate records.
Just one more step left
Now that you’ve finished your analysis and have a messy bunch of papers on your table and even a messier bunch of files on your computer, it is time to write a concise, precise, and understandable report for your client.
Non-unique IDs, duplicate records, and missing data should definitely be on the top of your report. Make sure you mention all significant irregularities you found ordered by the severity of their invalidity. You should consider writing not only a resume but also notes on the main advantages and disadvantages of the analyzed dataset.
Don’t forget to provide as many examples as possible for all the irregularities you found. It is much easier to explain a problem to someone if they are looking at it.
And just one more step left… Double check every conclusion that you’ve made, every sentence that you’ve written and every example that you’ve pointed to before sending a report to your client. Good luck!