Outliers: how do they happen and why are they important?
Outliers are extreme values in a dataset: records that differ significantly from the rest of the data. They can be abnormally low or abnormally high, and a dataset may contain more than one of them.
Outliers can have multiple causes. Some of the most common are data entry errors, incorrect sampling, inappropriately scaled values, or simply legitimate but unusual observations.
Outliers are an important factor in statistics and statistical modeling because they can significantly impact the results of an analysis. The presence of even one or a few extreme values in a small sample can completely skew the results, leading us to make decisions based on faulty data or to build less effective, less useful models.
That is why checking for and detecting outliers in the data we analyze is an essential step before any analysis, conclusion, or decision making.
How can we detect outliers?
There are several ways to detect outliers depending on how our data look and behave. In the following sections, we will look at the most commonly used ones with explanations and examples in Jupyter Notebook using Python.
Graphical representation
Visual inspection is one of the simplest ways to detect outliers. Whether we use a histogram or a scatterplot, we can identify outliers by looking for data points that fall far outside the range of the majority of the data. This gives us insight into possible outliers, but based on a graphical representation alone, we cannot always claim with certainty that they differ significantly from the rest of the data.
import numpy as np
import pandas as pd
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline
plt.style.use('ggplot')
Suppose we have the following NumPy array of 60 values:
array = np.array([102.21003263, 96.59534991, 94.22251458, 95.95968445, 93.96710573,
                  97.56147721, 110.35343738, 105.80753411, 102.52413736, 85.37454047,
                  84.51658677, 102.65272728, 99.99103792, 89.84007599, 88.08047948,
                  115.555439, 90.66700079, 105.2653425, 98.50035419, 137.5045381,
                  105.22785746, 113.42154872, 115.5852127, 112.55994075, 109.08603305,
                  91.5909862, 110.05585097, 89.76195821, 101.81773913, 98.83319054,
                  119.78408749, 119.17444677, 89.68993001, 95.16688086, 93.54919851,
                  100.60348157, 105.47930439, 108.45148465, 87.92818679, 103.74840337,
                  109.42347828, 96.48736071, 104.59735029, 99.51483336, 104.15481911,
                  92.98123865, 101.85464465, 104.68349778, 100.39523351, 107.1614054,
                  95.18311603, 99.47508811, 97.27721446, 103.07985259, 82.39346405,
                  111.50223951, 97.82268931, 87.7941262, 96.56384564, 115.48762187])
We can use seaborn's displot to create a histogram:
sns.displot(array, kde=True)
We can notice that the distribution mostly follows a normal curve but contains one high value that does not fit the rest of the data and represents a possible outlier. This value may also affect a test of normality.
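Since visual inspection also works with a scatterplot, here is a minimal sketch (not part of the original walkthrough) that plots each value against its index; the outlier shows up as an isolated point far above the rest:

# Plot each value against its position in the array.
plt.scatter(range(len(array)), array)
plt.xlabel('index')
plt.ylabel('value')
plt.show()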
Empirical rule for normal distribution and Z-scores
The empirical rule (also known as the 68-95-99.7 rule) states that for a normal distribution, around 68% of the data will fall within one standard deviation of the mean, 95% of the data will fall within two standard deviations of the mean, and 99.7% of the data will fall within three standard deviations of the mean.
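As a quick illustration (not from the original article), we can verify these percentages on a large simulated normal sample; the seed, mean, and standard deviation are arbitrary choices:

# Simulate a large normal sample and check the 68-95-99.7 rule.
rng = np.random.default_rng(42)
sample = rng.normal(loc=100, scale=10, size=1_000_000)
for k in (1, 2, 3):
    share = np.mean(np.abs(sample - sample.mean()) <= k * sample.std())
    print(f"within {k} standard deviation(s): {share:.4f}")

Running this prints fractions very close to 0.68, 0.95, and 0.997, matching the rule.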
If our data is normally distributed, we can proceed to calculate Z-scores. Z-scores standardize the data into a new dataset with a mean of 0 and a standard deviation of 1. A Z-score shows how many standard deviations a particular record is from the mean of the data.
z = (x − μ) / σ
Formula for computing the Z-score
To compute Z-scores, we need the mean (μ) and standard deviation (σ) of the data. After calculating the Z-scores, we check whether any values have an absolute score greater than 3, since 99.7% of the data falls in the range from -3 to 3. If we find such records, they represent outliers that significantly differ from the rest of our data.
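As a sketch of the formula above, Z-scores can also be computed by hand with NumPy before we reach for scipy (the variable names here are our own):

# Manual Z-scores: subtract the mean and divide by the standard deviation.
# np.std defaults to the population standard deviation (ddof=0),
# which matches the default behavior of stats.zscore used below.
mu = array.mean()
sigma = array.std()
manual_z_scores = (array - mu) / sigma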
While it is possible to use Z-scores to detect outliers in non-normal data, the results may not be as reliable or interpretable as they would be for normally distributed data. It is important to consider the specific distribution of the data and to use other methods to detect outliers if necessary.
We will use the Shapiro-Wilk test for normality on our array to examine whether our data is normally distributed.
stats.shapiro(array)
The Shapiro-Wilk test is a statistical test used to determine whether a given dataset follows a normal distribution. The null hypothesis of the Shapiro-Wilk test is that the data is normally distributed, while the alternative hypothesis is that it is not. Therefore, if the p-value of the test is less than the chosen significance level (e.g., 0.05), we reject the null hypothesis and conclude that the data does not follow a normal distribution. Simply put, the p-value tells us how likely it would be to observe data at least this extreme if the null hypothesis were true.
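stats.shapiro returns both the test statistic and the p-value; a small sketch (assuming the conventional 0.05 significance level) shows how the decision rule can be written out explicitly:

# Unpack the test statistic and p-value and apply the decision rule.
statistic, p_value = stats.shapiro(array)
print(f"W = {statistic:.4f}, p-value = {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0: the data does not appear normally distributed.")
else:
    print("Fail to reject H0: the data is consistent with a normal distribution.")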
In our case, the p-value is equal to 0.09, so at the significance level of 0.05 we cannot reject the null hypothesis, and we conclude that our data follows a normal distribution. Having established that our array is normally distributed, we can rely on the empirical rule to give us correct results. To calculate Z-scores, we can simply pass our array to stats.zscore().
z_scores = stats.zscore(array)
Since we are looking for values higher than 3 or lower than -3, we will check the minimum and maximum of the scores.
min(z_scores)
max(z_scores)
We can notice that there are no values lower than -3, but there is at least one value higher than 3. We can filter the scores and find the records with values higher than 3.
z_scores[z_scores > 3]
We found one record that has a score higher than 3 and it represents the outlier.
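The score alone does not tell us which original record it belongs to; reusing the same boolean mask on the array itself (a small step not shown in the original walkthrough) recovers the flagged value and its position:

# Recover the original record(s) behind the extreme Z-score(s).
mask = np.abs(z_scores) > 3
print(array[mask])        # the outlying value(s)
print(np.where(mask)[0])  # their index position(s) in the array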
Interquartile range and boxplot
The most important positional measure of central tendency is the median. The median is the value found at the exact middle of the sorted data: half of the records are smaller than or equal to it, and half are larger than or equal to it.
The interquartile range of the data is the difference between the third and first quartile. It is the range for the middle 50% of the data.
IQR = Q3 − Q1
Formula for computing the interquartile range
A boxplot is a graph that displays the middle 50% of the data inside a box, giving us a view of the data's dispersion. Records that lie more than 1.5 box lengths (1.5 × IQR) beyond the box represent outliers:
Lower whisker boundary = Q1 − 1.5 × IQR
Upper whisker boundary = Q3 + 1.5 × IQR
We will return to our array and implement this method of detecting outliers. First, we calculate the first and third quartiles. Then, with those two values, we calculate the interquartile range and, finally, the boundaries of both whiskers.
# calculate first and third quartile
q1 = np.percentile(array, 25)
q3 = np.percentile(array, 75)

# calculate interquartile range
iqr = q3 - q1

# calculate upper and lower whisker boundary
upper_boundary = q3 + 1.5 * iqr
lower_boundary = q1 - 1.5 * iqr

lower_boundary, upper_boundary
We will now filter our array to find any records that fall outside these boundaries.
array[(array < lower_boundary) | (array > upper_boundary)]
The output shows that one value falls outside the calculated boundaries and represents the outlier. Finally, we can inspect all of this through a boxplot and notice the same result.
sns.boxplot(y=array, width=0.20, flierprops={"marker": "x"})
sns.stripplot(y=array)
Next steps?
Sometimes outliers are significant pieces of information and should not be ignored. Other times, they occur because of errors or misinformation and should be ignored or fixed. Before considering eliminating outliers from the data, we should try to understand why they appeared in the first place.
There are many ways to handle outliers, which is a topic in itself, and there is no quick, universal fix. In most cases, experience and domain expertise play a big role in deciding how best to handle them. Some of the most common options are to delete the outliers, replace them with different values, or transform the data to reduce the variation.
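As a minimal sketch of these three options, reusing the lower_boundary and upper_boundary computed earlier, the approaches could look like this in NumPy:

# 1. Delete: keep only the records inside the whisker boundaries.
cleaned = array[(array >= lower_boundary) & (array <= upper_boundary)]

# 2. Replace (cap/winsorize): clip extreme values to the boundaries.
capped = np.clip(array, lower_boundary, upper_boundary)

# 3. Transform: a log transform reduces the influence of large values
#    (valid here because all values are positive).
transformed = np.log(array)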
Wrap up
In conclusion, detecting outliers is an important step in data analysis as they can skew the results and can lead to incorrect conclusions. Various methods can be used for detecting outliers, including graphical representation, Z-scores, interquartile range, and others. Depending on the data characteristics and goals of the analysis, a combination of methods may help us to achieve the most accurate results.
“Outliers in data and how to detect them” Tech Bite was brought to you by Amar Aladžuz, Junior Data Analyst at Atlantbh.
Tech Bites are tips, tricks, snippets or explanations about various programming technologies and paradigms, which can help engineers with their everyday job.