We live in a world of data. Everything we do, both online and offline, generates it, from our internet activity and purchases to our physical movements and interactions. This data contains valuable insights and knowledge that can be used to enhance our lives.
It is safe to say that the companies that can efficiently collect, analyze, and exploit data will be the most successful in the digital age. Data has become an immensely valuable resource: governments, businesses, and individuals use it to make data-driven decisions that can be more accurate and reliable than those based on intuition or assumptions.
Data analysis certainly plays an indispensable role in today’s data-driven world. It can be described as the process of examining large datasets to discover patterns, correlations, and trends. Clustering is one data analysis technique that focuses on gaining valuable insights by finding patterns in data. In this Tech Bite, we will first define what clustering is and then explain two density-based clustering algorithms: DBSCAN and OPTICS.
What is Clustering?
Machine learning establishes a clear distinction between supervised (e.g., classification) and unsupervised (e.g., clustering) tasks. The main difference is that classification requires labeled data to predict the class of an input, while clustering uses unlabeled data and groups similar inputs together based on their characteristics. Generally speaking, clustering is the more challenging problem of the two.
Clustering is commonly defined as the process of discovering structure in data by grouping related objects together, and the resulting groups are referred to as clusters. A cluster is a group of objects that are more similar to each other than to objects in other clusters. If we think of those objects as points in data space, we can represent their similarity using a certain distance measure. In this way, points closer to each other are more likely to be grouped in the same cluster, while points far apart are more likely to be assigned to different clusters.
There are many types of clustering algorithms, including centroid-based, hierarchical, distribution-based, and density-based clustering. Each type has its strengths and weaknesses and is suitable for different kinds of data.
DBSCAN
The term “density-based clustering” refers to a group of clustering algorithms that group data points based on their closeness to dense regions. In other words, the main concept behind these algorithms is straightforward: given an input set of data points, group them in a way that accurately reflects the underlying data density. Their main strengths are the ability to find arbitrarily shaped clusters, robustness to varying amounts of noise, and the fact that they do not require the number of clusters to be specified in advance. One of the most widely used density-based clustering algorithms is certainly DBSCAN (‘Density-Based Spatial Clustering of Applications with Noise’). [1]
Understanding DBSCAN
DBSCAN requires two main hyperparameters, namely:
- Epsilon (ε) – the maximum distance between two points for one to be considered in the neighborhood of the other. The distance can be measured with any distance function (e.g., Euclidean distance).
- minPoints – the minimum number of points in a point’s neighborhood (including the point itself) for it to be considered a core point.
Using these hyperparameters, DBSCAN classifies the points of a dataset into the following categories (a short code sketch after Figure 1 makes this classification concrete):
- Core points – a point p is a core point if at least minPoints points (including p itself) are within distance ε of it.
- Directly reachable points – a point q is directly reachable from p if q is within distance ε of core point p.
- Reachable points – a point q is reachable from p if there is a chain of points p1, …, pn with p1 = p and pn = q, where each point in the chain is directly reachable from the previous one. This means that all points in the chain, except possibly q, must be core points.
- Noise points (or outliers) – if a point is not reachable from any other point, it is considered to be an outlier or noise point.
Figure 1. DBSCAN point classification (Anja Plakalovic)
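To make these definitions concrete, here is a minimal sketch in plain NumPy (the function name and structure are illustrative, not part of any library) that labels each point of a dataset as core, border (directly reachable from a core point but not core itself), or noise:

import numpy as np

def classify_points(X, eps=0.5, min_points=5):
    # Pairwise Euclidean distances between all points
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Neighborhood counts include the point itself (its distance is 0)
    is_core = (dists <= eps).sum(axis=1) >= min_points
    labels = []
    for i in range(len(X)):
        if is_core[i]:
            labels.append("core")
        elif (dists[i][is_core] <= eps).any():
            # Within eps of some core point: directly reachable (border)
            labels.append("border")
        else:
            labels.append("noise")
    return labels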
Abstract version of the DBSCAN algorithm
The DBSCAN algorithm starts with an arbitrary point p and finds its ε-neighborhood. If p is a core point, it is assigned to a new cluster, which is then expanded by assigning all of p’s neighboring points to it. If another core point is found within this cluster, the neighborhood is expanded again to include all of its neighboring points. This is repeated until no more points can be assigned to the cluster, at which point the cluster is complete. The remaining points are then processed: whenever another core point is found, a new cluster is created and the process repeats. The algorithm terminates once all points have been processed.
Figure 2. Demo of DBSCAN algorithm (Anja Plakalovic)
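The expansion loop described above fits in a few lines of code. The following is a brute-force sketch under the same definitions (illustrative only; scikit-learn’s DBSCAN, used later in this post, is the practical choice):

import numpy as np

def dbscan_sketch(X, eps=0.5, min_points=5):
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)  # -1 means "noise" until assigned to a cluster
    cluster = 0
    for p in range(n):
        # Skip points that are already assigned or are not core points
        if labels[p] != -1 or len(neighbors[p]) < min_points:
            continue
        labels[p] = cluster
        seeds = list(neighbors[p])  # expand the cluster from core point p
        while seeds:
            q = seeds.pop()
            if labels[q] == -1:
                labels[q] = cluster
                # If q is also a core point, its neighbors join the cluster
                if len(neighbors[q]) >= min_points:
                    seeds.extend(neighbors[q])
        cluster += 1
    return labels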
Introducing OPTICS: an Extension of DBSCAN
The inability to detect clusters of varying density represents a significant limitation of DBSCAN. This disadvantage stems from the fact that DBSCAN uses one constant distance value (ε) together with one density threshold (minPoints) to determine whether a point is in a dense neighborhood. In this way, the DBSCAN algorithm actually assumes that the densities of different clusters are equal. However, many real-world datasets share the common property that their internal cluster structure cannot be characterized by global density parameters.
It did not take long for researchers to identify this deficiency and propose a suitable solution. In 1999, three years after the DBSCAN algorithm was published, several of its authors developed OPTICS as an extension of DBSCAN. [2] The main difference between the two is that OPTICS generates a hierarchical clustering result for a variable neighborhood radius.
Understanding OPTICS
OPTICS (‘Ordering Points To Identify Clustering Structure’) is an augmented ordering algorithm, which means that instead of assigning cluster memberships, it stores the order in which the points are processed. OPTICS requires the same ε and minPoints hyperparameters as DBSCAN, but with one important difference: the ε parameter is theoretically unnecessary. It is set explicitly only for the practical purpose of reducing the algorithm’s runtime complexity. In the following examples, we will assume that ε is set to a very large value (i.e., ∞), as is the case in many implementations of the OPTICS algorithm.
In addition to the DBSCAN concepts mentioned above, OPTICS introduces two more terms, namely:
- Core distance – the minimum distance required for a point p to be a core point; equivalently, the distance from p to its minPoints-th nearest neighbor (counting p itself). If p is not a core point, its core distance is undefined.
- Reachability distance – the reachability distance of a point q with respect to another point p is the smallest distance such that q is directly reachable from p, provided p is a core point. It cannot be smaller than the core distance of p, since for smaller distances no point is directly reachable from p. If p is not a core point, the reachability distance of q with respect to p is undefined. (A sketch of both distances follows Figure 3.)
Figure 3. OPTICS core and reachability distances (Anja Plakalovic)
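Assuming ε = ∞ (so every point is a core point and neither distance is undefined), both quantities reduce to a few lines of NumPy. The helper names below are illustrative:

import numpy as np

def core_distance(dists_from_p, min_points=5):
    # Distance from p to its min_points-th nearest neighbor,
    # counting p itself (whose distance to itself is 0)
    return np.sort(dists_from_p)[min_points - 1]

def reachability_distance(q, p, dists, min_points=5):
    # The larger of p's core distance and the actual distance d(p, q)
    return max(core_distance(dists[p], min_points), dists[p, q])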
Reachability Plot
The OPTICS algorithm generates the reachability plot, which lists the points in the order they were processed together with their reachability distances. To build it, OPTICS starts with an empty seed-list, picks a random point p, finds its ε-neighborhood, and determines its core distance. The reachability distance of this first point is set to undefined, and p is written to the output list. If p is not a core point, the algorithm simply picks another random point from the input dataset. If p is a core point, the reachability distance of each neighboring point q with respect to p is calculated. All neighboring points are then inserted into the seed-list, which is sorted in ascending order of reachability distance.
In the next iteration, OPTICS takes the point at the top of the seed-list and, if it is a core point, finds its ε-neighborhood, determines its core distance, and calculates the reachability distance of each of its neighboring points. If the newly calculated reachability distance of an unprocessed point is smaller than the one stored in the seed-list, the stored value is updated and the seed-list is re-sorted. If the current point is not a core point, OPTICS simply moves on to the next point in the seed-list. The process continues until every point has been processed and the seed-list is empty.
Figure 4. Demo of OPTICS algorithm (Anja Plakalovic)
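The loop just described can be sketched compactly if we again assume ε = ∞, so that every point is a core point and every other point counts as a neighbor (a simplification used only to keep the sketch short; real implementations restrict updates to the ε-neighborhood and use a priority queue instead of re-scanning a set):

import numpy as np

def optics_sketch(X, min_points=5):
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Core distance: distance to the min_points-th nearest neighbor
    core_dist = np.sort(dists, axis=1)[:, min_points - 1]
    reach = np.full(n, np.inf)  # infinity stands in for "undefined"
    processed = np.zeros(n, dtype=bool)
    ordering = []
    seeds = {0}  # start from an arbitrary point
    while seeds:
        # Take the seed with the smallest reachability distance
        p = min(seeds, key=lambda i: reach[i])
        seeds.discard(p)
        processed[p] = True
        ordering.append(p)
        # Update the reachability distances of all unprocessed points
        for q in np.flatnonzero(~processed):
            new_reach = max(core_dist[p], dists[p, q])
            if new_reach < reach[q]:
                reach[q] = new_reach
                seeds.add(q)
    return ordering, reach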
The ordering produced by the OPTICS algorithm can be visualized using a reachability plot, a special form of dendrogram. It is a 2-D plot with the points, in the order returned by OPTICS, on the x-axis and their reachability distances on the y-axis.
Figure 5. Reachability plot (Anja Plakalovic)
Extracting Clusters from the Reachability Plot
There are two basic ways to extract clusters from the reachability plot: manually, or automatically using various algorithms. The manual way amounts to either selecting a range on the x-axis or applying a threshold on the y-axis after visually inspecting the plot. Generally, clusters appear as valleys in the reachability plot: deeper valleys correspond to denser clusters, while shallower valleys correspond to sparser clusters.
Automatic approaches, on the other hand, detect valleys based on steepness, knee detection, or local maxima. For example, in Python and R, two methods can be used to extract clusters automatically:
- DBSCAN – ‘cuts’ the reachability plot horizontally at the specified eps_cl threshold.
- Xi – extracts clusters by detecting valleys based on their steepness, controlled by the specified xi threshold.
In the above examples of the DBSCAN algorithm, we used ε=0.5. Using the same value for the eps_cl threshold when extracting clusters from the reachability plot generated by OPTICS, we get the same result as with the DBSCAN algorithm. When using the Xi method with a threshold of 0.1, however, we get a different result: all points belong to the same cluster, and there are no outliers.
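In scikit-learn, both extraction routes are exposed as helper functions that operate on a fitted OPTICS model. A minimal sketch (with synthetic data standing in for the example dataset) might look like this:

from sklearn.datasets import make_blobs
from sklearn.cluster import OPTICS, cluster_optics_dbscan, cluster_optics_xi

X, _ = make_blobs(n_samples=60, centers=3, random_state=0)
optics = OPTICS(min_samples=5, metric="euclidean").fit(X)

# DBSCAN-style extraction: 'cut' the reachability plot at eps_cl = 0.5
labels_cut = cluster_optics_dbscan(
    reachability=optics.reachability_,
    core_distances=optics.core_distances_,
    ordering=optics.ordering_,
    eps=0.5,
)

# Xi extraction: detect valleys by steepness with xi = 0.1
labels_xi, _ = cluster_optics_xi(
    reachability=optics.reachability_,
    predecessor=optics.predecessor_,
    ordering=optics.ordering_,
    min_samples=5,
    xi=0.1,
)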
Figure 6. Extracting clusters using different methods (Anja Plakalovic)
DBSCAN vs. OPTICS: Practical Example in Python
Now that we have explained how the DBSCAN and OPTICS algorithms work, we can move on to a practical example in Python. In the code below, we first generate synthetic two-dimensional data using the make_blobs function from the scikit-learn library, creating three clusters with different densities. Next, we define the necessary parameters: eps (ε), min_samples (minPoints), and the distance metric. We then perform DBSCAN and OPTICS clustering using the corresponding scikit-learn classes and visualize the results with the user-defined plot_clusters function, which creates customized clustering plots. Finally, we generate the reachability plot using the reachability_plot function.
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.cluster import DBSCAN, OPTICS

plt.style.use("ggplot")
colors = ["#00ADB5", "#FF5376", "#724BE5", "#FDB62F"]
plt.rcParams.update({"font.size": 15})


def plot_clusters(algorithm, _df, _min_samples, _eps=""):
    unique_labels = set(_df.labels)
    for k, col in zip(unique_labels, colors[0:len(unique_labels)]):
        # Use black color for noise
        if k == -1:
            col = "k"
        # Use a different color per cluster and add labels
        plt.plot(
            _df.loc[_df.labels == k].x,
            _df.loc[_df.labels == k].y,
            "o",
            color=col,
            markeredgecolor="k",
            markersize=15,
            label=(f"Cluster {k+1}" if k != -1 else "Noise")
            + f" ({_df.loc[_df.labels == k].shape[0]})",
        )
    # Add legend and title
    plt.legend(loc="upper right")
    plt.title(
        f"{algorithm}: "
        + (f"eps={_eps}, " if _eps != "" else _eps)
        + f"min_samples={_min_samples}"
    )
    plt.show()


def reachability_plot(_df, model):
    # Get reachability distances and cluster labels in processing order
    reachability = model.reachability_[model.ordering_]
    labels = model.labels_[model.ordering_]
    unique_labels = set(labels)
    space = np.arange(len(_df))
    # Generate the reachability plot using a different color per cluster
    for k, col in zip(unique_labels, colors):
        xk = space[labels == k]
        rk = reachability[labels == k]
        plt.plot(xk, rk, col)
        plt.fill_between(xk, rk, color=col, alpha=0.5)
    # Show the ordering on the x-axis
    plt.xticks(space, _df.index[model.ordering_], fontsize=10)
    # Plot outliers
    plt.plot(space[labels == -1], reachability[labels == -1], "k.", alpha=0.3)
    # Add y-label and title
    plt.ylabel("Reachability Distance")
    plt.title("Reachability Plot")
    plt.show()


if __name__ == "__main__":
    # Generate data: three blobs with different densities
    centers = [[1, 1], [-2, -4], [5, -7]]
    data = make_blobs(
        n_samples=[30, 20, 10],
        centers=centers,
        cluster_std=[0.8, 1, 1.5],
        random_state=0,
    )[0]
    data = StandardScaler().fit_transform(data)
    df = pd.DataFrame(dict(x=data[:, 0], y=data[:, 1]))

    # Define parameters
    eps = 0.5
    min_samples = 5
    metric = "euclidean"

    # Perform DBSCAN clustering and visualize the results
    # (fit only on the coordinate columns; df will also hold a labels column)
    dbscan = DBSCAN(eps=eps, min_samples=min_samples, metric=metric).fit(
        df[["x", "y"]]
    )
    df["labels"] = dbscan.labels_
    plot_clusters("DBSCAN", df, str(min_samples), str(eps))

    # Perform OPTICS clustering and visualize the results
    optics = OPTICS(min_samples=min_samples, metric=metric).fit(df[["x", "y"]])
    df["labels"] = optics.labels_
    plot_clusters("OPTICS", df, str(min_samples))
    reachability_plot(df, optics)
It is interesting to note that in this particular case, even if we had not specified a single parameter of the DBSCAN class, we would have obtained the same results, because the default values match the ones used in our example: eps=0.5, min_samples=5, and metric='euclidean'.
In the figure below, we can see that DBSCAN managed to identify the two higher-density clusters and marked only one point of the first cluster as noise. However, it failed to find the third, sparse cluster. This illustrates the essential drawback of DBSCAN mentioned earlier: it does not give good results for clusters of varying density.
A simple solution could be to increase the value of the eps parameter so that the third cluster is identified correctly. However, this is a naive approach, as it is very likely to incorrectly merge the first two clusters into one.
Figure 7. DBSCAN clustering result (Anja Plakalovic)
When using the OPTICS algorithm, it is sufficient to specify only the min_samples parameter. To ensure that the same metric is used by both algorithms, the Euclidean distance must be specified explicitly for OPTICS, since its default metric is the Minkowski distance.
From the figure below, we can see that in this particular example, the OPTICS algorithm gives a convincingly better result than the DBSCAN algorithm and correctly identifies all three clusters.
Figure 8. OPTICS clustering result (Anja Plakalovic)
Since no cluster extraction method was explicitly specified, the default Xi method with xi=0.05 was used to extract clusters from the computed reachability distances and ordering. The reachability plot of the ordering returned by OPTICS clearly shows three valleys corresponding to the three identified clusters. The densest cluster, cluster 2, has the deepest valley, while the sparsest, cluster 3, has the shallowest.
Figure 9. OPTICS: Reachability plot (Anja Plakalovic)
Conclusion
Clustering is the process of grouping related objects together based on their common characteristics. Although it may initially seem like a simple task, it is not. Clustering is, in fact, one of the tasks in which humans outperform computers: people can visually cluster data remarkably well without any prior training. While nearly effortless for humans, clustering remains a challenging task for computers.
Over the years, different clustering approaches have been developed with varying performance levels. Among these approaches, density-based clustering algorithms stand out with their capability of finding clusters with various scales, shapes, and densities without requiring any prior information about setting the number of clusters. The two most frequently used density-based clustering algorithms are DBSCAN and OPTICS.
The DBSCAN algorithm finds core points of high density and expands clusters from them. This algorithm performs well on data that contains clusters of similar density. Unlike DBSCAN, OPTICS generates a hierarchical clustering result for a variable neighborhood radius and is better suited for usage on large datasets containing clusters of varying density. However, it is important to note that OPTICS has certain disadvantages compared to the DBSCAN algorithm. The two most significant disadvantages are memory cost and runtime complexity.
The real challenge of clustering lies in choosing the appropriate algorithm and setting the optimal values of its parameters, especially for high-dimensional data. This blog used the simplest examples, involving only two-dimensional data. In high-dimensional data, the distance between points becomes less informative, which is often referred to as one of the “curses of dimensionality.” Real-world data is usually high-dimensional, which often requires some form of preprocessing before clustering algorithms can be applied. Preprocessing techniques that can be used include dimensionality reduction, feature selection, and feature extraction.
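As a sketch of such preprocessing, dimensionality reduction with PCA can be applied before clustering. The pipeline below is a minimal illustration; the dataset and parameter choices are purely hypothetical, not a recommendation for any particular problem:

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import OPTICS

# Reduce 64-dimensional data to 2 principal components before clustering
X, _ = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)
X_reduced = PCA(n_components=2, random_state=0).fit_transform(X_scaled)
labels = OPTICS(min_samples=10).fit(X_reduced).labels_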
Clustering can be used in a wide range of applications. For example, a hospital might use clustering to identify patient subgroups based on their medical conditions, helping doctors tailor treatments to the specific needs of each subgroup. Clustering can also be used in e-commerce to segment customers based on their buying behavior, allowing retailers to target specific customer groups with personalized marketing campaigns and promotions.
References
[1] M. Ester, H. P. Kriegel, J. Sander, and X. Xu, “A density-based algorithm for discovering clusters in large spatial databases with noise”, KDD, vol. 96, no. 34, pp. 226-231, 1996.
[2] M. Ankerst, M. M. Breunig, H. P. Kriegel, and J. Sander, “OPTICS: ordering points to identify the clustering structure”, ACM SIGMOD Record, vol. 28, no. 2, pp. 49-60, 1999.
“Clustering Algorithms: DBSCAN vs. OPTICS” Tech Bite was brought to you by Anja Plakalović, Junior Data Analyst at Atlantbh.
Tech Bites are tips, tricks, snippets or explanations about various programming technologies and paradigms, which can help engineers with their everyday job.