Week 6, Wednesday – Statistically Speaking

Today I learned how and when to use DBSCAN (Density-Based Spatial Clustering of Applications with Noise). It is a clustering method that groups data points in a space based on their density. Unlike k-means, it does not require us to specify the number of clusters in advance, and it can find clusters of arbitrary shape.

According to DBSCAN, a cluster is a dense region of data points separated from other clusters by sparser regions. It divides points into three types: core, border, and noise.
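To make "dense" a bit more concrete, here is a small sketch that counts how many points lie within a fixed radius of each point. The toy coordinates, the radius value and the helper name `neighbour_counts` are all made up for illustration, and counting the point itself as its own neighbour is an assumption.

```python
from math import dist

def neighbour_counts(points, eps):
    """For each point, count the points within distance eps (itself included)."""
    return [sum(1 for q in points if dist(p, q) <= eps) for p in points]

# Hypothetical toy data: two tight groups and one isolated point.
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),  # dense group A
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),  # dense group B
    (2.5, 2.5),                                      # isolated point
]

counts = neighbour_counts(points, eps=0.3)
for p, c in zip(points, counts):
    print(p, c)
```

Points inside either group get a count of 4, while the isolated point only counts itself. That gap in local density is what DBSCAN relies on to tell clusters from the sparse space between them.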

A core point has at least a minimum number of neighbouring points (min_samples) within a given distance (epsilon, or eps). A border point has fewer than min_samples neighbours but lies within the neighbourhood of a core point. The remaining points are noise and do not belong to any cluster.

DBSCAN starts by picking an arbitrary unvisited data point. If it is a core point, it creates a new cluster containing the point and all of its neighbours, which may be either core or border points. It then repeats this for every neighbour that is itself a core point, adding their neighbours to the cluster, and continues until there are no more points to add. After that, it moves on to another unvisited point and repeats the process until every point has been visited.

One of its main advantages is that it is robust to outliers: points in sparse regions are labelled as noise instead of being forced into a cluster.
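The procedure above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function names (`region_query`, `dbscan`, `point_types`) and the toy data are my own. It also makes three assumptions: it visits points in index order instead of at random, it uses -1 as the noise label, and it counts a point as its own neighbour.

```python
from math import dist

NOISE = -1  # assumed label for points that belong to no cluster

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, min_samples):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_samples:
            labels[i] = NOISE  # may turn out to be a border point later
            continue
        cluster += 1  # i is a core point, so it starts a new cluster
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
                continue
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_samples:
                queue.extend(j_neighbours)  # j is core, so keep expanding
    return labels

def point_types(points, labels, eps, min_samples):
    """Classify every point as 'core', 'border' or 'noise'."""
    types = []
    for i, label in enumerate(labels):
        if len(region_query(points, i, eps)) >= min_samples:
            types.append("core")
        elif label != NOISE:
            types.append("border")
        else:
            types.append("noise")
    return types

# Hypothetical toy data: two square groups, one point hanging off the
# first group, and one point far from everything.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1), (2, 1),
    (8, 8), (8, 9), (9, 8), (9, 9),
    (5, 5),
]
labels = dbscan(points, eps=1.5, min_samples=4)
types = point_types(points, labels, eps=1.5, min_samples=4)
for p, label, kind in zip(points, labels, types):
    print(p, label, kind)
```

On this data the point (2, 1) joins the first cluster as a border point, because it is within eps of a core point but has too few neighbours of its own, and (5, 5) is left as noise. Each call to `region_query` scans every point, so this version takes quadratic time and is only suitable for small examples.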
