Density-Based Clustering Exercises
Density-based clustering is a technique that allows you to partition data into groups with similar characteristics (clusters) without specifying the number of those groups in advance. In density-based clustering, clusters are defined as dense regions of data points separated by low-density regions. Density is measured by the number of data points within some radius.
Advantages of density-based clustering:
- as mentioned above, it does not require a predefined number of clusters,
- clusters can be of any shape, including non-spherical ones,
- the technique is able to identify noise data (outliers).
Disadvantages:
- density-based clustering fails if there are no density drops between clusters,
- it is also sensitive to parameters that define density (radius and the minimum number of points); proper parameter setting may require domain knowledge.
There are different methods of density-based clustering. The most popular are DBSCAN (density-based spatial clustering of applications with noise), which assumes constant density of clusters, OPTICS (ordering points to identify the clustering structure), which allows for varying density, and “mean-shift”.
This set of exercises covers basic techniques for using the DBSCAN method, and allows you to compare its results to the results of the k-means clustering algorithm by means of silhouette analysis.
The set requires the packages dbscan, cluster, and factoextra to be installed. The exercises make use of the iris data set, which is supplied with R, and the Wholesale customers data set from the University of California, Irvine (UCI) machine learning repository (download here).
Answers to the exercises are available here.
Exercise 1
Create a new data frame using all but the last variable from the iris data set, which is supplied with R.
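A minimal sketch of one possible approach (the object name iris_num is illustrative; the column index assumes the standard iris layout, where Species is the fifth and last variable):

```r
# Drop the last variable (Species) from iris, keeping the four numeric measurements
iris_num <- iris[, -ncol(iris)]
head(iris_num)
```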
Exercise 2
Use the scale function to normalize the values of all variables in the new data set (with default settings). Ensure that the resulting object is of class data.frame.
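One way this could look, continuing with the illustrative name from the sketch above; note that scale() returns a matrix, so an explicit conversion is needed:

```r
# scale() centers and scales each column; the result is a matrix
iris_scaled <- as.data.frame(scale(iris_num))
class(iris_scaled)   # should be "data.frame"
summary(iris_scaled)
```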
Exercise 3
Plot the distribution of distances between data points and their fifth nearest neighbors using the kNNdistplot function from the dbscan package.
Examine the plot and find a tentative threshold at which distances start increasing quickly. On the same plot, draw a horizontal line at the level of the threshold.
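A possible sketch, assuming the dbscan package is installed; the threshold of 0.7 below is only an illustration and should be read off your own plot:

```r
library(dbscan)

# Distances to each point's 5th nearest neighbor, sorted in increasing order
kNNdistplot(iris_scaled, k = 5)

# Tentative threshold where the curve starts to rise sharply (value is an assumption)
abline(h = 0.7, lty = 2)
```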
Exercise 4
Use the dbscan function from the package of the same name to find density-based clusters in the data. Set the size of the epsilon neighborhood at the level of the threshold found in Exercise 3, and set the minimum number of points in the epsilon region equal to 5.
Assign the value returned by the function to an object, and print that object.
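A sketch of the call, with eps set to the illustrative threshold from the previous step (adjust it to the value you found):

```r
# eps = radius of the epsilon neighborhood, minPts = minimum points in that neighborhood
db <- dbscan(iris_scaled, eps = 0.7, minPts = 5)
db   # printing shows the number of clusters and the number of noise points
```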
Exercise 5
Plot the clusters with the fviz_cluster function from the factoextra package. Choose the geometry type to draw only points on the graph, and set the ellipse parameter so that no outline is drawn around the points of each cluster.
(Note that the fviz_cluster function produces a 2-dimensional plot. If the data set contains two variables, those variables are used for plotting; if the number of variables is larger, the first two principal components are drawn.)
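A possible call, using the illustrative object names from the sketches above:

```r
library(factoextra)

# geom = "point" draws only points; ellipse = FALSE suppresses the cluster outlines
fviz_cluster(db, data = iris_scaled, geom = "point", ellipse = FALSE)
```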
Exercise 6
Examine the structure of the cluster object obtained in Exercise 4, and find the vector with cluster assignments. Make a copy of the data set, add the vector of cluster assignments to the data set, and print its first few lines.
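One possible way to do this, continuing with the illustrative names used above:

```r
str(db)   # the cluster assignments are stored in the 'cluster' component

iris_labelled <- iris_scaled
iris_labelled$cluster <- db$cluster   # 0 denotes noise points
head(iris_labelled)
```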
Exercise 7
Now look at what happens if you change the epsilon value.
- Plot again the distribution of distances between data points and their fifth nearest neighbors (with the kNNdistplot function, as in Exercise 3). On that plot, draw horizontal lines at levels 1.8, 0.5, and 0.4.
- Use the dbscan function to find clusters in the data with the epsilon set at these values (as in Exercise 4).
- Plot the results (as in Exercise 5, but now set the ellipse parameter value such that an outline around points is drawn). A possible approach is sketched below.
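A compact sketch of the comparison; the eps values come from the exercise text, and the remaining names are illustrative:

```r
# Same kNN distance plot as before, with the three candidate thresholds marked
kNNdistplot(iris_scaled, k = 5)
abline(h = c(1.8, 0.5, 0.4), lty = 2)

# Re-run DBSCAN and plot the clusters for each epsilon value
for (eps in c(1.8, 0.5, 0.4)) {
  db_eps <- dbscan(iris_scaled, eps = eps, minPts = 5)
  print(fviz_cluster(db_eps, data = iris_scaled, geom = "point", ellipse = TRUE))
}
```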
Exercise 8
This exercise shows how the DBSCAN algorithm can be used as a way to detect outliers:
- Load the Wholesale customers data set, and delete all variables with the exception of Fresh and Milk. Assign the data set to the customers variable.
- Discover clusters using the steps from Exercises 2-5: scale the data, choose an epsilon value, find clusters, and plot them. Set the minimum number of points to 5. Use the db_clusters_customers variable to store the output of the dbscan function. A possible approach is sketched below.
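A possible sketch; the file path is illustrative (adjust it to wherever you saved the UCI CSV), and the eps value of 0.15 is an assumption that should be read off your own kNN distance plot:

```r
# Read the downloaded CSV and keep only the Fresh and Milk variables
customers <- read.csv("Wholesale customers data.csv")[, c("Fresh", "Milk")]
customers <- as.data.frame(scale(customers))

# Choose epsilon from the 5-NN distance plot (0.15 here is only an illustration)
kNNdistplot(customers, k = 5)
abline(h = 0.15, lty = 2)

db_clusters_customers <- dbscan(customers, eps = 0.15, minPts = 5)
fviz_cluster(db_clusters_customers, data = customers, geom = "point", ellipse = FALSE)
```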
Exercise 9
Compare the results obtained in the previous exercise with the results of the k-means algorithm. First, find clusters using this algorithm:
- Use the same data set, but get rid of outliers for both variables (here the outliers may be defined as values beyond 2.5 standard deviations from the mean; note that the values are already expressed in units of standard deviations from the mean). Assign the new data set to the customers_core variable.
- Use the kmeans function to obtain an object with cluster assignments. Set the number of centers equal to 4, and the number of initial random sets (the nstart parameter) equal to 10. Assign the obtained object to the km_clusters_customers variable.
- Plot the clusters using the fviz_cluster function (as in the previous exercise). A possible approach is sketched below.
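A possible sketch; since the data were already scaled, values beyond 2.5 standard deviations can be filtered directly:

```r
# Keep only observations within 2.5 standard deviations of the mean on both variables
customers_core <- customers[abs(customers$Fresh) <= 2.5 & abs(customers$Milk) <= 2.5, ]

# k-means with 4 centers and 10 random starts
km_clusters_customers <- kmeans(customers_core, centers = 4, nstart = 10)
fviz_cluster(km_clusters_customers, data = customers_core, geom = "point", ellipse = FALSE)
```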
Exercise 10
Now compare the results of DBSCAN and k-means using silhouette analysis:
- Retrieve the vector of cluster assignments from the db_clusters_customers object.
- Calculate distances between data points in the customers data set using the dist function (with the default parameters).
- Use the vector and the distances object as inputs to the silhouette function from the cluster package to get a silhouette information object.
- Plot that object with the fviz_silhouette function from the factoextra package.
- Repeat the steps described above for the km_clusters_customers object and the customers_core data set.
- Compare the two plots and the average silhouette width values. A sketch of a possible approach follows.
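A possible sketch of the silhouette comparison, reusing the objects created in the previous sketches:

```r
library(cluster)

# Silhouette for the DBSCAN solution (noise points appear as cluster 0)
sil_db <- silhouette(db_clusters_customers$cluster, dist(customers))
fviz_silhouette(sil_db)

# Silhouette for the k-means solution on the outlier-free data
sil_km <- silhouette(km_clusters_customers$cluster, dist(customers_core))
fviz_silhouette(sil_km)
```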