Site icon R-bloggers

Generate datasets to understand some clustering algorithms behavior

[This article was first published on Stanislas Morbieu - R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In order to understand how a clustering algorithm works, good sample datasets are useful to highlight its behavior under certain circumstances. This post shows how to generate 9 datasets:

Mixture of two Gaussians

Our first dataset consists of a mixture of two Gaussians.

Let’s first define a function to generate a Gaussian. This function takes four arguments:

  • a number of samples to generate, n,
  • a center, corresponding to the mean of the Gaussian,
  • a covariance matrix, sigma,
  • and a cluster label.
library(mvtnorm)
library(dplyr)

generateGaussianData <- function(n, center, sigma, label) {
  data = rmvnorm(n, mean = center, sigma = sigma)
  data = data.frame(data)
  names(data) = c("x", "y")
  data = data %>% mutate(class=factor(label))
  data
}

We use the function rmvnorm from the package mvtnorm to generate random numbers following a multivariate normal distribution.

The package dplyr is used to manipulate dataframes, especially the %>% operator allows to pass the variable on the left of the operator as the first argument of the function on the right. It is convenient since we can read the processing pipeline from the left to the right, the transformation steps being the functions between the %>% operators. For instance, the mutate function creates a new column named class.

We can now generate our first dataset:

dataset1 <- {
  # cluster 1
  n = 500
  center = c(5, 5)
  sigma = matrix(c(1, 0, 0, 1), nrow = 2)
  data1 = generateGaussianData(n, center, sigma, 1)

  # cluster 2
  n = 500
  center = c(1, 1)
  sigma = matrix(c(1, 0, 0, 1), nrow = 2)
  data2 = generateGaussianData(n, center, sigma, 2)

  # all data
  data = bind_rows(data1, data2)

  data$dataset = "1 - Mixture of Gaussians"

  data
}

We defined two clusters of equal size (500 samples each), with an identity covariance matrix. The function bind_rows is defined by the dplyr package and stacks the two dataframes data1 and data2. The variable data is thus a dataframe with \(500 + 500 = 1000\) rows, each row being an observation.

We can see how the dataframe looks like with:

str(dataset1)

It outputs:

'data.frame':        1000 obs. of  4 variables:
 $ x      : num  3,79 5,09 5,50 4,50 4,94 ...
 $ y      : num  5,35 5,95 6,52 4,82 5,52 ...
 $ class  : chr  "1" "1" "1" "1" ...
 $ dataset: chr  "1 - Mixture of Gaussians" "1 - Mixture of Gaussians" "1 - Mixture of Gaussians" "1 - Mixture of Gaussians" ...

We can also visualize this dataset:

library(ggplot2)

dataset1 %>% ggplot(aes(x=x, y=y, shape=class)) +
  geom_point() +
  coord_fixed() +
  scale_shape_manual(values=c(2, 3))

We used the ggplot2 package to have a scatter plot with geam_point. The function coord_fixed() is here to have the same scale for the two axis, and we define the symbols to use for the points with scale_shape_manual(values=c(2, 3)). We take the columns x and y as coordinates and the shape is given by the class column.

Here is the plot we get:

This example is very specific since the variances of the two classes are equal, there is no covariance between x and y inside a class, and the two classes have the same size (i.e. number of samples per class). It makes some algorithms (k-means for instance) behave well, i.e. able to recover the true classes.

Mixture of two Gaussians with different sizes

For the second dataset, we generate two Gaussians as before, but this time, the two classes have distinct sizes:

dataset2 <- {
  # cluster 1
  n = 2000
  center = c(5, 5)
  sigma = matrix(c(1, 0, 0, 1), nrow = 2)
  data1 = generateGaussianData(n, center, sigma, 1)

  # cluster 2
  n = 50
  center = c(1, 1)
  sigma = matrix(c(1, 0, 0, 1), nrow = 2)
  data2 = generateGaussianData(n, center, sigma, 2)

  # all data
  data = bind_rows(data1, data2)

  data$dataset = "2 - Different sizes"

  data
}

We can plot it as before:

dataset2 %>% ggplot(aes(x=x, y=y, shape=class)) +
  geom_point() +
  coord_fixed() +
  scale_shape_manual(values=c(2, 3))

It produces:

We will see in a following post that this simple relaxation of the assumption on the sizes of the classes can make some algorithms (k-means for instance) behave differently, so that they cannot recover the true classes.

Mixture of two Gaussians with different variances

The third dataset is also generated by two Gaussians but with different variances:

dataset3 <- {
  # cluster 1
  n = 100
  center = c(5, 5)
  sigma = matrix(c(3, 0, 0, 3), nrow = 2)
  data1 = generateGaussianData(n, center, sigma, 1)

  # cluster 2
  n = 100
  center = c(1, 1)
  sigma = matrix(c(0.1, 0, 0, 0.1), nrow = 2)
  data2 = generateGaussianData(n, center, sigma, 2)

  # all data
  data = bind_rows(data1, data2)

  data$dataset = "3 - Different variances"

  data
}

The corresponding plot is:

Again, a simple relaxation, this time on the variances, makes some algorithms unable to recover the true classes.

Mixture of Gaussians with non zero covariance

The fourth dataset consists of two Gaussians, but this time with high covariance:

dataset4 <- {
  # cluster 1
  n = 500
  center = c(4, 4)
  sigma = matrix(c(1, -0.99, -0.99, 1), nrow = 2)
  data1 = generateGaussianData(n, center, sigma, 1)

  # cluster 2
  n = 500
  center = c(5, 5)
  sigma = matrix(c(1, -0.99, -0.99, 1), nrow = 2)
  data2 = generateGaussianData(n, center, sigma, 2)

  # all data
  data = bind_rows(data1, data2)

  data$dataset = "4 - Non zero covariance"

  data
}

The corresponding plot is:

Mixture of Gaussians with different sizes, variances and covariance

The Gaussian distribution has two parameters: the mean and the covariance matrix. When generating a mixture of Gaussians, a third parameter step in. Let’s see how to combine the previous examples: changing all these parameters at the same time:

dataset5 <- {
  # cluster 1
  n = 1000
  center = c(1, 1)
  sigma = matrix(c(3, -0.9, -0.9, 3), nrow = 2)
  data1 = generateGaussianData(n, center, sigma, 1)

  # cluster 2
  n = 50
  center = c(5, 5)
  sigma = matrix(c(1, 0.9, 0.9, 1), nrow = 2)
  data2 = generateGaussianData(n, center, sigma, 2)

  # all data
  data = bind_rows(data1, data2)

  data$dataset = "5 - Disparate Gaussians"

  data
}

The corresponding plot is:

We can now move on to another distribution.

Spherical data

Some real world datasets are inherently spherical, i.e. the points are lying on the surface of a sphere, so generating a spherical dataset is helpful to understand how an algorithm behave on this kind of data, in a controlled environment (we know our dataset better when we generate it). Our spherical dataset consists of a mixture of data following a Von Mises-Fisher distribution.

Let’s first define a function to generate one class, i.e. one Von Mises-Fisher distribution:

library(movMF)

generateMFData <- function(n, theta, cluster, scale=1) {
  data = rmovMF(n, theta)
  data = data.frame(data[,1], data[,2])
  data = scale * data
  names(data) = c("x", "y")
  data = data %>% mutate(class=factor(cluster))
  data
}

The function rmovMF from the movMF package generates samples given the two parameters of the Von Mises-Fisher distribution. We can now generate our mixture:

dataset6 <- {
  # cluster 1
  n = 100
  data1 = generateMFData(n, 1 * c(-1, 1), 1, 5)

  # cluster 2
  n = 100
  data2 = generateMFData(n, 1 * c(1, -1), 2, 1)

  # all data
  data = bind_rows(data1, data2)

  data$dataset = "6 - Spherical classes"

  data
}

The corresponding plot is:

We can see that all the points are lying on the surface of the hypersphere.

Spirals

We are going to generate the following dataset:

For this, we will create data for one spiral, by creating columns for:

  • the angle which spreads evenly between angleStart and angleStart + 2.5 * pi,
  • the radius which decrements for each sample,
  • the x and y coordinates, based on the radius and the angle.

The second spiral is the same as the first one, except we take the negative values of x and y.

We then combine the two dataframes and add some noise, and shift it so the coordinates stay in the same range as in the other datasets.

Here is the complete code to produce it:

generateSpiralData <- function(n) {
  maxRadius = 7
  xShift = 2.5
  yShift = 2.5
  angleStart = 2.5 * pi
  noiseVariance = 0.1

  # first spiral
  firstSpiral <- function() {
    d1 = data.frame(0:(n-1))
    colnames(d1) <- c("i")
    d1 %>% mutate(angle = angleStart + 2.5 * pi * (i / n),
                       radius = maxRadius * (n + n/5 - i)/ (n + n/5),
                       x = radius * sin(angle),
                       y = radius * cos(angle),
                       class=1)
  }
  d1 = firstSpiral()

  # second spiral
  d2 = d1 %>% mutate(x = -x, y = -y, class=2)

  # combine, add noise, and shift
  generateNoise <- function(n) {
    sigma = matrix(c(noiseVariance, 0, 0, noiseVariance), nrow = 2)
    noise = rmvnorm(n, mean = c(0, 0), sigma = sigma)
    df = data.frame(noise)
    colnames(df) <- c("xNoise", "yNoise")
    df
  }
  d1 %>%
    bind_rows(d2) %>%
    bind_cols(generateNoise(2*n)) %>%
    transmute(x = x + xShift + xNoise,
              y = y + yShift + yNoise,
              dataset = "7 - Spirals",
              class = factor(class))
}

dataset7 = generateSpiralData(500)

We used the transmute function from dplyr to create new columns and remove the other ones (used only temporarily).

Uniform data

We can generate data from the uniform distribution with the function runif:

dataset8 <- {
  generateUniformData <- function(cluster, minX, maxX) {
    x = runif(500, min = minX, max = maxX)
    y = runif(500, min = -4, max = 9)
    data.frame(x, y) %>% mutate(class=cluster)
  }

  generateUniformData(1, -4, 1) %>% bind_rows(generateUniformData(2, 3, 9)) %>%
    mutate(dataset = "8 - Uniform data",
           class = factor(class))
}

The corresponding plot is:

Unclustered data

The last dataset is almost the same as the previous one but the ranges for the uniform distributions are the same, so it is impossible to distinguish between the two classes:

dataset9 <- {
  generateuniformdata <- function(cluster) {
    x = runif(500, min = -4, max = 9)
    y = runif(500, min = -4, max = 9)
    data.frame(x, y) %>% mutate(class=cluster)
  }

  generateuniformdata(1) %>% bind_rows(generateuniformdata(2)) %>%
    mutate(dataset = "9 - unclustered data",
           class = factor(class))
}

The corresponding plot is:

All datasets in one plot

To get a big picture of all the datasets we generated, we can combine them all, in a single dataframe, and plot it in a single plot:

data = bind_rows(dataset1, dataset2, dataset3, dataset4, dataset5, dataset6, dataset7, dataset8, dataset9)

scatterPlot <- function(data) {
  scatterPlot = ggplot(data, aes(x=x, y=y, shape=class)) +
    geom_point() +
    coord_fixed() +
    scale_shape_manual(values=c(2, 3)) +
    facet_wrap(~dataset, nrow=3)
  scatterPlot
}

scatterPlot(data)

The facet_wrap function of the ggplot2 package allows us to plot all the datasets in a single plot:

We will use these datasets in other posts to illustrate the behavior of some clustering algorithms.

To leave a comment for the author, please follow the link and comment on their blog: Stanislas Morbieu - R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.