Notebook 3 : Unsupervised learning

Notebook prepared by Chloé-Agathe Azencott, modified by Victor Laigle.

In this notebook we will explore several techniques for dimension reduction and clustering.

# Load numpy, pandas and matplotlib (with aliases np, pd and plt respectively)
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rc('font', **{'size': 12}) # set the global font size for plots (in pt)

1. Principal Components Analysis (PCA)

In this section, we are going to perform a principal component analysis on a dataset describing the scores obtained by the best athletes who participated in a decathlon event in 2004, at the Athens Olympic Games or at the Talence Decastar.

Data loading

The data are contained in the decathlon.txt file.

The file contains 42 rows and 13 columns.

The first line is a header that describes the contents of the columns.

The following lines describe the 41 athletes.

The first 10 columns contain the scores obtained in the different events of the Decathlon. The 11th column contains the ranking. The 12th column contains the number of points obtained. The 13th column contains a categorical variable that specifies the competition concerned (Olympics or Decastar).

We will start by downloading the data from the github repository to your current working directory, and load the data into a pandas dataframe.

If you already downloaded the data, you can use the alternative suggested in the next cell.

!wget https://raw.githubusercontent.com/CBIO-mines/fml-dassault-systems/main/data/decathlon.txt

my_data = pd.read_csv('decathlon.txt', sep="\t")
--2026-05-13 09:40:26--  https://raw.githubusercontent.com/CBIO-mines/fml-dassault-systems/main/data/decathlon.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3579 (3.5K) [text/plain]
Saving to: ‘decathlon.txt.1’


decathlon.txt.1       0%[                    ]       0  --.-KB/s               
decathlon.txt.1     100%[===================>]   3.50K  --.-KB/s    in 0s      

2026-05-13 09:40:26 (56.8 MB/s) - ‘decathlon.txt.1’ saved [3579/3579]

Alternatively: If you’re not on Colab and have already downloaded the file to a data folder, uncomment and run the following cell:

# my_data = pd.read_csv('data/decathlon.txt', sep="\t")  # Read the data into a dataframe
my_data.head()
100m Long.jump Shot.put High.jump 400m 110m.hurdle Discus Pole.vault Javeline 1500m Rank Points Competition
SEBRLE 11.04 7.58 14.83 2.07 49.81 14.69 43.75 5.02 63.19 291.7 1 8217 Decastar
CLAY 10.76 7.40 14.26 1.86 49.37 14.05 50.72 4.92 60.15 301.5 2 8122 Decastar
KARPOV 11.02 7.30 14.77 2.04 48.37 14.09 48.95 4.92 50.31 300.2 3 8099 Decastar
BERNARD 11.02 7.23 14.25 1.92 48.93 14.99 40.87 5.32 62.77 280.1 4 8067 Decastar
YURKOV 11.34 7.09 15.19 2.10 50.42 15.31 46.26 4.72 63.44 276.4 5 8036 Decastar

Visualization

A scatter matrix is a visualisation, in k x k panels, of the pairwise relations between k variables:

  • on the diagonal, the histogram of each variable
  • off-diagonal, the scatter plots between two variables (non standardized)

Your job here is to display the visualization with the scatter_matrix function:

from pandas.plotting import scatter_matrix

### START OF YOUR CODE

scatter_matrix(my_data, s=60, figsize=(18, 18));

### END OF YOUR CODE

You can also limit the visualization to a few variables for clearer observations. Display the scatter_matrix for the 3 or 4 variables which seem most correlated for example:

### START OF YOUR CODE

scatter_matrix(my_data[['Shot.put', 'High.jump', '400m']],
               alpha=0.5, s=60, figsize=(8, 8));

### END OF YOUR CODE

Alternatively, the seaborn library la librairie seaborn enables more elaborated visualizations than matplotlib. You can explore, for instance, the capabilities of jointplot.

import seaborn as sns
sns.set_style('whitegrid')

sns.jointplot(x='Shot.put', y='400m', data = my_data,
              height=6, space=0, color='b')

sns.jointplot(x='Rank', y='Points', data = my_data,
              kind='reg', height=6, space=0, color='b');

We will now perform a principal components analysis of the scores in the 10 events.

Let’s start by extracting the predictive variables

X = np.array(my_data.drop(columns=['Points', 'Rank', 'Competition']))
print(X.shape)
(41, 10)

Data standardization

After visualizing the data, we can notice different data scales and distributions depending on the variables. We therefore reapply the procedure seen in previous labs to standardize our data: we need a StandardScaler object contained in the preprocessing module of sklearn.

### START OF YOUR CODE

# Import the module
from sklearn import preprocessing
# Create the StandardScaler object
std_scaler = preprocessing.StandardScaler()

# Fit the object to the data
std_scaler.fit(X)

# Transform the data
X_scaled = std_scaler.transform(X)

### END OF YOUR CODE

print(X_scaled.shape)
(41, 10)

Computation of the principal components

scikit-learn’s algorithms for matrix factorization are included in the decomposition module. For PCA, refer to: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

from sklearn import decomposition

Note: We have few variables here so we can afford to calculate all the PCs.

As we’ve seen in previous labs, most algorithms implemented in scikit-learn work as follows:

  • we instantiate an object, corresponding to a type of algorithm/model, with its hyperparameters (here the number of components)
  • we use the fit method to pass the data to this algorithm
  • the learned parameters are now accessible as attributes of this object
# Instantiation of a PCA object with 10 principal components

### START OF YOUR CODE
pca = decomposition.PCA(n_components=10)
# We now pass the standardized data to this object
# This is where the computations are performed
pca.fit(X_scaled)

### END OF YOUR CODE
PCA(n_components=10)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Proportion of variance explained by the PCs

We will now plot the proportion of variance explained by the different components. This is accessible in the attribute explained_variance_ratio_ of our pca object.

plt.plot(np.arange(1, 11), pca.explained_variance_ratio_, marker='o')

plt.xlabel("Number of principal components")
plt.ylabel("Proportion of variance explained")
plt.show()

We can also display the cumulative proportion of variance explained, with the function cumsum from numpy (imported above under the alias np)

Create a similar graph to the one just above, displaying the cumulative proportion of variance explained as a function of the number of principal components considered.

### START OF YOUR CODE

plt.plot(np.arange(1, 11), np.cumsum(pca.explained_variance_ratio_), marker='o')

plt.xlabel("Number of principal components")
plt.title("Cumulative proportion of variance explained")
plt.show()

### END OF YOUR CODE

Questions:

  • What is the proportion of variance explained by the first two components ?
  • How many components would we need to use to explain 80% of the data’s variance ?
  • What is the proportion of variance explained by the first two components ? The first two principal components explain approximately 0.50 (or 50%) of the total variance.

  • How many components would we need to use to explain 80% of the data’s variance ? We would need to use 5 components to explain 80% of the data’s variance.

Data projection on the first two principal components

We will now use only the first two principal components.

Let’s start by computing the new data representation, i.e. their projection on these two PCs.

X_projected = pca.transform(X_scaled)
print(X_projected.shape)
(41, 10)

We can display a scatter plot representing the data along these two PCs.

fig = plt.figure(figsize=(5, 5))

plt.scatter(X_projected[:, 0], X_projected[:, 1])

plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()

We can now color each point of the above scatter plot according to the ranking of the athlete it represents.

fig = plt.figure(figsize=(5, 5))

plt.scatter(X_projected[:, 0], X_projected[:, 1], c=my_data['Rank'])

plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.colorbar(label='Rank')
plt.show()

Question: What can we conclude about the interpretation of PC1 ?

  • Athletes with lower ranks (meaning better performance, as rank 1 is best) tend to be on the left side of the plot, corresponding to more negative values on PC1.
  • Athletes with higher ranks (meaning worse performance) tend to be on the right side of the plot, corresponding to more positive values on PC1.

Interpretation of the first two principal components

Each principal component is a linear combination of the variables describing the data. The weights of this linear combination are accessible in pca.components_.

We can now visualize, not the individuals as above, but the 10 variables in the space of the 2 PCs.

pcs = pca.components_
print(pcs[0])
[ 0.42829627 -0.41015201 -0.34414444 -0.31619436  0.3757157   0.41255442
 -0.30542571 -0.02783081 -0.15319802  0.03210733]
fig = plt.figure(figsize=(6, 6))

plt.scatter(pcs[0], pcs[1])
for (x_coordinate, y_coordinate, feature_name) in zip(pcs[0], pcs[1], my_data.columns[:10]):
    plt.text(x_coordinate + 0.01, y_coordinate + 0.01, feature_name)

plt.xlabel("Contribution to PC1")
plt.ylabel("Contribution to PC2")
plt.show()

Questions:

  • Which variables have contributions very similar to the two principal components ?
  • What can we deduce from their similarity ?
  • How to interpret the sign of the variables’ contributions to the first principal component ?
  • Which variables have contributions very similar to the two principal components ?

    • Variables with similar contributions to PC1: ‘100m’, ‘400m’, ‘110m.hurdle’ are on the positive side of PC1. ‘Long.jump’, ‘Shot.put’, ‘Discus’, ‘Javeline’, ‘High.jump’, ‘Pole.vault’ are on the negative side of PC1.
    • Variables with similar contributions to PC2: almost all events have positive contributions to PC2, with the exception of ‘Long.jump’ and ‘Pole.vault’ events that have negative contributions.
  • What can we deduce from their similarity ?

    • Variables that have similar contributions to a principal component are often correlated with each other. For example, ‘100m’, ‘400m’, and ‘110m.hurdle’ are all speed-based running events. Their similar contributions to PC1 indicate that athletes who perform well (or poorly) in one of these events tend to perform similarly in the others. The same applies to the throwing and jumping events.
    • When variables are far apart on the plot, especially on opposite sides of the origin, it suggests they are negatively correlated. For instance, good performance in speed events (positive PC1) seems to be inversely related to good performance in throwing/jumping events (negative PC1).
  • How to interpret the sign of the variables’ contributions to the first principal component ? The sign of a variable’s contribution to a principal component indicates the direction of its relationship with that component:

    • Positive contributions: ‘100m’, ‘400m’, ‘110m.hurdle’ (sprint/running events)
    • Negative contributions: ‘Long.jump’, ‘Shot.put’, ‘High.jump’, ‘Discus’, ‘Pole.vault’, ‘Javeline’ (field events, throwing/jumping)

    PC1 separates athletes primarily based on their performance in running events versus throwing/jumping events. A high positive value along PC1 indicates better performance in running events (lower times result in higher scores in the decathlon, and in standardized data, this means being ‘above average’ for these events). Conversely, a high negative value along PC1 suggests stronger performance in throwing and jumping events. In essence, PC1 captures a trade-off or distinction between speed-based events and power/technique-based events.

2. « Olivetti » data

We will now use dimension reduction to represent, in two dimensions, a dataset of human faces. This is a classic dataset, containing 400 pictures of 64 by 64 pixels. These are pictures of the face of 40 different people, 10 pictures per person, labeled by a class number between 0 and 39 to identify the person.

We can load this dataset directly thanks to scikit-learn:

from sklearn import datasets
data = datasets.fetch_olivetti_faces()

If you do not manage to download the data:

  • Go to : https://github.com/CroncLee/PCA-face-recognition/blob/master/olivetti_py3.pkz
  • Download the file (there is a Download button)
  • Use the command
    data = datasets.fetch_olivetti_faces(data_home="<PATH TO DATA>")

Replace “PATH TO DATA” by the path of the folder where you saved the data.

X = data.data
y = data.target
print(X.shape)
(400, 4096)
print("The dataset contains %d classes" % len(np.unique(y)))
The dataset contains 40 classes

Each image is represented by one value for each pixel (grey scale).

We may visualize these images, under the condition that we reorganize the values (= a vector of length 4096) in 64x64 matrices. For example, for image with index 77:

plt.imshow(X[77, :].reshape((64, 64)), interpolation='nearest', cmap=plt.cm.gray);

PCA

Let’s start with a principal components analysis as in the previous section:

pca = decomposition.PCA(n_components=2)
X_transformed_pca = pca.fit_transform(X)

Each picture is now represented, not by 4096 variables, but by 2 variables. We can visualize them on a scatter plot and color them according to their class:

plt.scatter(X_transformed_pca[:, 0], X_transformed_pca[:, 1], c=y)
plt.xlabel("First principal component")
plt.ylabel("Second principal component")
plt.show()

Question: Do pictures of the same face (= same class) have close or far apart representations ?

  • Pictures of the same face generally tend to group together, but with significant overlap. While there are visible clusters for some individuals, many classes (colors) blend into each other, indicating that PCA with only two components does not perfectly separate all 40 individuals.
  • This suggests that PCA, while effective for general dimensionality reduction by preserving global variance, may not be the optimal method for class separation and visualization when many classes are present. It captures differences, but not always enough to clearly distinguish all unique identities in a low-dimensional space without overlap.

We can visualize the contribution of each pixel to the first principal component:

plt.imshow(pca.components_[0, :].reshape((64,64)), interpolation='nearest', cmap=plt.cm.gray);

And the contribution of each pixel to the second principal component:

plt.imshow(pca.components_[1, :].reshape((64,64)), interpolation='nearest', cmap=plt.cm.gray);

Question: Which interpretation can we draw from these two images representing the contributions of each pixel to the first two principal components ?

  • First Principal Component (PC1): The first image shows a pattern that seems to highlight general facial features like the forehead, eyes, and nose bridge. Areas with strong positive or negative contributions indicate pixels that vary most along this principal component. For faces, PC1 often captures variations related to overall lightness/darkness, face width, or general head posture. For example, a strong positive contribution in the forehead area and negative in the lower face might indicate a PC related to the vertical position of facial features or head tilt.

  • Second Principal Component (PC2): The second image shows a different pattern of contributions. It might highlight variations in specific features, such as the eyes, mouth shape, or jawline. For instance, a strong contrast (positive vs. negative contributions) between the left and right sides of the face could indicate variations in head rotation or asymmetry. It could also represent differences in lighting conditions that emphasize certain areas more than others.

General remark: These images are essentially “eigenfaces.” Each eigenface represents a fundamental direction of variance in the dataset. When we project a new face onto these principal components, the coefficients tell us how much of each eigenface is present in that new face. By looking at the eigenfaces, we can infer that PC1 captures the most significant overall variations (e.g., general face structure or lighting), while PC2 captures the second most significant variations, often related to more specific features or orientations orthogonal to PC1.

tSNE

Let’s try another dimension reduction method to better separate our classes.

We will use the same approach as for PCA, but with the tSNE algorithm, thanks to the TSNE class of the manifold module.

Now it’s your turn to:

  • create the tsne object from the class cited above. Here we want to reduce our dataset to two dimensions in order to visualize it.
  • assign the information of our dataset to this object
  • transform our data to obtain new coordinates in two dimensions
from sklearn import manifold
### START OF YOUR CODE

tsne = manifold.TSNE(n_components=2, init='random', learning_rate='auto')
X_transformed = tsne.fit_transform(X)

### END OF YOUR CODE

Let’s display the result of the tSNE dimension reduction:

plt.scatter(X_transformed[:, 0], X_transformed[:, 1], c=y)
plt.xlabel("Première composante tSNE")
plt.ylabel("Deuxième composante tSNE");

Influence of the perplexity parameter

The main hyperparameter influencing the representation obtained by the tSNE algorithm is the “perplexity” parameter. This represents the number of neighbours for which distances are preserved. It therefore influences the preservation of the local structure (low perplexity) or global structure (high perplexity). The representation obtained can vary greatly depending on this parameter.

Test different values of the perplexity parameter and display the corresponding results.

### START OF YOUR CODE

tsne_low_perp = manifold.TSNE(n_components=2, init='random', learning_rate='auto', perplexity=1)
X_transformed_low_perp = tsne_low_perp.fit_transform(X)

### END OF YOUR CODE
plt.scatter(X_transformed_low_perp[:, 0], X_transformed_low_perp[:, 1], c=y)
plt.xlabel("First tSNE component")
plt.ylabel("Second tSNE component")
plt.title("tSNE (low perplexity)");

### START OF YOUR CODE

tsne_high_perp = manifold.TSNE(n_components=2, init='random', learning_rate='auto', perplexity=100)
X_transformed_high_perp = tsne_high_perp.fit_transform(X)

### END OF YOUR CODE
plt.scatter(X_transformed_high_perp[:, 0], X_transformed_high_perp[:, 1], c=y)
plt.xlabel("First tSNE component")
plt.ylabel("Second tSNE component")
plt.title("tSNE (high perplexity)");

3. Clustering

We will now cover another set of unsupervised methods, used to group observations according to their similarities.

We’ll start by generating three two-dimensional datasets:

  • 4 separated blobs from normal distributions
  • 2 nested semi-circles (or “semi-moons”)
  • 2 concentric circles
from sklearn import datasets
# nombre de points
n_samples = 1000

four_blobs, four_blobs_labels = datasets.make_blobs(n_samples=n_samples, centers=4, n_features=2, random_state=170)

moons, moons_labels = datasets.make_moons(n_samples=n_samples, noise=0.05, random_state=170)

circles, circles_labels = datasets.make_circles(n_samples=n_samples, factor=.5, noise=.05, random_state=170)

First, let’s visualize these data:

fig, ax = plt.subplots(1, 3, figsize=(10, 3))

ax[0].scatter(four_blobs[:, 0], four_blobs[:, 1], c=four_blobs_labels, s=20, alpha=0.7, cmap='viridis')
ax[1].scatter(moons[:, 0], moons[:, 1], c=moons_labels, s=20, alpha=0.7, cmap='viridis')
ax[2].scatter(circles[:, 0], circles[:, 1], c=circles_labels, s=20, alpha=0.7, cmap='viridis')

ax[0].set_title('4 blobs')
ax[1].set_title('Moons')
ax[2].set_title('Circles');

Now, let’s assume that we do not have access to the labels. Which clustering algorithms allow us to recover the clusters corresponding respectively to the four blobs, two moons and two circles ?

K-means algorithm

The objective of the k-means algorithm is to find \(K\) clusters (and their centroid \(\mu_k\)) so as to minimize the intra-cluster variance:

\[\begin{align} V = \sum_{k = 1}^{K} \sum_{x \in C_k} \frac{1}{|C_k|} (\|x - \mu_k\|^2) \end{align}\]

Manual implementation

We’ll start by implementing the algorithm “by hand”, step by step, to fully understand and visualize what the algorithm does. We will then see how to use directly the K-means implementation in the sklearn library. The different steps of the algorithm are:

  1. Select the number k of clusters (hyperparameter)
  2. Initialize the k centroids at random among our data points
  3. Compute the distances from all data points to these centroids
  4. Assign each point to the cluster of the closest centroid
  5. Compute the position of the new centroids
  6. Repeat steps 3 to 5 until convergence, i.e. until the centroids no longer change from one iteration to the next

We will implement this algorithm on the 4 blobs dataset, starting with the choice of k and with the random selection of k points from our dataset that will constitute the initial centroids:

np.random.seed(23)  # set seed. Important for k-means step by step visualisation. With other seeds, it's not as clear how the algorithm works.

k = 4
random_indices = np.random.choice(len(four_blobs), k, replace=False)
centroids_step0 = four_blobs[random_indices]

for i, centroid in enumerate(centroids_step0):
    print(f"Centroid {i} : x = {centroid[0]},\ty = {centroid[1]}")
Centroid 0 : x = 3.5679755708830445,    y = -0.7411643083568404
Centroid 1 : x = -4.58197076050192, y = 0.1707016264777655
Centroid 2 : x = -9.582073371322199,    y = -6.079460047487451
Centroid 3 : x = -7.990366325280352,    y = -5.347131046114068

Let’s look at the data with these initial centroids (red crosses)

fig, ax = plt.subplots(1, 1, figsize=(4, 4))

ax.scatter(four_blobs[:, 0], four_blobs[:, 1], c='grey', s=20, alpha=0.7)
ax.scatter(centroids_step0[:, 0], centroids_step0[:, 1], c='red', marker='x')

plt.show()

We can now compute the distances between each data point and these centroids

# Euclidean distance between data point and centroid
def compute_distance(data_point, centroid):
    dist = np.sqrt(np.sum((data_point - centroid)**2))
    return dist

def compute_all_distances(dataset, centroids):
    # Initialize distances array
    distances = np.zeros((k, n_samples))  # k and n_samples defined above

    # Calculate distance from each point to each centroid
    for i in range(k):
        for j in range(n_samples):
            distances[i, j] = compute_distance(dataset[j], centroids[i])

    return distances
distances = compute_all_distances(four_blobs, centroids_step0)

# Example of distances for the first 5 points to the k centroids
print("Distances :")
print("\t\t", " \t\t".join([f"Point {i+1}" for i in range(5)]))
for i in range(k):
    print(f"Centroid {i}\t", "\t".join(distances[i, :5].astype(str).tolist()))
Distances :
         Point 1        Point 2         Point 3         Point 4         Point 5
Centroid 0   11.784247411011116 14.758813013069012  9.466528460382866   9.451368197501498   13.843809467072518
Centroid 1   5.758643667904218  8.776449714052207   14.254003753639761  13.797528285545447  7.706191288891633
Centroid 2   2.448740366311389  0.8050123894604602  15.465578634003116  14.741265477233672  0.34858815305556784
Centroid 3   0.7400184315424295 2.3987055841506097  14.158075744365863  13.461171066113893  1.4045646722216132

We now assign to each point the cluster corresponding to the closest centroid. We use the argmin function from numpy for that. In a way it can be seen as intermediate predicted labels.

def assign_cluster(distances):
    assignments = np.argmin(distances, axis=0)
    return assignments

intermediate_labels = assign_cluster(distances)
print(intermediate_labels[:5])  # examples of intermediate labels assigned to the data points
[3 2 0 0 2]

We can verify that the intermediate labels assigned indeed correspond to the closest centroid by comparing with the distances calculated above (see previous cell).

Let’s visualize these intermediate clusters and the initial centroids on a scatter plot

def visualise_kmeans(dataset, labels, centroids, ax):

    ax.scatter(dataset[:, 0], dataset[:, 1], c=labels, s=20, alpha=0.7, cmap='viridis')
    ax.scatter(centroids[:, 0], centroids[:, 1], c='red', marker='x')

fig, ax = plt.subplots(1, 1, figsize=(4, 4))
visualise_kmeans(four_blobs, intermediate_labels, centroids_step0, ax)
plt.show()

We now need to compute the position of the new centroids, which we will plot on a new graph

def compute_new_centroids(dataset, labels):
    centroids = np.zeros((k, dataset.shape[1]))

    for i in range(k):
        centroids[i] = dataset[labels == i].mean(axis=0)

    return centroids

centroids_step1 = compute_new_centroids(four_blobs, intermediate_labels)

fig, axes = plt.subplots(1, 2, figsize=(9, 4))
visualise_kmeans(four_blobs, intermediate_labels, centroids_step0, axes[0])
visualise_kmeans(four_blobs, intermediate_labels, centroids_step1, axes[1])

axes[0].set_title("Initial centroids")
axes[1].set_title("Centroids after one iteration")

plt.show()

We now repeat the different steps:

  • computation of distances to centroids,
  • assignment of clusters to data points,
  • computation of new centroids
fig, axes = plt.subplots(1, 5, figsize=(24, 4))
n_iter = 10  # Number of iterations of the algorithm

current_fig = 0

centroids = centroids_step1
for i in range(n_iter):

    distances = compute_all_distances(four_blobs, centroids)
    intermediate_labels = assign_cluster(distances)

    # We display the visualization every 2 iterations
    if (i+1) % 2 == 0:
        visualise_kmeans(four_blobs, intermediate_labels, centroids, axes[current_fig])
        axes[current_fig].set_title(f"Iteration {i+1}")
        current_fig += 1

    centroids = compute_new_centroids(four_blobs, intermediate_labels)

plt.show()

We clearly see here how, through successive iterations, the algorithm is able to correctly identify our clusters, gradually moving the centroids and readjusting the clusters’ membership of our data points.

Question: In which case(s) can the algorithm yield bad results ?

The K-means algorithm can yield bad results in several cases:

  1. Non-Globular or Non-Convex Clusters: K-means assumes that clusters are spherical (globular) and of similar size and density. It works by finding centroids and assigning points to the closest centroid, which implicitly defines convex boundaries. If the true clusters have irregular shapes (e.g., crescent moons, concentric circles, or elongated shapes), K-means will struggle to separate them effectively, often splitting a single true cluster or merging multiple true clusters.

  2. Varying Densities or Sizes: If clusters have significantly different densities or sizes, K-means may fail. It tends to create clusters of roughly equal variance. A denser, smaller cluster might be incorrectly absorbed into a larger, sparser cluster if the centroid of the larger cluster is closer on average.

  3. Sensitivity to Initial Centroids: The algorithm’s outcome can be highly dependent on the initial placement of the centroids. If the initial centroids are poorly chosen (e.g., all fall within a single natural cluster or are outliers), K-means might converge to a local optimum rather than the global optimum, leading to suboptimal or incorrect clustering. This is why running K-means multiple times with different initializations (e.g., n_init parameter in sklearn.cluster.KMeans) is a common practice.

  4. Presence of Outliers: K-means assigns every data point to a cluster. Outliers, which are far from any true cluster, can disproportionately influence the position of centroids, distorting the clusters and leading to incorrect assignments for other points.

Implementation with sklearn

Let’s come back to the 3 datasets created earlier (4 blobs, semi-moons and concentric circles), and apply the to them the sklearn implementation of the K-means algorithm, which is an optimized version of the steps we have just seen.

Documentation : https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

from sklearn import cluster
# START OF YOUR CODE

# Instantiation of the three k-means models with the
# theoretical number of clusters (resp. 4, 2 and 2):
kmeans_four_blobs = cluster.KMeans(n_clusters=4)
kmeans_moons = cluster.KMeans(n_clusters=2)
kmeans_circles = cluster.KMeans(n_clusters=2)

# Application to the data
kmeans_four_blobs.fit(four_blobs)
kmeans_moons.fit(moons)
kmeans_circles.fit(circles)

### END OF YOUR CODE
KMeans(n_clusters=2)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

The .labels_ attribute contains, for each observation, the number of the cluster to which this observation has been assigned.

kmeans_four_blobs.labels_[0:10]  # Example of predicted labels for the first 10 data points
array([1, 1, 0, 0, 1, 2, 0, 3, 2, 0], dtype=int32)

Let’s visualize what the final clustering looks like for our three datasets:

# Clustering visualization
fig, ax = plt.subplots(1, 3, figsize=(10, 3))

ax[0].scatter(four_blobs[:, 0], four_blobs[:, 1], c=kmeans_four_blobs.labels_, cmap='viridis')
ax[1].scatter(moons[:, 0], moons[:, 1], c=kmeans_moons.labels_, cmap='viridis')
ax[2].scatter(circles[:, 0], circles[:, 1], c=kmeans_circles.labels_, cmap='viridis')

ax[0].set_title('4 blobs (k=4)')
ax[1].set_title('Moons (k=2)')
ax[2].set_title('Circles (k=2)');

Questions:

  • Is it the expected clustering ?
  • In which case(s) does the k-means algorithm work correctly ?
  • Why doesn’t it work in the other case(s) ?
  • Is it the expected clustering?
    • For the ‘4 blobs’ dataset, the clustering is as expected. K-means successfully separated the four distinct, spherical clusters.
    • For the ‘Moons’ dataset, the clustering is NOT as expected. K-means tries to divide the data into two convex regions, splitting the natural ‘moon’ shapes across its boundaries.
    • For the ‘Circles’ dataset, the clustering is NOT as expected. K-means fails to identify the concentric circles and instead creates two clusters by drawing a linear boundary, as it struggles with non-convex shapes.
  • In which case(s) does the k-means algorithm work correctly? The K-means algorithm works correctly and is well-suited for datasets where:
    • Clusters are spherical or globular: K-means is designed to find compact, hyperspherical clusters.
    • Clusters are of similar size and density: It tends to create clusters of roughly equal variance.
    • Clusters are well-separated: There should be clear boundaries between the groups.
    • The ‘4 blobs’ dataset is a perfect example where K-means performs well because the clusters meet these criteria.
  • Why doesn’t it work in the other case(s)? K-means fails for the ‘Moons’ and ‘Circles’ datasets because:
    • Assumes Convex (Globular) Shapes: K-means partitions the data space into Voronoi cells, which means it implicitly assumes that clusters are convex. It tries to find a centroid and assign all points closest to that centroid to the same cluster. This approach cannot correctly identify non-convex shapes like the ‘moons’ or ‘circles’ datasets.
    • Relies on Euclidean Distance to Centroids: The algorithm minimizes the sum of squared distances to cluster centers. This metric inherently favors spherical clusters. For the ‘Moons’ and ‘Circles’ datasets, points that are topologically part of the same cluster might be geometrically further from their ‘true’ centroid than from another cluster’s centroid, leading to incorrect assignments.
    • In the ‘Moons’ dataset, K-means splits each moon into two, trying to make them spherical. In the ‘Circles’ dataset, it attempts to separate the inner and outer circles with a straight line, which is inappropriate for concentric structures.

Finding \(K\) with the silhouette coefficient

It is often the case that the exact number of clusters, \(K\), is not known in advance. We can still apply the k-means algorithm and measure the clustering performance to find the best parameter \(K\). One of the metrics used is the silhouette coefficient.

The silhouette coefficient (or score) enables to compare the average intra- and inter-cluster distances:

\[\begin{align} \text{score} = \frac{b - a}{\max(a, b)} \end{align}\]

with (for each sample):

  • \(a\) the average intra-cluster distance
  • \(b\) the average nearest-cluster distance

The score is computed for each observation (with a value between -1 (worst) and 1 (best)), then the average score enables to assess the clustering on the whole point cloud at once.

Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html#sklearn.metrics.silhouette_score

from sklearn import metrics
print(f"4 blobs: Silhouette coefficient for k-means (k=4) : %.2f" %
      metrics.silhouette_score(four_blobs, kmeans_four_blobs.labels_))
print(f"Moons: Silhouette coefficient for  k-means (k=2) : %.2f" %
      metrics.silhouette_score(moons, kmeans_moons.labels_))
print(f"Circles: Silhouette coefficient for  k-means (k=2) : %.2f" %
      metrics.silhouette_score(circles, kmeans_circles.labels_))
4 blobs: Silhouette coefficient for k-means (k=4) : 0.76
Moons: Silhouette coefficient for  k-means (k=2) : 0.49
Circles: Silhouette coefficient for  k-means (k=2) : 0.35

Let’s try to evaluate the performance of the clustering as a function of the number of clusters \(K\), by varying it in the range [2, .., 8]

k_values = range(2, 9)
names = ['4 blobs', 'Moons', 'Circles']
fig, ax = plt.subplots(2, 3, figsize=(16, 10))

for i, dataset in enumerate([four_blobs, moons, circles]):
    silhouettes = []

    for kval in k_values:

        ### START OF YOUR CODE

        # Initialize a KMeans model with the number of cluster tested:
        kmeans_k = cluster.KMeans(n_clusters=kval)

        # Train the model on the data
        kmeans_k.fit(dataset)

        # Append the silhouette score obtained to the list
        silhouettes.append(metrics.silhouette_score(dataset, kmeans_k.labels_))

        ### END OF YOUR CODE

    # Visualization of the silhouette score
    ax[0,i].plot(k_values, silhouettes)
    ax[0,i].set_xlabel("K")
    ax[0,i].set_ylabel("Silhouette score")
    ax[0,i].set_title(names[i])

    print("Dataset:", names[i])
    best_silhouette = np.max(silhouettes)
    print("Optimal silhouette coefficient: %.2f" % best_silhouette)
    best_K = k_values[silhouettes.index(best_silhouette)]
    print("Corresponding number of cluster K: %.0f" % best_K)

    # Final clustering with the best K
    kmeans_k = cluster.KMeans(n_clusters=best_K)
    kmeans_k.fit(dataset)
    ax[1,i].scatter(dataset[:, 0], dataset[:, 1], c=kmeans_k.labels_, cmap='viridis')
    ax[1,i].set_xlabel('x1')
    ax[1,i].set_ylabel('x2')
    ax[1,i].set_title('Clustering with ' + str(best_K) + ' clusters')
fig.tight_layout()

Conclusions:

  • The k-means algorithm provides a satisfactory clustering for the dataset with four well-separated blobs, even without knowing the ideal number of clusters in advance, in which case the silhouette coefficient allows to find this ideal number.
  • However, despite optimizing the silhouette score, this algorithm does not provide good results for the other datasets, whether for the two nested moons or the two concentric circles.

We will therefore now try another clustering algorithm and test its performance to compare it to k-means. We will limit ourselves to the two datasets for which the k-means algorithm does not work.

DBSCAN (Density-based clustering)

The DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm works in two steps:

  • All sufficiently close observations are connected to each other.
  • Observations with a minimum number of connected neighbors are considered core samples, from which the clusters are expanded. All observations sufficiently close to a core sample belong to the same cluster as this one.

Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html

The DBSCAN algorithm takes two hyperparameters as input:

  • eps: the size of the neighborhood, in other words, the distance between two data points below which one point is considered inside the neighborhood of the other.
  • min_samples: the minimum number of neighbors for a data point to be considered a core sample.
### START OF YOUR CODE

# Initialization of the two DBSCAN models
# with hyperparameters eps=0.2, min_samples=2:
dbscan_moons = cluster.DBSCAN(eps=0.2, min_samples=2)
dbscan_circles = cluster.DBSCAN(eps=0.2, min_samples=2)

# Fit to the data
dbscan_moons.fit(moons)
dbscan_circles.fit(circles)

### END OF YOUR CODE
DBSCAN(eps=0.2, min_samples=2)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Again, the .labels_ attribute contains, for each observation, the number of the cluster to which this observation has been assigned.

print("Number of labels for the moons dataset:", len(np.unique(dbscan_moons.labels_)))
print("Number of labels for the circles dataset:", len(np.unique(dbscan_circles.labels_)))
Number of labels for the moons dataset: 2
Number of labels for the circles dataset: 2

Let’s visualize the clusters obtained:

fig, ax = plt.subplots(1, 2, figsize=(10, 4))

ax[0].scatter(moons[:, 0], moons[:, 1], c=dbscan_moons.labels_, cmap='viridis')
ax[0].set_title("Clustering DBSCAN (eps=0.2)")

ax[1].scatter(circles[:, 0], circles[:, 1], c=dbscan_circles.labels_, cmap='viridis')
ax[1].set_title("Clustering DBSCAN (eps=0.2)");

We can see here that the DBSCAN algorithm is able to identify the two respective clusters in the two datasets that were problematic for the k-means algorithm.

We can also note that we did not need to provide a number of clusters beforehand for the algorithm to correctly identify the right number of clusters. However, the algorithm is sensitive to the two hyperparameters mentioned above, which we will now evaluate.

Role of the neighborhood size parameter (eps)

We will assess the influence of the eps parameter on the concentric circles dataset (circles)

### START OF YOUR CODE

# Choose low and high values for eps, for example 0.05 and 2.0
eps_low = 0.05
eps_high = 2.

# Instantiation of a DBSCAN clustering object with low eps and another one with high eps
dbscan_low = cluster.DBSCAN(eps=eps_low, min_samples=2)
dbscan_high = cluster.DBSCAN(eps=eps_high, min_samples=2)

# Fit to the data
dbscan_low.fit(circles)
dbscan_high.fit(circles)

### END OF YOUR CODE
DBSCAN(eps=2.0, min_samples=2)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
### START OF YOUR CODE

print(np.unique(dbscan_low.labels_))
print(np.unique(dbscan_high.labels_))

### END OF YOUR CODE
[-1  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22
 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
 47 48]
[0]

Question: What is the number of clusters obtained in our two models ? (Use the .labels attribute)

Using np.unique(dbscan_low.labels_) and np.unique(dbscan_high.labels_):

  • For dbscan_low (eps=0.05), the number of clusters is 50 (including the noise points labeled as -1). This indicates that with a very small eps, DBSCAN perceives many small, dense groupings, and also flags many points as noise.

  • For dbscan_high (eps=2.0), the number of clusters is 1 (all points are grouped into a single cluster labeled 0). With a very large eps, almost all points fall within each other’s neighborhood, leading to a single large cluster. The output shows [0], indicating one cluster (0) and no noise (-1) which means all points are connected.

fig = plt.figure(figsize=(5, 5))

outliers = np.where(dbscan_low.labels_ == -1)[0]
plt.scatter(circles[outliers, 0], circles[outliers, 1], marker='*', color='red')

non_outliers = np.where(dbscan_low.labels_ != -1)[0]
plt.scatter(circles[non_outliers, 0], circles[non_outliers, 1], c=dbscan_low.labels_[non_outliers])
plt.title(f"Clustering DBSCAN (eps={eps_low})");

fig = plt.figure(figsize=(5, 5))
plt.scatter(circles[:, 0], circles[:, 1], c=dbscan_high.labels_)
plt.title(f"Clustering DBSCAN (eps={eps_high})");

Find eps with the silhouette coefficient

print("Silhouette coefficient for DBSCAN (eps=0.2) : %.2f" %
      metrics.silhouette_score(circles, dbscan_circles.labels_))
Silhouette coefficient for DBSCAN (eps=0.2) : 0.11
eps_values = np.logspace(-3, 1, 40)
silhouettes = []

for eps in eps_values:
    dbscan_eps = cluster.DBSCAN(eps=eps, min_samples=2)
    dbscan_eps.fit(circles)
    if len(np.unique(dbscan_eps.labels_)) > 1:  # Needed to compute the silhouette coeff
        silhouettes.append(metrics.silhouette_score(circles, dbscan_eps.labels_))
    else:
        silhouettes.append(-1)
plt.plot(eps_values, silhouettes)
plt.xscale("log")
plt.xlabel("eps (log scale)")
plt.ylabel("silhouette");

best_silhouette = np.max(silhouettes)
print(f"Optimal silhouette coefficient: {best_silhouette:.2f}")
print(f"Corresponding eps: {eps_values[silhouettes.index(best_silhouette)]:.2f}")
Optimal silhouette coefficient: 0.13
Corresponding eps: 0.03

Let’s see what the clustering looks like using the eps parameter yielding the best silhouette coefficient:

best_eps = eps_values[silhouettes.index(best_silhouette)]

dbscan_best = cluster.DBSCAN(eps=best_eps, min_samples=2)
dbscan_best.fit(circles)

fig = plt.figure(figsize=(5, 5))
plt.scatter(circles[:, 0], circles[:, 1], c=dbscan_best.labels_)
plt.title(f"Clustering DBSCAN (eps={best_eps:.2f})");

Question:

  • What is the problem here ?
  • Is the silhouette coefficient adapted to our dataset ?
  • What is the problem here? The problem is that despite optimizing the eps parameter using the silhouette coefficient, the DBSCAN algorithm with min_samples=2 failed to correctly identify the two concentric circles. Instead, it seems to have identified only one large cluster (cluster 0) and a significant number of noise points (-1), or a few small clusters that don’t represent the true structure. The optimal eps value found (around 0.03) is too small to connect the points within the rings, especially the outer ring, leading to fragmented clusters and many points being labeled as noise, or merging them into one if eps is too large.

  • Is the silhouette coefficient adapted to our dataset? The silhouette coefficient, while useful for evaluating clustering performance, has limitations that make it less suitable for certain types of datasets, especially those with non-convex clusters like concentric circles. The silhouette score is based on the average distance to points within the same cluster versus the average distance to points in the nearest neighboring cluster. For non-convex shapes, points that are geometrically close but belong to different clusters (e.g., points on the outer edge of the inner circle and inner edge of the outer circle) can confuse the metric. It tends to favor convex, dense, and well-separated clusters. In this case, even if a clustering algorithm perfectly separates the two concentric circles, the silhouette score might not be very high because points from the two distinct circles are geometrically close, which can lead to a lower ‘b’ value and thus a lower score. Therefore, for datasets with complex, non-globular structures like concentric circles, the silhouette coefficient might not be the most appropriate evaluation metric, and other metrics like the Adjusted Rand Index (if true labels are known) or visual inspection might be more informative.

Adjusted Rand Index

The adjusted Rand index enables to compare a clustering’s result with labels. For each pair of observations, we look at whether they belong to the same cluster or not, in the predicted and the true clustering. The index takes values between 0 (random clustering) and 1 (perfect clustering).

Documentation : https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html

Question: Why not use an evaluation metric from classification models here ?

We cannot directly use evaluation metrics from classification models (like accuracy, precision, recall, F1-score) in clustering for the following key reasons:

  1. Unsupervised vs. Supervised Learning:
    • Classification is a form of supervised learning. This means that during training, the model is given both the input features (X) and the corresponding true labels (y) for each data point. Evaluation metrics like accuracy compare the model’s predicted labels directly against these known true labels, matching them one-to-one.
    • Clustering is a form of unsupervised learning. The algorithm is only given the input features (X) and no true labels. Its goal is to discover inherent groupings or structures within the data. While we might have ground truth labels available for evaluation purposes (as in the case of circles_labels or penguins_labels), the clustering algorithm itself does not use them during the clustering process.
  2. Lack of Fixed Label Mapping:
    • In classification, the labels have fixed, predefined meanings (e.g., ‘cat’, ‘dog’; ‘spam’, ‘not spam’).
    • In clustering, the labels assigned by the algorithm (e.g., 0, 1, 2) are arbitrary. Cluster 0 from one run of K-means might correspond to what the true labels call ‘category A’, while cluster 0 from another run (or a different algorithm) might correspond to ‘category B’. There is no intrinsic meaning to the cluster numbers themselves, only to the grouping of data points within them.

Because of this arbitrary nature of cluster labels, a direct comparison using accuracy (e.g., if DBSCAN assigns 0 to the inner circle and 1 to the outer circle, and the true labels are 0 and 1 respectively, but DBSCAN’s 0 actually matches the true 1 and vice-versa) would be misleading. Standard classification metrics would incorrectly report low performance even if the clusters perfectly match the underlying structure. Metrics like the Adjusted Rand Index (ARI) address this by comparing pairs of points: whether they are grouped together or separated in both the predicted and true clusterings, rather than relying on the exact label values.

print("Adjusted Rand Index of K-means (K=2) : %.2f" % metrics.adjusted_rand_score(circles_labels, kmeans_circles.labels_))
Adjusted Rand Index of K-means (K=2) : -0.00
print("Adjusted Rand Index of DBSCAN (eps=0.2) : %.2f" % metrics.adjusted_rand_score(circles_labels, dbscan_circles.labels_))
Adjusted Rand Index of DBSCAN (eps=0.2) : 1.00

Conclusion

This notebook provided a comprehensive exploration of unsupervised learning techniques for dimensionality reduction and clustering.

We began with Principal Components Analysis (PCA), applying it first to the Decathlon dataset to understand variance explanation, project data into lower dimensions, and interpret the meaning of principal components in terms of event performance. We then used PCA on the Olivetti faces dataset, visualizing the resulting 2D projections and interpreting the pixel contributions to the principal components as ‘eigenfaces’. We also briefly introduced t-Distributed Stochastic Neighbor Embedding (t-SNE) as an alternative for visualizing high-dimensional data, noting its ability to better separate classes and the influence of the ‘perplexity’ parameter.

Next, we delved into Clustering Algorithms using synthetically generated datasets (blobs, moons, circles):

  • We manually implemented the K-means algorithm step-by-step to understand its iterative process of centroid assignment and recalculation. We then used sklearn’s implementation, demonstrating its effectiveness on globular clusters (like the 4 blobs) but highlighting its limitations with non-convex or irregularly shaped clusters (like moons and circles).
  • We explored the silhouette coefficient as a metric to evaluate clustering performance and select the optimal number of clusters (k). We observed its effectiveness for well-separated, globular clusters but also discussed its shortcomings when dealing with complex cluster geometries where points from different clusters might be geometrically close.
  • We then introduced DBSCAN (Density-Based Spatial Clustering of Applications with Noise), showcasing its ability to correctly identify non-convex clusters, such as the ‘moons’ and ‘circles’ datasets, where K-means struggled. We investigated the influence of its eps parameter and further discussed the limitations of the silhouette coefficient in evaluating DBSCAN on such datasets.
  • Finally, we learned about the Adjusted Rand Index (ARI), a robust metric for comparing clustering results against known ground truth labels. We specifically highlighted why traditional classification metrics are unsuitable for clustering due to the arbitrary nature of cluster labels and the unsupervised nature of the task, reinforcing the need for metrics like ARI for a fair comparison.

Bonus: Clustering on Penguins

The goal here is to test several unsupervised clustering methods on a new dataset, and to compare them. Try the methods seen above such as sklearn.cluster.KMeans or sklearn.cluster.DBSCAN. You can also try other methods such as Gaussian mixtures (sklearn.mixture.GaussianMixture).

Which methods do you think should work better? Why?

The dataset we suggest you use here concerns different species of penguins and some physical characteristics. The 3 species are: Adelie penguins, Gentoo penguins and Chinstrap penguins.

!wget https://raw.githubusercontent.com/CBIO-mines/fml-dassault-systems-en/main/data/penguins_data.csv

palmerpenguins = pd.read_csv("penguins_data.csv")
--2026-05-13 09:40:45--  https://raw.githubusercontent.com/CBIO-mines/fml-dassault-systems-en/main/data/penguins_data.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 27561 (27K) [text/plain]
Saving to: ‘penguins_data.csv.1’

penguins_data.csv.1 100%[===================>]  26.92K  --.-KB/s    in 0s      

2026-05-13 09:40:45 (132 MB/s) - ‘penguins_data.csv.1’ saved [27561/27561]

Alternatively: If you already downloaded the data, uncomment the following line:

# palmerpenguins = pd.read_csv("data/penguins_data.csv")
palmerpenguins.head()
species label island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Penguin (Pygoscelis adeliae) adelie Torgersen 39.1 18.7 181.0 3750.0 MALE
1 Adelie Penguin (Pygoscelis adeliae) adelie Torgersen 39.5 17.4 186.0 3800.0 FEMALE
2 Adelie Penguin (Pygoscelis adeliae) adelie Torgersen 40.3 18.0 195.0 3250.0 FEMALE
3 Adelie Penguin (Pygoscelis adeliae) adelie Torgersen NaN NaN NaN NaN NaN
4 Adelie Penguin (Pygoscelis adeliae) adelie Torgersen 36.7 19.3 193.0 3450.0 FEMALE

The first step is to filter some data to avoid missing values (NA or NaN).

palmerpenguins = palmerpenguins[palmerpenguins['bill_depth_mm'].notna()]
palmerpenguins = palmerpenguins.reset_index(drop=True)

We will focus only on two characteristics of our penguins: beak length and body mass

penguins_X = np.array(palmerpenguins[["bill_length_mm", "body_mass_g"]])
from sklearn import preprocessing

Our variables have very different scales, as we can see in the data displayed above. Therefore, we need to standardize our data.

# Standardization (centering and scaling)
penguins_X = preprocessing.StandardScaler().fit_transform(penguins_X)
species_names, species_int = np.unique(palmerpenguins.species, return_inverse=True)
penguins_labels = species_int
species_names
array(['Adelie Penguin (Pygoscelis adeliae)',
       'Chinstrap penguin (Pygoscelis antarctica)',
       'Gentoo penguin (Pygoscelis papua)'], dtype=object)

Let’s first visualize our data on a graph:

plt.scatter(penguins_X[:, 0], penguins_X[:, 1], c=penguins_labels)
plt.xlabel("bill_length_mm (scaled)")
plt.ylabel("body_mass_g (scaled)");

It’s now up to you to test different clustering algorithms on these data, and evaluate their performance. Will you be able to obtain a perfect clustering ?

In addition to the clustering algorithms used above, you can try Gaussian mixtures (GaussianMixture) for example, or other methods from the cluster or mixture modules. For each, provide the silhouette coefficients and adjusted Rand indices you obtain, as well as a visualization of the resulting clustering.

KMeans

# Initialization of a k-means model with k=3
kmeans = cluster.KMeans(n_clusters=3)

# Fit to the data
kmeans.fit(penguins_X)
KMeans(n_clusters=3)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
# plt.scatter(penguins_X[:, 0], penguins_X[:, 1], c=penguins_labels, marker='o')
plt.scatter(penguins_X[:, 0], penguins_X[:, 1], c=kmeans.labels_, marker='*')


plt.xlabel("bill_length_mm (scaled)")
plt.ylabel("body_mass_g (scaled)");

print("Silhouette coefficient for k-means (k=3) : %.2f" % metrics.silhouette_score(penguins_X, kmeans.labels_))
Silhouette coefficient for k-means (k=3) : 0.47
print("Adjusted Rand Index for K-means (K=3) : %.2f" % metrics.adjusted_rand_score(penguins_labels, kmeans.labels_))
Adjusted Rand Index for K-means (K=3) : 0.76

DBSCAN

eps_values = np.logspace(-3, 1, 40)
silhouettes = []

for eps in eps_values:
    dbscan_eps = cluster.DBSCAN(eps=eps, min_samples=2)
    dbscan_eps.fit(penguins_X)
    if len(np.unique(dbscan_eps.labels_)) > 1: # needed to compute silhouette coeff
        silhouettes.append(metrics.silhouette_score(penguins_X, dbscan_eps.labels_))
    else:
        silhouettes.append(-1)
plt.plot(eps_values, silhouettes)
plt.xscale("log")
plt.xlabel("eps (log scale)")
plt.ylabel("silhouette");

best_silhouette = np.max(silhouettes)
print("Optimal silhouette coefficient: %.2f" % best_silhouette)
best_eps = eps_values[silhouettes.index(best_silhouette)]
print("Corresponding eps: %.2f" % best_eps)
Optimal silhouette coefficient: 0.47
Corresponding eps: 0.74
dbscan_opt = cluster.DBSCAN(eps=best_eps, min_samples=2)
dbscan_opt.fit(penguins_X)
DBSCAN(eps=np.float64(0.7443803013251689), min_samples=2)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
np.unique(dbscan_opt.labels_)
array([-1,  0])
print("Adjusted Rand Index for DBSCAN : %.2f" % metrics.adjusted_rand_score(penguins_labels, dbscan_opt.labels_))
Adjusted Rand Index for DBSCAN : 0.00
#plt.scatter(penguins_X[:, 0], penguins_X[:, 1], c=penguins_labels, marker='o')
plt.scatter(penguins_X[:, 0], penguins_X[:, 1], c=dbscan_opt.labels_, marker='*')


plt.xlabel("bill_length_mm (scaled)")
plt.ylabel("body_mass_g (scaled)");

Gaussian mixture model

The Gaussian Mixture Model optimizes the parameters of a finite number of Gaussians to fit the data.

Documentation : https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html

from sklearn import mixture
# Initialization of Gaussian Mixture model with 3 components
gmm = mixture.GaussianMixture(n_components=3)

# Fit to the data
gmm.fit(penguins_X)

# Clusters prediction
gmm_labels = gmm.predict(penguins_X)
#plt.scatter(penguins_X[:, 0], penguins_X[:, 1], c=penguins_labels, marker='o')
plt.scatter(penguins_X[:, 0], penguins_X[:, 1], c=gmm_labels, marker='*')


plt.xlabel("bill_length_mm (scaled)")
plt.ylabel("body_mass_g (scaled)");

print("Silhouette coefficient for GMM (k=3) : %.2f" % metrics.silhouette_score(penguins_X, gmm_labels))
Silhouette coefficient for GMM (k=3) : 0.47
print("Adjusted Rand Index for GMM (K=3) : %.2f" % metrics.adjusted_rand_score(penguins_labels, gmm_labels))
Adjusted Rand Index for GMM (K=3) : 0.79