In a previous post, we discussed how clustering requires us to select group assignments and group representatives that minimize \(J^{\text{clust}}\). But the two choices are circular: each depends on the other. To overcome this, we iterate between the two choices. We repeatedly alternate between updating the group assignments and then updating the representatives. In each successive step, the objective \(J^{\text{clust}}\) gets better (it decreases). This iteration is what the k-means algorithm does.


given a list of \(N\) vectors \(x_1, \cdots, x_N\), and an initial list of k group representative vectors \(z_1, \cdots, z_k\)

repeat until convergence

  1. Partition the vectors into k groups. For each vector \(i = 1, \cdots, N\), assign \(x_i\) to the group associated with the nearest representative.
  2. Update representatives. For each group \(j = 1, \cdots, k\), set \(z_j\) to be the mean of the vectors in group \(j\).
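The two steps can be sketched in NumPy (a minimal illustration, not a reference implementation; here `x` is assumed to be an \((N, d)\) array of the vectors and `z` a \((k, d)\) array of the current representatives):

```python
import numpy as np

def kmeans_iteration(x, z):
    # Step 1: assign each x_i to its nearest representative.
    dist = np.linalg.norm(x[:, None] - z[None, :], axis=2)  # (N, k) distances
    assign = dist.argmin(axis=1)
    # Step 2: move each representative to the mean of its group.
    z_new = np.array([x[assign == j].mean(axis=0) for j in range(len(z))])
    return assign, z_new
```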


  • Ties in step 1 can be broken by assigning \(x_i\) to the group whose representative is closest and, among those, has the smallest index \(j\).

  • It is possible that in step 1, one or more groups are empty. In this case, we simply drop the group (and its representative). This leads to a partition of the vectors with fewer than k groups.

  • If the group assignments found in step 1 are the same in two successive iterations, the representatives in step 2 will also be the same. It follows that group assignments and representatives won’t change in later iterations either, so we should stop the algorithm. That’s what convergence means. In practice, we often stop the algorithm earlier, as soon as the improvement in \(J^{\text{clust}}\) becomes very small.

  • We start the algorithm with a choice of initial group representatives. One simple method is to pick the representatives randomly from the original vectors; another is to start from a random assignment of the original vectors to k groups, and use the means of the groups as the initial representatives. There are even more sophisticated methods in the literature.


The fact that \(J^{\text{clust}}\) decreases in each step, together with the fact that there are only finitely many possible partitions, implies that the k-means algorithm converges in a finite number of steps. However, depending on the initial choice of representatives, the algorithm can, and does, converge to different final partitions, with different objective values.

The algorithm is heuristic, which means it cannot guarantee that the partition it finds minimizes our objective \(J^{\text{clust}}\). It is common to run the algorithm several times, with different initial choices of representatives, and keep the partition with the smallest \(J^{\text{clust}}\).
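Putting the steps together with this restart strategy, a minimal from-scratch sketch might look like the following (my own illustration, assuming Euclidean distance, initialization from random data points, and that no group goes empty; `np.argmin` naturally breaks ties at the smallest index \(j\)):

```python
import numpy as np

def kmeans(x, z, iters=100):
    """One run of k-means on (N, d) data x from (k, d) initial representatives z.

    Returns the final representatives, assignments, and objective J^clust.
    """
    assign = None
    for _ in range(iters):
        d = np.linalg.norm(x[:, None] - z[None, :], axis=2)  # (N, k) distances
        new = d.argmin(axis=1)  # step 1: nearest representative (ties -> smallest j)
        if assign is not None and np.array_equal(new, assign):
            break  # assignments unchanged in two successive iterations: converged
        assign = new
        z = np.array([x[assign == j].mean(axis=0)  # step 2: group means
                      for j in range(len(z))])
    jclust = (d.min(axis=1) ** 2).mean()  # mean squared distance to nearest rep
    return z, assign, jclust

def kmeans_restarts(x, k, runs=10, seed=0):
    # Run from several random initializations (representatives drawn from the
    # data itself) and keep the partition with the smallest objective.
    rng = np.random.default_rng(seed)
    return min((kmeans(x, x[rng.choice(len(x), k, replace=False)])
                for _ in range(runs)),
               key=lambda result: result[2])
```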

Example scikit-learn

This example is adapted from Jake VanderPlas’ blog. There are a number of intricacies in how to use KMeans as implemented in sklearn that I choose to omit here. There are also plenty of examples of how to implement k-means from scratch.

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

samples, clusters = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
plt.scatter(samples[:, 0], samples[:, 1])

kmeans = KMeans(n_clusters=4)
kmeans.fit(samples)  # fit before predicting
## KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
##        n_clusters=4, n_init=10, n_jobs=None, precompute_distances='auto',
##        random_state=None, tol=0.0001, verbose=0)
y_kmeans = kmeans.predict(samples)
centers = kmeans.cluster_centers_

plt.scatter(samples[:, 0], samples[:, 1], c=y_kmeans)
plt.scatter(centers[:, 0], centers[:, 1], c="black", alpha=0.5, s=200)
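Note that sklearn already implements the multiple-restart advice from earlier: the `n_init` parameter controls how many random initializations are tried, and the fitted estimator exposes the final objective as `inertia_` (the sum of squared distances to the nearest center, i.e. \(N \cdot J^{\text{clust}}\) here). A small sketch comparing explicit restarts:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# n_init=1 disables sklearn's built-in restarts, so each fit reflects a
# single random initialization; we then keep the smallest objective ourselves.
fits = [KMeans(n_clusters=4, init="random", n_init=1, random_state=s).fit(X)
        for s in range(5)]
best = min(fits, key=lambda km: km.inertia_)
```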