K-Means Clustering

K-means clustering is a simple and widely used method for partitioning a dataset into distinct, non-overlapping clusters. Given observations $(x_1, x_2, \dots, x_n)$, where each observation is a $d$-dimensional real vector, k-means aims to partition the $n$ observations into $k$ sets $S = \{S_1, S_2, \dots, S_k\}$ so as to minimize the within-cluster sum of squares.

Formally, the objective is to find

$$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2,$$

where $\mu_i$ is the mean, or centroid, of the points in $S_i$, that is,

$$\mu_i = \frac{1}{|S_i|} \sum_{x \in S_i} x.$$

Definition

The within-cluster sum of squares measures how tightly clustered the points are around their assigned centroids. Smaller values mean the points within each cluster are, on average, closer to one another.
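As a small sketch of this computation (the points, labels, and centroids below are made up for illustration), the within-cluster sum of squares can be evaluated directly:

```python
import numpy as np

def within_cluster_ss(points: np.ndarray, labels: np.ndarray, centroids: np.ndarray) -> float:
    # Sum of squared Euclidean distances from each point to its assigned centroid.
    return float(sum(np.sum((points[labels == i] - c) ** 2)
                     for i, c in enumerate(centroids)))

# Two tight clusters; each point sits 0.5 away from its centroid.
points = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [11.0, 10.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.5, 0.0], [10.5, 10.0]])
wcss = within_cluster_ss(points, labels, centroids)  # 4 * 0.5**2 = 1.0
```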

Intuition

K-means tries to represent each cluster by a single prototype, namely its centroid. The algorithm succeeds when each point is close to the centroid of the cluster to which it is assigned.

Algorithm

The most common algorithm uses an iterative refinement procedure. Given an initial set of $k$ centroids $m_1^{(1)}, \dots, m_k^{(1)}$, the algorithm alternates between two steps:

  1. Assignment step: assign each observation to the cluster with the nearest centroid.
  2. Update step: recompute each centroid as the mean of the points currently assigned to its cluster.

If $S_i^{(t)}$ denotes the set of points assigned to cluster $i$ at iteration $t$, then the update is

$$m_i^{(t+1)} = \frac{1}{|S_i^{(t)}|} \sum_{x \in S_i^{(t)}} x.$$

Each iteration does not increase the objective, so the algorithm eventually converges to a local optimum. In practice, convergence is often detected when the assignments stop changing or when the centroids move by less than a small tolerance.

import numpy as np

def k_means(data: np.ndarray, k: int, max_iters: int = 100):
    # data has shape (n, d)
    if k <= 0 or k > len(data):
        raise ValueError("k must satisfy 1 <= k <= len(data)")

    # Forgy initialization: k distinct observations serve as the starting centroids.
    centroid_indices = np.random.choice(len(data), size=k, replace=False)
    centroids = [data[i].copy() for i in centroid_indices]

    clusters = [[] for _ in range(k)]
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]

        # Assignment step: assign each point to the nearest centroid.
        for x in data:
            distances = [np.linalg.norm(x - centroid) for centroid in centroids]
            min_idx = int(np.argmin(distances))
            clusters[min_idx].append(x)

        # Update step: recompute each centroid as the mean of its cluster.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroid = np.mean(cluster, axis=0)
            else:
                # Keep the previous centroid if a cluster becomes empty.
                new_centroid = centroids[i]
            new_centroids.append(new_centroid)

        # Converged when the centroids no longer move.
        if all(np.allclose(new, old) for new, old in zip(new_centroids, centroids)):
            break

        centroids = new_centroids

    return centroids, clusters

Warning

A few implementation details matter in practice. Comparing NumPy arrays with == does not give a reliable scalar convergence check, and empty clusters can occur after an assignment step. A production implementation should handle both cases explicitly.
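The following sketch illustrates why a plain `==` comparison is the wrong convergence test (the array values are made up):

```python
import numpy as np

old = np.array([1.0, 2.0])
new = np.array([1.0, 2.0 + 1e-12])  # a centroid that moved by a negligible amount

# == is elementwise: it returns an array, so `if new == old:` raises a ValueError,
# and exact equality is too strict for floating-point centroids anyway.
elementwise = (new == old)           # array([ True, False])

# np.allclose returns a single boolean and tolerates tiny float drift.
converged = np.allclose(new, old)    # True
```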

Initialization Methods

Initialization matters because k-means can converge to different local optima depending on the starting centroids.

Common initialization methods:

  • The Forgy method: randomly choose $k$ observations from the dataset and use these as the initial means. This method tends to spread the initial means out.
  • The Random Partition Method: Randomly assign a cluster to each observation and then proceed to the update step. This method tends to place all means close to the center of the data.
  • k-means++: choose initial centroids sequentially so that later centroids are more likely to be far from the existing ones. This is often preferred in practice because it typically leads to better and more stable solutions.
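As an illustration of the k-means++ idea (the function name and the seeded NumPy generator are choices made here, not a fixed API), the seeding step can be sketched as:

```python
import numpy as np

def kmeans_pp_init(data: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    # First centroid: a uniformly random observation.
    centroids = [data[rng.integers(len(data))]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen centroid so far.
        d2 = np.min([np.sum((data - c) ** 2, axis=1) for c in centroids], axis=0)
        # Sample the next centroid with probability proportional to that distance.
        centroids.append(data[rng.choice(len(data), p=d2 / d2.sum())])
    return np.array(centroids)

# Two well-separated blobs: the second seed must come from the other blob,
# because points in the first blob have zero distance to the first seed.
rng = np.random.default_rng(0)
data = np.vstack([np.zeros((5, 2)), np.full((5, 2), 10.0)])
seeds = kmeans_pp_init(data, 2, rng)
```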

Properties

Important properties:

  1. K-means performs hard clustering: each point is assigned to exactly one cluster.
  2. The algorithm never increases the within-cluster sum of squares from one iteration to the next.
  3. The final solution depends on initialization, since the objective is non-convex and the algorithm can converge to a local optimum.
  4. The method is most natural when Euclidean distance is meaningful and when clusters are roughly spherical and of similar scale.
  5. The centroid is the mean of the assigned points, so the method is sensitive to outliers.

Warning

K-means is not a general-purpose clustering method for every geometry. It can perform poorly when clusters are non-spherical, have very different sizes or densities, or when the features are measured on very different scales.

Because the algorithm uses distances and means, feature scaling often matters. In many applications, standardizing the features before running k-means is important.
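A minimal standardization sketch using plain NumPy (the `standardize` helper and the sample values are illustrative, not part of any library API):

```python
import numpy as np

def standardize(data: np.ndarray) -> np.ndarray:
    # Rescale each feature to zero mean and unit variance so that no single
    # feature dominates the Euclidean distances used by k-means.
    mean = data.mean(axis=0)
    std = data.std(axis=0)
    std[std == 0] = 1.0  # leave constant features unchanged
    return (data - mean) / std

# Feature 1 in metres, feature 2 in grams: very different scales.
raw = np.array([[1.0, 1000.0], [2.0, 3000.0], [3.0, 5000.0]])
scaled = standardize(raw)
```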