K-means clustering

Introduction

K-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (the cluster center or centroid), which serves as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. K-means clustering minimizes within-cluster variances (squared Euclidean distances), but not regular Euclidean distances, which would be the more difficult Weber problem: the mean optimizes squared errors, whereas only the geometric median minimizes Euclidean distances. Better Euclidean solutions can be found using, for instance, k-medians and k-medoids.
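
Stated formally, given observations x1, ..., xn, the objective is to find a partition S = {S1, ..., Sk} of the observations that minimizes the within-cluster sum of squares. In standard notation (chosen here for illustration, not quoted from a specific source):

    \arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2

where \mu_i is the mean of the points in S_i.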

A visual representation of a dataset being divided into distinct clusters by the K-means algorithm.

Algorithm

The problem is computationally difficult (NP-hard); however, efficient heuristic algorithms are commonly employed and converge quickly to a local optimum. These are usually similar to the expectation-maximization algorithm for mixtures of Gaussian distributions, in that both rely on an iterative refinement approach and both use cluster centers to model the data. However, k-means clustering tends to find clusters of comparable spatial extent, while the expectation-maximization mechanism allows clusters to have different shapes.

The standard algorithm was first proposed by Stuart Lloyd of Bell Labs in 1957 as a technique for pulse-code modulation, although it was not published as a journal article until 1982. In 1965, E. W. Forgy published essentially the same method, which is why it is sometimes referred to as Lloyd-Forgy.

The most common algorithm uses an iterative refinement technique. Due to its ubiquity it is often called the k-means algorithm; it is also referred to as Lloyd's algorithm, particularly in the computer science community.

Standard Algorithm

Given an initial set of k means m1(1),...,mk(1) (see below), the algorithm proceeds by alternating between two steps:

  1. Assignment step: Assign each observation to the cluster whose mean yields the least within-cluster sum of squares (WCSS). Since the sum of squares is the squared Euclidean distance, this is intuitively the "nearest" mean. (Mathematically, this means partitioning the observations according to the Voronoi diagram generated by the means).
  2. Update step: Calculate the new means (centroids) of the observations in the new clusters.

The algorithm has converged when the assignments no longer change. There is no guarantee that the optimum is found using this algorithm.
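
The two steps translate directly into code. Below is a minimal NumPy sketch of the standard algorithm under a Forgy-style initialization; the function name lloyd_kmeans and its parameters are illustrative choices, not part of any established API:

    import numpy as np

    def lloyd_kmeans(X, k, n_iters=100, rng=None):
        """Minimal sketch of Lloyd's algorithm. X is an (n, d) array."""
        rng = np.random.default_rng(rng)
        # Forgy-style initialization: pick k distinct observations as means.
        means = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iters):
            # Assignment step: each observation joins the cluster whose
            # mean is nearest in squared Euclidean distance.
            d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            # Update step: recompute each mean as the centroid of its cluster
            # (keeping the old mean if a cluster ends up empty).
            new_means = np.array([X[labels == j].mean(axis=0)
                                  if np.any(labels == j) else means[j]
                                  for j in range(k)])
            # Converged once the means (hence assignments) stop changing.
            if np.allclose(new_means, means):
                break
            means = new_means
        return means, labels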

The algorithm is often presented as assigning objects to the nearest cluster by distance. Using a distance function other than (squared) Euclidean distance may prevent the algorithm from converging. Various modifications of k-means, such as spherical k-means and k-medoids, have been proposed to allow the use of other distance measures.

Initialization Methods

The quality of the initial set of means can significantly influence the final solution. Commonly used initialization methods include:

  1. Forgy – choosing k observations from the dataset at random and using them as the initial means.
  2. Random Partition – assigning each observation to a cluster at random, and then computing the initial means as the centroids of these random clusters.
  3. Maximin – selecting an initial set of means that are maximally distant from each other, typically by picking each new center as the point farthest from the centers chosen so far.
  4. K-means++ – choosing initial centers so that they are spread out, by sampling each new center with probability proportional to its squared distance from the nearest center already chosen; this speeds up convergence and tends to improve solution quality (see the sketch after this list).
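
As an illustration, here is a minimal NumPy sketch of k-means++ seeding; kmeans_pp_init is an illustrative name, not an established API:

    import numpy as np

    def kmeans_pp_init(X, k, rng=None):
        """Minimal sketch of k-means++ seeding. X is an (n, d) array."""
        rng = np.random.default_rng(rng)
        # First center: chosen uniformly at random from the data.
        centers = [X[rng.integers(len(X))]]
        for _ in range(k - 1):
            # Squared distance from each point to its nearest chosen center.
            d2 = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2)
            d2 = d2.sum(axis=2).min(axis=1)
            # Sample the next center with probability proportional to d2,
            # so far-away points are more likely to seed new clusters.
            centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
        return np.array(centers)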

Convergence and Iteration

The k-means algorithm always converges: each step can only decrease the within-cluster sum of squares, and there are only finitely many ways to partition the observations, so the assignments must eventually stop changing. The result may be a local optimum (i.e., not necessarily the best possible outcome), so running the algorithm several times with randomized starting centroids and keeping the best result may give a better outcome.
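
In code, restarting amounts to keeping the run with the lowest within-cluster sum of squares. A sketch, assuming the lloyd_kmeans function from the earlier sketch is in scope:

    import numpy as np

    def best_of_restarts(X, k, n_restarts=10):
        # Run several randomized restarts and keep the lowest-WCSS result.
        best = None
        for seed in range(n_restarts):
            means, labels = lloyd_kmeans(X, k, rng=seed)  # sketch above
            wcss = ((X - means[labels]) ** 2).sum()
            if best is None or wcss < best[0]:
                best = (wcss, means, labels)
        return best  # (wcss, means, labels) of the best run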

Variations

There are many variations of the standard k-means algorithm. These include:

  1. K-medians: Instead of calculating the mean of the objects in a cluster to form the cluster's center, one instead calculates the median (the sketch after this list contrasts this update with those of k-means and k-medoids).
  2. K-medoids (also: Partitioning Around Medoids, PAM): Instead of calculating the mean of the objects in a cluster to form the cluster's center, one instead chooses one of the points in the cluster to be the center.
  3. K-means++: An algorithm for choosing the initial values (or "seeds") for the k-means clustering algorithm.
  4. Spherical k-means clustering: A variation suited to directional data such as text vectors, in which observations and cluster centers are normalized to unit length and similarity is measured by cosine similarity rather than Euclidean distance.
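
To make the differences concrete, the following sketch shows how only the center-update step changes across k-means, k-medians, and a simplified k-medoids; the helper name update_centers is illustrative, and empty clusters are assumed not to occur:

    import numpy as np

    def update_centers(X, labels, k, variant="means"):
        # Only the aggregation rule differs between the variants.
        centers = []
        for j in range(k):
            pts = X[labels == j]
            if variant == "means":      # k-means: arithmetic mean
                centers.append(pts.mean(axis=0))
            elif variant == "medians":  # k-medians: coordinate-wise median
                centers.append(np.median(pts, axis=0))
            elif variant == "medoids":  # simplified k-medoids: the cluster
                # member with the smallest total distance to the others
                d = np.sqrt(((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=2))
                centers.append(pts[d.sum(axis=1).argmin()])
        return np.array(centers)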

Applications

K-means clustering is used in many fields, including machine learning, computer vision, image analysis, information retrieval, bioinformatics, data compression, and computer graphics.
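
For example, in data compression an image's colors can be quantized by clustering its pixel values and replacing each pixel with its cluster centroid. A minimal sketch using scikit-learn's KMeans (the library choice and the random stand-in data are assumptions of this example, not from the source):

    import numpy as np
    from sklearn.cluster import KMeans

    # Stand-in for an image's RGB values, flattened to one pixel per row.
    pixels = np.random.rand(10_000, 3)
    km = KMeans(n_clusters=16, n_init=10).fit(pixels)
    # Each pixel is replaced by its cluster centroid: 16 "colors" remain.
    quantized = km.cluster_centers_[km.labels_]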

See Also