Cluster Analysis

What Is Cluster Analysis?

Cluster analysis involves applying clustering algorithms with the goal of finding hidden patterns or groupings in a data set. It is therefore used frequently in exploratory data analysis, but is also used for anomaly detection and preprocessing for supervised learning.

Clustering algorithms form groupings in such a way that data within a group (or cluster) have a higher measure of similarity than data in any other cluster. Various similarity measures can be used, including Euclidean, probabilistic, cosine distance, and correlation. Most unsupervised learning methods are a form of cluster analysis.

Clustering algorithms fall into two broad groups:

  1. Hard clustering, where each data point belongs to only one cluster, such as the popular k-means method.
  2. Soft clustering, where each data point can belong to more than one cluster, such as in Gaussian mixture models. Examples include phonemes in speech, which can be modeled as a combination of multiple base sounds, and genes that can be involved in multiple biological processes.
K-means is a cluster analysis method, which is used to separate data into clusters without supervision.

k-means clustering, which represents groups by their centroid—the average of each member, depicted by the stars in the figure above.

A Gaussian mixture model is a probabilistic cluster analysis method, which assigns cluster membership probabilities to data.

Gaussian mixture model, which assigns cluster membership probabilities, representing strength of association with different clusters.

Cluster analysis is used in a variety of domains and applications to identify patterns and sequences:

  • Clusters can represent the data instead of the raw signal in data compression methods.
  • Clusters indicate regions of images and lidar point clouds in segmentation algorithms.
  • Genetic clustering and sequence analysis are used in bioinformatics.

Clustering techniques are also used to establish similarity between labeled and unlabeled data in semi-supervised learning, where initial models are built with minimal labeled data, and used to assign labels to originally unlabeled data. By contrast, semi-supervised clustering incorporates available information about the clusters into the clustering process, such as if some observations are known to belong to the same cluster, or some clusters are associated with a particular outcome variable.

MATLAB® supports many popular cluster analysis algorithms:

  • Hierarchical clustering builds a multilevel hierarchy of clusters by creating a cluster tree.
  • k-means clustering partitions data into k distinct clusters based on distance to the centroid of a cluster.
  • Gaussian mixture models form clusters as a mixture of multivariate normal density components.
  • Spatial clustering (such as the popular density-based DBSCAN) groups points that are close to each other in areas of high density, keeping track of outliers in low-density regions. Can handle arbitrary non-convex shapes.
  • Self-organizing maps use neural networks that learn the topology and distribution of the data.
  • Spectral clustering transforms input data into a graph-based representation where the clusters are better separated than in the original feature space. The number of clusters can be estimated by studying eigenvalues of the graph.
  • Hidden Markov models can be used to discover patterns in sequences, such as genes and proteins in bioinformatics.

Key Points

  • Cluster analysis is frequently used in exploratory data analysis, for anomaly detection and segmentation, and as preprocessing for supervised learning.
  • k-means and hierarchical clustering remain popular, but for non-convex shapes more advanced techniques such as DBSCAN and spectral clustering are required.
  • Additional unsupervised methods that can be used to discover groupings in data include dimensionality reduction techniques and feature ranking.

Cluster Analysis Example in MATLAB

Using the imsegkmeans command (which uses the k-means algorithm), MATLAB assigned three clusters to the original image (tissue stained with hematoxylin and eosin), providing a segmentation of the tissue into three classes (represented as white, black, and grey). Try it yourself as well as related segmentation approaches in this code example.

Image of tissue stained with hematoxylin and eosin.
Image of tissue segmented into tree classes using cluster analysis.

See also: Statistics and Machine Learning Toolbox, Machine Learning with MATLAB, Image Processing Toolbox, probability distributions