Content

C++ fast hierarchical clustering algorithms

This is a simplified C++ interface to the fast implementations of hierarchical clustering by Daniel Müllner. The original library with interfaces to R and Python can be found on danifold.net and is described in:

About

Hierarchical, or agglomerative clustering is a powerful technique for partitioning a set of observables. In contrast to clustering schemes like K-means, hierarchical clustering does not require the observables to be members of a vector space, but it works on a distance matrix and is thus applicable to arbitrary observables for which a distance metric can be defined.

Daniel Müllner has compared the performance of different hierarchical clustering algorithms and implemented the fastest of them in C++ with R and Python interfaces. Whilst these interfaces are described in his R journal article, direct use of the underlying C++ functions is tricky and undocumented. To simplify this, I have written a C++ interface that hides the intricacies of the internal output formats behind a single function (hclust_fast) and provides two simple functions for the actual partitioning step (cutree_k and cutree_cdist).

Usage

A complete usage example can be found in the file demo.cpp in the source package. See the file README for details.

The interface for the basic clustering algorithms requires a condensed distance matrix, which is the upper triangle (without the diagonal elements) of the full distance matrix. Here is an example for its construction where n is the number of observables x:

double* distmat = new double[(n*(n-1))/2];
int k,i,j;
for (i=k=0; i<n; i++) {
  for (j=i+1; j<n; j++) {
    // compute distance between observables i and j  
    distmat[k] = distance(x[i], x[j]);
    k++;
  }
}

The actual clustering is done by the function hclust_fast, which supports four methods for defining a cluster distance from individual distances (HCLUST_METHOD_SINGLE, HCLUST_METHOD_COMPLETE, HCLUST_METHOD_AVERAGE, HCLUST_METHOD_MEDIAN):

int* merge = new int[2*(n-1)];
double* height = new double[n-1];
hclust_fast(n, distmat, HCLUST_METHOD_SINGLE, merge, height);

height is filled with the cluster distance for each step. This can be useful for automatically determining the clustering break point, e.g, with the "elbow method". merge contains the dendrogram in the encoding of the R function hclust. Fortunately, you do not need to understand this encoding, because there are two function for further processing the output:

int* labels = new int[n];
// partitioning into nclust clusters
cutree_k(n, merge, nclust, labels);
// stop clustering at step with custer distance >= cdist
cutree_cdist(npoints, merge, height, cdist, labels);

labels[i] is filled with the cluster label of observable x[i]. Cluster labels start at zero. Final note: do not forget to free the memory after you are done with the variables:

delete[] distmat;
delete[] merge;
delete[] height;
delete[] labels;

Download

A copy of the source code of the library together with a demo program, sample data, and a script for visualization of the clustering result can be found on Github. Here is the latest file release:

Authors and license

The fastcluster algorithm implementation is copyrighted by Daniel Müllner. The simplified C++ interface and the sample program is copyrighted by Christoph Dalitz. The complete code can be freely used under the terms a two-clause BSD-style license. See the file LICENSE in the source package for details.