## C++ fast hierarchical clustering algorithms

This is a simplified C++ interface to the fast implementations of hierarchical clustering by Daniel Müllner. The original library with interfaces to R and Python can be found on danifold.net and is described in:

- Daniel Müllner:
*fastcluster: Fast Hierarchical, Agglomerative Clustering Routines for R and Python.*Journal of Statistical Software 53, no. 9, pp. 1-18 (2013)

#### About

Hierarchical, or agglomerative clustering is a powerful technique for partitioning a set of observables. In contrast to clustering schemes like K-means, hierarchical clustering does not require the observables to be members of a vector space, but it works on a distance matrix and is thus applicable to arbitrary observables for which a distance metric can be defined.

Daniel Müllner has compared the performance of different hierarchical
clustering algorithms and implemented the fastest of them in C++ with R
and Python interfaces. Whilst these interfaces are described in his
R journal article, direct
use of the underlying C++ functions is tricky and undocumented.
To simplify this, I have written a C++ interface that hides the intricacies
of the internal output formats behind a single function (*hclust_fast*)
and provides two simple functions for the actual partitioning step
(*cutree_k* and *cutree_cdist*).

#### Usage

A complete usage example can be found in the file *demo.cpp*
in the source package. See the file *README* for details.

The interface for the basic clustering algorithms requires a condensed
distance matrix, which is the upper triangle (without the diagonal elements)
of the full distance matrix. Here is an example for its construction
where *n* is the number of observables *x*:

double* distmat = new double[(n*(n-1))/2]; int k,i,j; for (i=k=0; i<n; i++) { for (j=i+1; j<n; j++) { // compute distance between observables i and j distmat[k] = distance(x[i], x[j]); k++; } }

The actual clustering is done by the function *hclust_fast*, which
supports four methods for defining a cluster distance from individual
distances (*HCLUST_METHOD_SINGLE, HCLUST_METHOD_COMPLETE,
HCLUST_METHOD_AVERAGE, HCLUST_METHOD_MEDIAN*):

int* merge = new int[2*(n-1)]; double* height = new double[n-1]; hclust_fast(n, distmat, HCLUST_METHOD_SINGLE, merge, height);

*height* is filled with the cluster distance for each step. This
can be useful for automatically determining the clustering break point,
e.g, with the "elbow method".
*merge* contains the dendrogram in the encoding of the R
function *hclust*. Fortunately, you do not need to understand this
encoding, because there are two function for further processing the output:

int* labels = new int[n]; // partitioning into nclust clusters cutree_k(n, merge, nclust, labels); // stop clustering at step with custer distance >= cdist cutree_cdist(npoints, merge, height, cdist, labels);

*labels[i]* is filled with the cluster label
of observable *x[i]*. Cluster labels start at zero.
Final note: do not forget to free the memory after you are done with
the variables:

delete[] distmat; delete[] merge; delete[] height; delete[] labels;

#### Download

A copy of the source code of the library together with a demo program, sample data, and a script for visualization of the clustering result can be found on Github. Here is the latest file release:

- hclust-cpp.tgz (version 1.2 from 2020/01/13)

#### Authors and license

The fastcluster algorithm implementation is copyrighted by Daniel Müllner. The simplified C++ interface and the sample program is copyrighted by Christoph Dalitz. The complete code can be freely used under the terms a two-clause BSD-style license. See the file LICENSE in the source package for details.