Cluster analysis is a highly developed field within statistics. It takes as input a set X equipped with a distance function, and potentially some other information, and from that produces a partition of the set, i.e. a decomposition of the set into a collection of groups so that (a) the sets in the collection do not overlap and (b) every point in X is contained in some set in the collection. It is extremely useful because it provides a taxonomy of the points in a data set, since it frequently turns out that the groups can be characterized in a coherent and understandable way.

Topological Data Analysis (TDA) uses cluster analysis in building its networks, and builds on cluster analysis to provide additional precision in the taxonomies that are created. Some people conflate cluster analysis and TDA, which is not accurate. The goal of this post is to explain the differences between the methods.

Let’s first understand a simple clustering method, called single linkage clustering. We are given a set X equipped with a distance function. For simplicity, let’s suppose that it is a set of points in the plane, as pictured below.

CL1

The single linkage strategy is to select a threshold real number R, draw edges between any two points whose distance is less than or equal to R, and divide the data set up into the connected components of the resulting network. Here is what happens for a particular choice of the threshold R.

CL2

 The data set is divided up into six groups. Two are singletons. For a slightly larger value of R, we obtain this decomposition.

CL3

With an even larger value of R, we would obtain a single cluster. The challenge lies in knowing how to choose the threshold. Some heuristic criteria have been developed, but in general no particular mathematical methodology can be expected to provide the ideal answers in all situations. Statisticians have developed a method to avoid the selection of a threshold by encoding all threshold level behavior using a single object, a so-called dendrogram. The dendrogram for the above situation looks something like this.

CL4

This is a mathematical object called a tree. The leaves are the tips of the edges at the bottom of the diagram. Notice that they correspond to the data points – there are 14 of each. The height above the base containing the leaves reflects the clustering threshold R. The correspondence of the data points with the leaves is such that we can identify when two points are merged into the same cluster by tracing the branches until they merge, as indicated in the picture below.

    CL5  

The points a and b are connected at a threshold level slightly below 2. So all the information about the clustering at all thresholds are contained in a single diagram, which is very useful. Note for example that there are three large branches, which correspond to the three main groups visible in the picture of the data points. This diagram therefore gets around the problem of having to select a threshold, and can make clearer the effect of choosing a particular threshold. This method, and variants that also produce dendrograms, get around a very important problem in clustering methodologies, namely the requirement to choose a threshold. As a result, they constitute a large step in the direction of obtaining knowledge about data. There is more to be done, however, as is illustrated by a biomedical example.

The data in this example is a data set consisting of gene expression profiles of breast cancer patients, constructed at the Netherlands Cancer Institute. There are 272 rows, each consisting of expression levels for 1500 genes, and can therefore be regarded as a spreadsheet with 2723 rows and 1500 columns, each with a numerical value. The topological network for this data set looks as follows. It comes from the paper [1].

CL6

The region in the lower right, labelled c-MYB+ tumors, consists entirely of patients who survived the full length of the study. Only gene expression profiling was used to construct the network, without knowledge of any clinical outcomes or judgments. On the other hand, the dendrogram for this data set looks as follows.

CL7

Here the parts of the dendrogram that correspond to members of the “perfect survival group” mentioned above are colored in red. It is clear that there is a high level decomposition into two groups, and that all the members of the perfect survival group fall into one of those high level groups.

However, there are many non-members of the perfect survival group which lie in the same high level group, and it will be very difficult to identify where they fit within that group, due to the fact that they are dispersed among many of the leaves of the dendrogram. One would have been unlikely to discover them using only the dendrogram, whereas they are quite apparent in the TDA network displayed above.

The fundamental problem is that clustering is a methodology for breaking data sets apart into pieces, and that is its only tool. Once the data set is such that it contains progressions, and is in an appropriate sense “connected”, clustering methods will typically break things apart that belong together, and that phenomenon is exhibited in the above dendrogram.

The basic point is that representing some aspects of geometry beyond the breakup into clusters provides greater insight into the detail of the data set, which is difficult to recover using clustering alone.

The business implications follow closely the example from above.

Let’s say you are clustering loans – by successfully repaid and defaulted to determine the characteristics of those groups. In a standard clustering approach using the value of R or the more sophisticated dendrogram approach the outputs are likely to produce groups which are not as well-defined, and often will split groups of data points that belong together, . As a result, the local behavior of the defaulted loans, that behavior that you are trying to understand better is lost or at the very least obscured. Within those groups lie incremental value – value that eluded less sophisticated approaches.

Ayasdi recently performed just such an analysis for a domestic G-SIB institution on their automotive loan portfolio. Using Aysadi’s TDA-infused machine intelligence approach, the bank was able to identify performance improvements of their loan portfolio by 103bp in less than two weeks. Given the size of the loan portfolio, this was worth over $34 million annually to the bank, despite the fact that this portfolio was heavily analyzed.

[1] Nicolau, M., Levine, A.J., and Carlsson, G., Topology based data analysis identifies a subgroup of breast cancers with a unique mutational profile and excellent survival, Proc Natl Acad Sci U S A. 2011 Apr 26;108(17):7265-70. doi: 10.1073/pnas.1102826108