Understanding the Distinction Between Clustering and TDA

We are often asked to explain the distinction between cluster analysis and Topological Data Analysis (TDA).  

The primary distinction is that TDA can effectively represent continuous variation, whereas clustering methods cannot.  That is an important distinction with very real implications, as demonstrated in our earlier post on this topic.  

It is also the case, however, that there are important use cases where the output or an internal state of the computation is a division of a data set into groups. Because the output of a clustering method is such a decomposition into groups, it is then assumed that TDA is simply a clustering method.

In this post, I want to further clarify the relationship between TDA, clustering and segmentation, and discuss why the flexibility of the TDA framework usually gives improved results.  

To begin, cluster analysis [1] is a method of discretization – that is, any procedure which takes continuous information and constructs a discrete outcome, i.e. a finite collection of points. While cluster analysis is a method of discretization, it should also be noted that it is not the only one.  We will give two other examples, segmentation and hotspots, but there are certainly many others.

Segmentation

Segmentation is often conflated with clustering but that is incorrect.  Let’s explain the distinction between them by an analogy.  Consider the map of the United States, and let’s divide it into its connected components, by which we mean maximal connected regions.  In this case, the cluster decomposition would divide the landmass of the United States into the lower 48 contiguous states, Alaska, the eight islands of Hawaii, and numerous other islands, such as  Nantucket, Santa Catalina, the Aleutians, etc.  If we include territories, it would also include Puerto Rico, the U.S. Virgin Islands, Guam, etc.  

The point is that within the lower 48 states, it is possible to travel by land from any one state to any other, and it is therefore not broken up into connected pieces.  The cluster analysis is in this case an automatic well-defined procedure, not subject to human intervention.  By contrast, one could divide the lower 48 states into the 48 different states, each of which is a segment.  Since each segment abuts some other segments, it is not a clustering, but would be termed a segmentation.  There is a fair degree of arbitrariness about it, since these boundaries were created by human decisions and negotiations.  Another possible segmentation would be into five segments, namely the Northeast, the Southeast, the Midwest, the Pacific Northwest, and the Southwest.  These segments are less arbitrary, in that they reflect geography and culture of the inhabitants, but it would be difficult to arrive at them by any automated procedure.  

There are mathematical methods for performing analogues of these tasks. Clustering algorithms are attempts to reconstruct the connected components of an object from samples taken from that object.  There is a large variety of algorithms, typically obtained via optimization of various objective functions that capture some intuitive aspects of the notion of connected components.  

No single choice of such objective function, however, will be suitable in all circumstances.  

There are also segmentation algorithms for various different situations, such as image segmentation, graph segmentation, text segmentation, audio segmentation, and numerous others.  Some of these algorithms come under the heading of Pattern Theory (see [2], [3]), but there are many different algorithms.  Here the methods usually vary from data type to data type.  There is no unified mathematical concept that plays the same role for segmentation that the notion of connected components plays for clustering.

Hot Spot Analysis

Mathematically, hot spot analysis examines the behavior of a function on a geometric object and studies regions in which the value of the function is elevated relative to surrounding points.  For example, there are often local maxima of a function on the plane, and one would then determine these points, together with small regions around them where the value of the function is elevated.  In terms of the geographical analogy above, if we consider the function on the United States map given by elevation over sea level, hot spot analysis would generate a collection of mountain ranges.  The Rockies and the Appalachians would all appear as groups, and they would be distinct since we cannot take a path from one to the other while maintaining high altitude.  

The discretization would in this case be a list of mountain ranges.  

Note that there are questions here about what one would consider to be high enough to constitute a mountain range, and this requires human decision making.  Therefore, hot spot analysis is not a priori an automatic procedure, although there are approaches to automation that can approximate human judgments.  Note also that this kind of analysis does not constitute a segmentation of the entire set, since only the points at high elevation belong.  It is instead a segmentation of points that are of high altitude.  

Hierarchical Clustering

Returning to clustering methods, there are numerous methods developed whose input is a finite set with distance function (finite metric space) and whose output is a partition of the underlying set. Each of them is based on a simple procedure or objective function, and is designed to mimic the notion of connected components.  

It was found that no one method works well for all data sets, and that one is therefore reduced to trial and error among methods.  Further, there is no theory that demonstrates that any of these methods bears a strong relationship to the notion of connected components of a space from which the points of the data set are sampled.  It is a significant problem that there are so many choices of clustering algorithm, and it is difficult or impossible to examine them all.  

Hierarchical clustering was developed as a way of addressing this issue in a limited way.

Let’s recall that single linkage clustering is obtained by choosing a threshold R, and then connecting all pairs of data points whose distance is less than or equal to R.  This produces a partition of the data set into disjoint subsets.  However, it depends on a choice of threshold level, and of course there are infinitely many choices for the threshold. This means that we are in a situation where we have an infinite “search space” of cluster decompositions, one for each value of R.  It is of course impossible to examine the entire search space directly, but hierarchical clustering provides a method for obtaining an understanding of that search space in a useful way.  

At the heart of hierarchical clustering is the construction of a dendrogram. This provides a clear summary of the partitions at every level of R in one package. The dendrogram should be viewed as a search tool for a small subspace of the space of all cluster decompositions. Once the dendrogram is constructed, one can examine it by eye to select a plausible value of R, and one can also develop algorithms that operate on the dendrogram to automate the selection procedure.  

Of course, as constructed, the method deals only with single linkage clustering, but is nevertheless useful.

Variants on this procedure (such as HDBSCAN)  have been developed, and have been demonstrated to have great utility.  For us, the key point of the example is that it makes accessible a subspace of the set of partitions, that can be examined and acted upon without one by one enumeration of possibilities.  Further, it makes clear the relationships among the different partitions for varying values of R.

TDA and Clustering

As mentioned above, TDA has a number of uses beyond clustering.  However, it can and is used as a way of clustering data, and more generally to create families of groups which might not necessarily form a partition of the data. In particular, it can also be used as a segmentation tool. To use it as a clusterer, one can build the topological model (a network in the computer science sense) of the data, and form the connected components of the graph.  

This is an automatic procedure, and often works quite well.

Because the topological model is designed to be an approximation to the underlying shape of the data, these cluster will be expected to approximate the connected components of the idealized space from which the points are sampled.  However, by examining the data by querying the groups, one might see that there are groups that belong together and should be merged, and one might also see that there are also groups that should likely be split.  

The point is that examination of the network allows one to examine a part of the search space of all cluster decompositions and, in a rough way, find decompositions that approximate the decomposition of the idealized space into connected components.  

TDA and Segmentation

Segmentation is understood to be a complex problem, with different methods available depending on the kind of data.  It also typically proceeds by optimizing an objective function, but they tend to create more complex optimization problems and they are often dependent on the nature of the data being segmented (images, audio, text, etc.)? We have developed segmentation methods particular to TDA, and found that they work extremely well by circumventing the need to create an objective function. We have found that this approach works particularly well in solving many of the problems of our customers.

TDA and Hot Spot Analysis

When one has a topological network model of a data set, and an outcome variable of interest, or a group within the data set, one can create functions on the nodes on which one can perform hot spot analysis.  In the case of a continuous outcome variable, since each node corresponds to a collection of data points, one may evaluate the mean value of the variable and define that to be the value of a function of the nodes.  Similarly, if we have a given group, each node can be assigned the fraction of the members of the corresponding collection that belong to the group.  This is also a quantitative function of the nodes.  One can now perform hot spot analysis on this function on the nodes of the network, and this kind of analysis has been shown to be enormously valuable.  Because one has the explicit geometric model, it is possible to select the hot spot groups without having to construct artificial methods for selection of thresholds.  

Conclusion

All of the different kinds of discretization have a great deal of importance for problems in business.  Clustering and segmentation are fundamentally important in understanding one’s customer base, in performing anomaly detection, in understanding patient cohorts, and many other situations.  Similarly, hot spot analysis is vitally important in many more supervised problems, such as determining genetic profiles with increased propensity for various disorders, trading strategies that have unusually high return, and many others.  

 

[1] Brian S. Everitt, Sabine Landau, Morven Leese, and Daniel Stahl, Cluster Analysis, Wiley, 2011.

[2] Roger LeRoy Miller and Ulf Grenander, Pattern Theory: From Representation to Inference, Oxford University Press, 2007.

[3] David Mumford and Agnes Desolneux, Pattern Theory: The Stochastic Analysis of Real-World Signals, CRC Press, 2010.