Why Topological Data Analysis?

Topology can be characterized as the part of mathematics that studies notions of shape.  It consists of at least two separate threads: one in which one attempts to “measure” shape, and another in which one attempts to find compressed combinatorial representations of shape and to analyze the degree to which these representations are faithful to it.  The first proceeds primarily via algebraic invariants, such as homology and homotopy groups, which measure and count the instances of particular patterns within the shape in a systematic way.  The second is the subject of a great deal of manifold topology, and is exemplified by the work on the “Hauptvermutung” concerning the existence of a common subdivision of any two triangulations of a manifold.

Both of these threads have been extended to the world of point clouds of data.  The measurement thread is extended via the theory of persistent homology and its variants.  The representation thread is extended by various simplicial complex constructions, such as Vietoris-Rips complexes, witness complexes, and the complexes constructed by Ayasdi’s Platform.  In ordinary topology, the role of the combinatorial representations is to lend additional concreteness to the study of the shape, as well as to provide a succinct representation of it.  They serve the same purpose in the study of high dimensional and complex data sets: they provide a compressed representation of the data which retains information about the geometric relationships between data points.  The representations are also easy to work with, so they provide simple and extremely useful ways to interrogate the data and to understand the driving variables that characterize various subgroups.  At a high level, one can say that they allow for easy identification of coherent groups within the data.  The search for coherent groups, performed naively, is clearly intractable, since it requires searching through the collection of all subsets of the data set, and a data set of n points has 2^n subsets.
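To make the idea of a combinatorial representation concrete, here is a minimal sketch of a Vietoris-Rips construction on a toy point cloud, followed by a read-off of the coherent groups as connected components.  It is an illustration under stated assumptions, not a description of how Ayasdi’s Platform builds its complexes: the function names `vietoris_rips` and `connected_components`, the sample points, and the scale `eps` are all hypothetical, and the brute-force enumeration is only practical for very small inputs.

```python
from itertools import combinations
from math import dist


def vietoris_rips(points, eps, max_dim=2):
    """Return the simplices of a Vietoris-Rips complex as sorted index tuples.

    A set of points spans a simplex when every pairwise distance is <= eps.
    Brute force: suitable only for small illustrative point clouds.
    """
    n = len(points)
    # Pairs of indices (i < j) whose points lie within eps of each other.
    close = {
        (i, j)
        for i, j in combinations(range(n), 2)
        if dist(points[i], points[j]) <= eps
    }
    simplices = [(i,) for i in range(n)]  # every point is a 0-simplex
    for k in range(2, max_dim + 2):  # k vertices span a (k - 1)-simplex
        for candidate in combinations(range(n), k):
            if all(pair in close for pair in combinations(candidate, 2)):
                simplices.append(candidate)
    return simplices


def connected_components(n, simplices):
    """Group vertex indices by connected component, using only the edges."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for simplex in simplices:
        if len(simplex) == 2:
            parent[find(simplex[0])] = find(simplex[1])

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical toy data: two well-separated clusters in the plane.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
complex_ = vietoris_rips(points, eps=1.5)
groups = connected_components(len(points), complex_)
print(len(complex_), groups)
```

On this toy input the complex has 5 vertices, 4 edges, and 1 triangle, and the two clusters appear as two components, found without enumerating any of the 2^5 subsets of the data.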

Ultimately, both sets of ideas will be useful in permitting investigators to study their data.  The representations are at the forefront, because they are what a user deals with directly.  As we move further into automation, measuring the shape of a data set and of Ayasdi’s complex outputs will become critical.  We will want, for example, to test Ayasdi constructions for the presence of geometric features such as flares and loops, so as to provide the user with the best possible “quick analysis”, automatically building complexes without requiring the user to select parameter values, metrics, and lenses by hand.
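As a rough illustration of what an automated test for loops might look like, the sketch below counts independent cycles in a graph using the identity b1 = E - V + C (edges minus vertices plus connected components).  This is an assumption-laden simplification rather than Ayasdi’s method: it treats a complex as a plain graph and ignores higher-dimensional simplices that might fill a cycle in, and the name `count_loops` and the example graphs are hypothetical.

```python
def count_loops(num_nodes, edges):
    """Return the number of independent cycles in an undirected graph.

    Uses b1 = E - V + C, where C is the number of connected components.
    Self-loops and duplicate edges are discarded before counting.
    """
    parent = list(range(num_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    unique_edges = {tuple(sorted(e)) for e in edges if e[0] != e[1]}
    components = num_nodes
    for a, b in unique_edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
            components -= 1
    return len(unique_edges) - num_nodes + components


# Hypothetical outputs: a 4-cycle with a tail attached, and a tree.
cycle_with_tail = [(0, 1), (1, 2), (2, 3), (3, 0), (3, 4)]
tree = [(0, 1), (1, 2), (1, 3)]

loops_in_cycle = count_loops(5, cycle_with_tail)
loops_in_tree = count_loops(4, tree)
print(loops_in_cycle, loops_in_tree)
```

A nonzero count flags a candidate loop worth showing to the user; a fuller treatment would use homology of the whole complex, or persistent homology across scales, rather than the graph alone.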