Topological Data Analysis: A Framework for Machine Learning

Machine learning is a collection of techniques for understanding data, including methods for visualization, prediction, classification and other tasks relevant for making sense of data. 

The visualization techniques come under the heading of scatterplot methods, where one projects the data points onto two or sometimes three dimensions and then plots the projected points on these coordinates in the usual way. The projection techniques include principal component analysis, multidimensional scaling, and projection pursuit.

Topological Data Analysis (TDA), on the other hand, represents data using topological networks. A topological network represents data by grouping similar data points into nodes, and connecting two nodes by an edge if the corresponding collections have a data point in common. Because each node represents multiple data points, the network gives a compressed representation of even very high-dimensional data.
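The edge rule described above can be sketched in a few lines of Python. This is a minimal illustration, not a full TDA pipeline: the node names and group memberships are made up for the example, and the helper name `build_network` is hypothetical. How the overlapping groups are produced in the first place is left open here.

```python
from itertools import combinations

def build_network(groups):
    """Connect two nodes when their groups share at least one data point.

    groups: dict mapping a node name to the set of data point ids it contains.
    Returns a set of edges (a, b) with a < b.
    """
    edges = set()
    for a, b in combinations(sorted(groups), 2):
        if groups[a] & groups[b]:  # non-empty intersection means a shared point
            edges.add((a, b))
    return edges

# Hypothetical overlapping groups of data point ids (illustrative only).
groups = {
    "A": {0, 1, 2},
    "B": {2, 3, 4},
    "C": {4, 5},
    "D": {6, 7},
}

edges = build_network(groups)
print(sorted(edges))
```

Here eight data points are summarized by four nodes. Nodes A and B are joined because both contain point 2, B and C because both contain point 4, and D stays isolated because it shares no point with any other group.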

Many network representations do not afford this compression, and so produce a complex network that can be difficult to interpret. What is interesting is that the topological network can be constructed from the results of machine learning techniques. It can therefore produce a representation of the scatterplot that is easier to understand and interact with, and it often provides more resolution of the data. Topological networks allow individuals to easily interrogate machine learning outputs in a way that highlights high-value segments of the data.

[Figure: a scatterplot (left), a network with one node per data point (middle), and a topological network (right)]

Above, you will see a scatterplot on the left, in the middle a network representation in which each node corresponds to a data point, and on the right a topological network in which each node is a collection of data points.

Cluster analysis is another class of techniques within machine learning. In cluster analysis, the goal is to divide a data set into disjoint groups that have some distinct defining properties, or conceptual coherence. When data sets naturally break into such distinct groups, as in the case below, this family of techniques works quite well at finding such a decomposition.

[Figure: point cloud that breaks into distinct groups]

In other situations, though, such as the data set below, clustering techniques will not find a meaningful decomposition. Using TDA, however, you can still form groups of data points while retaining the information needed to connect the groups, indicating which ones contain points that are close to the points in another group.

[Figure: point cloud arranged in a loop]

So, a topological network representation of the data set above would look as follows. 

[Figure: topological network forming a loop]

Note that it accurately captures the “loopy” characteristic of the data set, a property which could not even be expressed within clustering theory.  This loopy behavior in this case represents periodic behavior in the data set, which is an important characteristic to know about.  This is what is meant by the statement “shape has meaning”.  
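To make the loop concrete, here is a small sketch, under stated assumptions, of how a network built from overlapping groups can recover a loop from points on a circle. The construction is a simplified one in the spirit of the Mapper algorithm and is not necessarily the one used to produce the figure: the x-coordinate is used as the projection, the four overlapping intervals and the distance threshold `eps` are hand-picked for this example, and the function names are hypothetical. The number of independent loops in a graph is computed as edges minus nodes plus connected components.

```python
import math
from itertools import combinations

def components(items, linked):
    """Split items into connected components under the symmetric relation linked(a, b)."""
    remaining = list(items)
    groups = []
    while remaining:
        stack = [remaining.pop()]
        group = set(stack)
        while stack:
            cur = stack.pop()
            near = [o for o in remaining if linked(cur, o)]
            for o in near:
                remaining.remove(o)
            group.update(near)
            stack.extend(near)
        groups.append(frozenset(group))
    return groups

def mapper_graph(points, lens, intervals, eps):
    """Nodes: clusters of points within each interval of the lens value.
    Edges: pairs of nodes that share at least one data point."""
    nodes = []
    for lo, hi in intervals:
        members = [i for i, p in enumerate(points) if lo < lens(p) < hi]
        close = lambda i, j: math.dist(points[i], points[j]) <= eps
        nodes.extend(components(members, close))
    edges = {(a, b) for a, b in combinations(range(len(nodes)), 2)
             if nodes[a] & nodes[b]}
    return nodes, edges

def independent_cycles(nodes, edges):
    """Number of independent loops in the graph: E - V + (connected components)."""
    linked = lambda a, b: (a, b) in edges or (b, a) in edges
    n_components = len(components(range(len(nodes)), linked))
    return len(edges) - len(nodes) + n_components

# 60 evenly spaced points on the unit circle (synthetic data for illustration).
points = [(math.cos(2 * math.pi * k / 60), math.sin(2 * math.pi * k / 60))
          for k in range(60)]
# Hand-picked overlapping intervals covering the x-coordinate range [-1, 1].
intervals = [(-1.01, -0.4), (-0.6, 0.1), (-0.1, 0.6), (0.4, 1.01)]

nodes, edges = mapper_graph(points, lambda p: p[0], intervals, eps=0.2)
loops = independent_cycles(nodes, edges)
print(len(nodes), "nodes,", len(edges), "edges,", loops, "loop(s)")
```

With these settings the two end intervals each give one node and the two middle intervals each split into a top and a bottom cluster, so the 60 points are compressed into six nodes joined in a single ring. A partition of the same points into disjoint clusters would have no edges and therefore no way to record that ring.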

Topological Data Analysis can be used as a framework in conjunction with machine learning to understand the “shape” of complex data sets. It can also be used to study data where the elements themselves encode geometry, such as images and organic compounds.