Predictive Analytics | August 31, 2015

Topological Modeling and its Uses

BY Gunnar Carlsson

In earlier posts, we have talked about what TDA (topological data analysis) is and why it is a powerful tool. In this post, we will talk about specific ways it is used to get usable information from data.

We will use the specific part of TDA that constructs topological networks to reflect the shape of the data set. We will call this topological modeling, since it produces a compressed model for the data set.
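To make "topological network" concrete, here is a minimal sketch of a Mapper-style construction, which is one common way to build such networks. Everything in it is an assumption made for illustration rather than the method used for the figures in this post: a one-dimensional lens function, overlapping intervals of equal width, and single-linkage clustering at a fixed threshold `eps`. The names `mapper_network` and `single_linkage` are hypothetical.

```python
import math
from itertools import combinations

def single_linkage(points, members, eps):
    """Split `members` into connected components of the 'distance <= eps' graph."""
    clusters, unseen = [], set(members)
    while unseen:
        stack, cluster = [unseen.pop()], []
        while stack:
            j = stack.pop()
            cluster.append(j)
            near = [k for k in unseen if math.dist(points[j], points[k]) <= eps]
            unseen.difference_update(near)
            stack.extend(near)
        clusters.append(sorted(cluster))
    return clusters

def mapper_network(points, lens, n_intervals=4, overlap=0.3, eps=1.0):
    """Return (nodes, edges). Each node is a list of row indices;
    two nodes are joined by an edge when they share at least one row."""
    values = [lens(p) for p in points]
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_intervals or 1.0  # fall back to 1.0 if the lens is constant
    nodes = []
    for i in range(n_intervals):
        # Widen each interval on both sides so that neighbors overlap.
        a = lo + i * width - overlap * width
        b = lo + (i + 1) * width + overlap * width
        members = [j for j, v in enumerate(values) if a <= v <= b]
        nodes.extend(single_linkage(points, members, eps))
    edges = [(u, v) for u, v in combinations(range(len(nodes)), 2)
             if set(nodes[u]) & set(nodes[v])]
    return nodes, edges
```

Each node is a small cluster of data points, and the edges record which clusters overlap. This is the sense in which the network is a compressed model of the data set.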

Typically, when one thinks of modeling, one thinks of algebraic models, such as linear or logistic regression models. The outputs of such modeling are algebraic equations that can be used to predict or classify.

Topological modeling produces a topological network instead, and the goal of this post is to describe some of the things one can do with this kind of model. There are a few broad categories, which we now discuss.

Unsupervised Analysis/Taxonomy Generation

Since the network produces a shape representing the data, we can use our ability to recognize features of shapes to call attention to certain regions in our data. These regions can then be used to suggest taxonomies and hypotheses. One typical situation is the following.

[Figure 1: a topological network with three flares emanating from a central core]

In this case, we see some features in the network, namely three distinct “flares” emanating from a central core. Understanding the reason for the existence of these features is a very useful thing to do, because it reveals systematic but extreme behaviors among the data points. It suggests that we have a taxonomy consisting of three groups, namely the data points belonging to the nodes of each flare.

One example that comes out exactly this way is a simple diabetes data set from the 1970s, in which one finds that the taxonomy defined by the flares corresponds to the earlier taxonomy defined by physicians, whose groups are “healthy”, “pre-diabetic”, and “overt diabetic”. These characterizations by physicians are not used to construct the network model. Once groups are defined, it is important to understand the features that define them. In the diabetes example, one of the flares is characterized by the fact that most of the subjects in the nodes in that flare have very high blood glucose levels. This fact is easy to recover, and this kind of method allows one to know not only that there are three groups, but also what characterizes each group.
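One simple way to recover what characterizes a group is to compare each feature's average inside the group with its average outside it. The sketch below does this in units of the feature's overall standard deviation. The function name `characterize_group` and the scoring rule are assumptions chosen for illustration, and the group is assumed to be a non-empty proper subset of the rows.

```python
from statistics import mean, pstdev

def characterize_group(rows, feature_names, group):
    """Rank features by how far the group's mean lies from the mean of the
    remaining rows, measured in overall standard deviations."""
    group = set(group)
    scores = {}
    for c, name in enumerate(feature_names):
        column = [row[c] for row in rows]
        inside = [v for i, v in enumerate(column) if i in group]
        outside = [v for i, v in enumerate(column) if i not in group]
        spread = pstdev(column) or 1.0  # avoid dividing by zero for constant features
        scores[name] = (mean(inside) - mean(outside)) / spread
    # Largest absolute difference first.
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)
```

Given the rows belonging to one flare, the features at the top of this ranking are candidates for what defines that flare. In the diabetes example one would expect blood glucose to rank first for the flare described above.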

Hotspot Outcome-Based Analysis

One capability that is very useful in dealing with topological networks is the ability to color them by a particular variable, or other characteristic quantity of the data points. Remember that in the representation, the nodes in the network correspond to collections of data points, so typically one must color by the average value of the quantity. In dealing with data, one is often interested in a particular outcome, and coloring the nodes by the average value of that outcome is very useful in understanding the different ways in which the outcome is determined by the data. Here is an example.
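Coloring by the average value is a one-line computation once each node is stored as a list of data point indices. That storage layout and the name `color_nodes` are assumptions made for this sketch.

```python
def color_nodes(nodes, outcome):
    """Return one color value per node: the mean outcome over the node's data points.

    nodes   -- list of nodes, each a non-empty list of data point indices
    outcome -- outcome[i] is the quantity of interest for data point i
    """
    return [sum(outcome[i] for i in members) / len(members) for members in nodes]
```

A data point can belong to more than one node, in which case it contributes to the average of each node that contains it.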

[Figure 2: a topological network colored by an outcome, showing three red hotspots]

In this case, the data set could be a collection of macroeconomic indicators, and an outcome of interest could be correlation with revenue for a particular company or business unit within a company.

The coloring of the network for this quantity shows the presence of three red “hotspots”, each of which consists of a collection of nodes, which in turn corresponds to a collection of indicators for which the correlation with revenue is high. This information might now be used for constructing predictive models on small numbers of variables, perhaps by selecting one indicator from each of the three groups.
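Hotspots like these can also be picked out programmatically. The sketch below treats a hotspot as a connected group of nodes whose color is at or above a threshold, and then takes one indicator from each hotspot. The threshold, the rule of taking the indicator with the highest outcome value, and the names `hotspots` and `pick_representatives` are all assumptions. In practice the hotspots are often simply read off the picture.

```python
def hotspots(colors, edges, threshold):
    """Connected groups of nodes whose color is at least `threshold`."""
    hot = {n for n, c in enumerate(colors) if c >= threshold}
    neighbors = {n: set() for n in hot}
    for u, v in edges:
        if u in hot and v in hot:
            neighbors[u].add(v)
            neighbors[v].add(u)
    groups, unseen = [], set(hot)
    while unseen:
        stack, group = [unseen.pop()], []
        while stack:
            n = stack.pop()
            group.append(n)
            new = neighbors[n] & unseen
            unseen -= new
            stack.extend(new)
        groups.append(sorted(group))
    return sorted(groups)

def pick_representatives(groups, nodes, outcome):
    """From each hotspot, pick the data point with the highest outcome value."""
    picks = []
    for group in groups:
        members = {i for n in group for i in nodes[n]}
        picks.append(max(members, key=lambda i: outcome[i]))
    return picks
```

With three hotspots this returns three indicators, one per hotspot, which could then serve as the inputs to a small predictive model.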

Geometry-Based Feature Selection

When a data set is given as a data matrix, it is possible to build a network by regarding the columns instead of the rows as a data set. This makes the set of features into a space, and ultimately a network. There are many reasonable ways to impose a distance function or dissimilarity measure on the set of columns. 
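As one illustration of a dissimilarity measure on columns, the sketch below uses a correlation distance, 1 - |r|, where r is the Pearson correlation between two columns. This is only one of the many reasonable choices, and the function names are hypothetical. It assumes that no column is constant, since the correlation is undefined in that case.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (neither constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def column_distances(rows):
    """Distance matrix on the columns of a data matrix given as a list of rows."""
    columns = list(zip(*rows))  # transpose: each column becomes a data point
    return [[1.0 - abs(pearson(a, b)) for b in columns] for a in columns]
```

Under this measure, two features that are strongly correlated, positively or negatively, sit close together, so they tend to land in the same node of the feature network.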

There are now a number of ways to use the network structure as a heuristic for selecting a much smaller set of variables. One method is to landmark the network, i.e., to choose a set of nodes that is well distributed over the network.

In this case the red nodes are the landmarks.

A strategy for selecting features is now to choose one representative feature from each landmark node. Because the landmarks are evenly distributed, one expects every feature to be reasonably similar to one of the selected features.
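One way to obtain well-distributed landmarks is greedy farthest-point ("maxmin") selection, using the number of edges between nodes as the distance. The post does not say how the landmarks were chosen, so this is a sketch of one reasonable method rather than the one behind the figure. It assumes a connected network given as an adjacency mapping, and the rule of taking the first feature listed in each landmark node is a placeholder.

```python
from collections import deque

def hop_distances(adjacency, source):
    """Breadth-first search: number of edges from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def landmarks(adjacency, k, start=0):
    """Greedily pick up to k nodes, each as far as possible from those already chosen."""
    chosen = [start]
    nearest = hop_distances(adjacency, start)  # distance to the closest landmark so far
    while len(chosen) < min(k, len(nearest)):
        # Ties are broken in favor of the smallest node id.
        nxt = max(sorted(nearest), key=lambda n: nearest[n])
        chosen.append(nxt)
        for n, d in hop_distances(adjacency, nxt).items():
            nearest[n] = min(nearest[n], d)
    return chosen

def representative_features(nodes, chosen):
    """Take the first feature listed in each landmark node (an arbitrary choice)."""
    return [nodes[n][0] for n in chosen]
```

On a network shaped like a path with seven nodes, starting from one end, this picks the two ends and then the middle, which matches the intuition of even coverage.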

Another option is what one might call geometric selection, by which we mean choosing representative features from nodes based on shape features found in the network. For example, if the network is the three-way junction above, one might select one feature from each node at the tips of the flares. These tips are in a sense the most extreme parts of the space, and one might expect that they would provide the clearest description of the data set of rows.

[Figure 3]

One can also consider density in the set of features. In this case, there might be nodes that contain particularly dense points in the set of features, and these nodes could be detected by coloring by a density measure. Of course, there are many ways of constructing proxies for density, and we assume we have chosen one. If one finds that some features are much more dense than others, then it is likely that those features will contribute more to many analysis methods, and less dense features might not be visible. In this case, one might attempt to normalize the density by selecting one feature from each node, or one might even normalize the features by dividing by an appropriate proxy for density. Finally, as in the hotspot discussion above, one might find hotspots for correlation with an outcome of interest, and select one feature from each such hotspot to get a view of the data set that is well adapted to analysis of the outcome of interest.
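As an example of a density proxy, the sketch below scores each feature by the inverse of its mean distance to its k nearest neighbors in the set of features. The choice of proxy, the value of k, and the name `knn_density` are assumptions. The input is a square distance matrix on the features with at least two rows.

```python
def knn_density(distances, k=2):
    """Density proxy per feature: 1 / (mean distance to the k nearest other features)."""
    density = []
    for i, row in enumerate(distances):
        others = sorted(d for j, d in enumerate(row) if j != i)
        nearest = others[:k]
        # The small constant guards against division by zero for duplicate features.
        density.append(1.0 / (sum(nearest) / len(nearest) + 1e-12))
    return density
```

Averaging these values over the features in each node gives a density coloring of the network. High values mark crowded regions of the feature space.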

Ultimately, there are many ways in which topological network models can deliver usable information. The three we have outlined here are:

  • Unsupervised analysis/taxonomy generation
  • Hotspot outcome-based analysis
  • Geometry-based feature selection

Each approach will deliver slightly different insights, but all of them will deliver useful information about complex data sets.

Feel free to add a comment with some additional thoughts on other methods.

