In earlier posts, we talked about what TDA is and why it is a powerful tool. In this post, we will talk about specific ways it is used to get usable information from data.
We will use the specific part of TDA that constructs topological networks to reflect the shape of the data set. We will call this topological modeling, since it produces a compressed model for the data set.
When one thinks of modeling, one typically thinks of algebraic models, such as linear or logistic regression models. The outputs of such modeling are algebraic equations that can be used to predict or classify.
Topological modeling produces a topological network instead, and the goal of this post is to describe some of the things one can do with this kind of model. There are a few broad categories, which we now discuss.
Unsupervised Analysis/Taxonomy Generation
Since the network produces a shape representing the data, we can use our ability to recognize features of shapes to call attention to certain regions in our data. These regions can then be used to suggest taxonomies and hypotheses. One typical situation is the following.
In this case, we see some features in the network, namely three distinct “flares” emanating from a central core. Understanding why these features exist is very useful, because they reveal systematic but extreme behaviors within the data points. They suggest a taxonomy consisting of three groups, namely the data points belonging to the nodes of each flare.
One example that comes out exactly this way is a simple diabetes data set from the 1970s, in which the taxonomy defined by the flares corresponds to the earlier taxonomy defined by physicians, whose groups are “healthy”, “pre-diabetic”, and “overt diabetic”. These characterizations by physicians are not used to construct the network model. Once groups are defined, it is important to understand the features that define them. In the diabetes example, one of the flares is characterized by the fact that most of the subjects in the nodes in that flare have very high blood glucose levels. This fact is easy to recover, so the method tells one not only that there are three groups, but also what characterizes each group.
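One simple way to recover what characterizes a group is to compare each feature inside the group with the same feature in the rest of the data. The sketch below is only an illustration under assumptions: the function name, the toy numbers, and the scoring rule (difference of means divided by the overall standard deviation) are ours, not a method taken from the diabetes study.

```python
from statistics import mean, pstdev

def rank_features_for_group(rows, feature_names, group):
    """Rank features by how strongly a group of rows differs from the rest.

    Assumes both the group and its complement are non-empty.
    """
    group = set(group)
    ranking = []
    for j, name in enumerate(feature_names):
        column = [row[j] for row in rows]
        inside = [v for i, v in enumerate(column) if i in group]
        outside = [v for i, v in enumerate(column) if i not in group]
        spread = pstdev(column) or 1.0  # guard against constant columns
        ranking.append((name, (mean(inside) - mean(outside)) / spread))
    # Largest absolute standardized difference first.
    return sorted(ranking, key=lambda item: abs(item[1]), reverse=True)

# Made-up toy numbers, not the diabetes data.
feature_names = ["f0", "f1"]
rows = [[1.0, 5.0], [1.2, 4.0], [0.9, 6.0], [3.0, 5.0], [3.2, 5.5]]
flare_rows = [3, 4]  # hypothetical: row indices that fall in one flare

ranking = rank_features_for_group(rows, feature_names, flare_rows)
print(ranking[0][0])  # name of the feature that best separates the flare
```

A feature at the top of the ranking with a large positive score plays the role that blood glucose plays in the diabetes example.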
Hotspot Outcome-Based Analysis
One very useful capability when working with topological networks is the ability to color the nodes by a particular variable, or by another characteristic quantity of the data points. Remember that the nodes in the network correspond to collections of data points, so typically one colors each node by the average value of the quantity over its points. In dealing with data, one is often interested in a particular outcome, and coloring the nodes by the average value of that outcome is very useful for understanding the different ways in which the outcome is determined by the data. Here is an example.
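The averaging step can be sketched in a few lines. This is a minimal illustration, assuming the network is available as a mapping from node ids to the indices of the data points each node contains; the names `node_members` and `outcome` and the toy values are hypothetical.

```python
from statistics import mean

def color_nodes_by_mean(node_members, outcome):
    """Map each node id to the mean outcome of the data points it contains."""
    return {
        node: mean(outcome[i] for i in members)
        for node, members in node_members.items()
        if members  # skip empty nodes
    }

# Hypothetical toy network: node id -> indices of the rows in that node.
# A data point may belong to more than one node (row 1 does here).
node_members = {"a": [0, 1], "b": [1, 2, 3], "c": [4]}
outcome = [1.0, 3.0, 5.0, 7.0, 10.0]

node_color = color_nodes_by_mean(node_members, outcome)
```

The resulting values would then be mapped onto a color scale when the network is drawn.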
In this case, the data set could be a collection of macroeconomic indicators, and an outcome of interest could be correlation with revenue for a particular company or business unit within a company.
The coloring of the network for this quantity shows the presence of three red “hotspots”, each of which consists of a collection of nodes, which in turn corresponds to a collection of indicators for which the correlation with revenue is high. This information might now be used for constructing predictive models on small numbers of variables, perhaps by selecting one indicator from each of the three groups.
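One way to make "hotspot" precise is as a connected group of nodes whose color exceeds a threshold. The sketch below follows that reading and then picks the highest-scoring indicator from each hotspot. The threshold, the indicator names, and all numbers are made up for illustration, and the definition of a hotspot used here is our assumption.

```python
def find_hotspots(adjacency, node_color, threshold):
    """Return connected groups of nodes whose color is at least `threshold`."""
    hot = {n for n, c in node_color.items() if c >= threshold}
    seen, hotspots = set(), []
    for start in sorted(hot):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:  # depth-first search restricted to hot nodes
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(m for m in adjacency.get(node, ())
                         if m in hot and m not in component)
        seen |= component
        hotspots.append(component)
    return hotspots

def pick_representatives(hotspots, node_members, score):
    """Choose the highest-scoring indicator from each hotspot."""
    picks = []
    for component in hotspots:
        candidates = {i for n in component for i in node_members[n]}
        picks.append(max(sorted(candidates), key=score.get))
    return picks

# Toy network (a path of six nodes) with hypothetical indicators.
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
node_members = {0: ["ind_a", "ind_b"], 1: ["ind_b", "ind_c"], 2: ["ind_d"],
                3: ["ind_d"], 4: ["ind_e"], 5: ["ind_f"]}
# Made-up correlation of each indicator with revenue.
score = {"ind_a": 0.9, "ind_b": 0.8, "ind_c": 0.75,
         "ind_d": 0.1, "ind_e": 0.85, "ind_f": 0.2}
node_color = {n: sum(score[i] for i in m) / len(m)
              for n, m in node_members.items()}

hotspots = find_hotspots(adjacency, node_color, threshold=0.7)
picks = pick_representatives(hotspots, node_members, score)
```

Taking the top indicator per hotspot is only one possible rule; any representative from each group would serve the same purpose of keeping the variable count small.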
Geometry-Based Feature Selection
When a data set is given as a data matrix, it is possible to build a network by regarding the columns instead of the rows as a data set. This makes the set of features into a space, and ultimately a network. There are many reasonable ways to impose a distance function or dissimilarity measure on the set of columns.
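As one concrete instance, the sketch below transposes a small data matrix and uses one minus the absolute Pearson correlation as the dissimilarity between columns. This particular choice is an assumption made for illustration; the text leaves the measure open, and the toy matrix is made up.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (non-constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)  # fails on constant columns (zero variance)

def column_distances(rows):
    """Pairwise dissimilarity between columns: 1 - |correlation|."""
    columns = list(zip(*rows))  # transpose: each column becomes a data point
    k = len(columns)
    return [[1.0 - abs(pearson(columns[i], columns[j])) for j in range(k)]
            for i in range(k)]

# Toy data matrix: 4 rows (data points) by 3 columns (features).
rows = [[1, 2, 5],
        [2, 4, 3],
        [3, 6, 4],
        [4, 8, 1]]

D = column_distances(rows)
```

Here the first two columns are proportional, so their dissimilarity is essentially zero, while the third column sits farther from both. The matrix `D` is what a network construction on the set of features would take as input.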
There are now a number of ways to use the network structure as a heuristic for selecting a much smaller set of variables. One method would be to landmark the network, i.e. to choose a set of nodes that is well distributed over the network.
In this case the red nodes are the landmarks.
A strategy for selecting features is now to choose one representative feature from each landmark node. Because the landmarks are evenly distributed, one expects that every feature is reasonably similar to one of the selected features.
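A common heuristic for picking well-distributed points is farthest-point sampling, and it can be applied to the nodes of a network using hop counts as the distance. The sketch below is one possible way to landmark a network, not necessarily the one used to produce the figure; it assumes a connected network given as an adjacency mapping with sortable node ids.

```python
from collections import deque

def hop_distances(adjacency, source):
    """Breadth-first search: number of edges from `source` to each node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def farthest_point_landmarks(adjacency, k, start):
    """Greedily add the node farthest from all landmarks chosen so far."""
    landmarks = [start]
    nearest = hop_distances(adjacency, start)  # distance to closest landmark
    while len(landmarks) < k:
        candidate = max(sorted(nearest), key=nearest.get)
        if nearest[candidate] == 0:
            break  # every reachable node is already a landmark
        landmarks.append(candidate)
        for node, d in hop_distances(adjacency, candidate).items():
            if d < nearest.get(node, float("inf")):
                nearest[node] = d
    return landmarks

# Toy network: a path with seven nodes, 0 - 1 - ... - 6.
adjacency = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 6] for i in range(7)}

landmarks = farthest_point_landmarks(adjacency, k=3, start=0)
```

On this path the procedure picks the two ends and then the middle. One representative feature would then be drawn from each landmark node, for example at random or by closeness to the node's center.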
Another option is what one might call geometric selection, by which we mean choosing representative features from nodes based on structures found in the network. For example, if the network is the three-way junction above, one might select one feature from each node at the tips of the flares. These tips are in a sense the most extreme parts of the space, and one might expect that they would provide the clearest description of the data set of rows.
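For a network shaped like the three-way junction, a crude proxy for "tip of a flare" is a node with exactly one neighbor. The sketch below uses that proxy on a hypothetical Y-shaped network; real networks are rarely this clean, so the degree-one rule should be read as an assumption made for illustration.

```python
def flare_tips(adjacency):
    """Return nodes with exactly one distinct neighbor, in sorted order."""
    return sorted(node for node, neighbors in adjacency.items()
                  if len(set(neighbors)) == 1)

# Hypothetical Y-shaped network: a core node with three two-node arms.
adjacency = {
    "core": ["a1", "b1", "c1"],
    "a1": ["core", "a2"], "a2": ["a1"],
    "b1": ["core", "b2"], "b2": ["b1"],
    "c1": ["core", "c2"], "c2": ["c1"],
}

tips = flare_tips(adjacency)
```

One feature would then be selected from each of the tip nodes.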
One can also consider density in the set of features. In this case, there might be nodes that contain particularly dense points in the set of features, and these nodes could be detected by coloring by a density measure. Of course, there are many ways of constructing proxies for density, and we assume we have chosen one. If one finds that some features lie in much denser regions than others, then it is likely that those features will contribute more to many analysis methods, and features in less dense regions might not be visible. In this case, one might attempt to normalize the density by selecting one feature from each node, or one might even normalize the features by dividing by an appropriate proxy for density. Finally, as in the hotspot discussion above, one might find hotspots for correlation with an outcome of interest, and select one feature from each such hotspot to get a view of the data set that is well adapted to analysis of the outcome of interest.
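As an example of a density proxy, the sketch below uses the inverse of the mean distance to the k nearest neighbors, computed from a pairwise distance matrix on the features. Both the proxy and the toy distance matrix are assumptions chosen for illustration; many other proxies would serve.

```python
def knn_density(dist, k):
    """Density proxy per point: inverse mean distance to its k nearest others."""
    densities = []
    for i, row in enumerate(dist):
        others = sorted(d for j, d in enumerate(row) if j != i)
        mean_knn = sum(others[:k]) / k
        densities.append(1.0 / (mean_knn + 1e-12))  # avoid division by zero
    return densities

# Made-up distances between four features: 0, 1 and 2 are close together,
# while feature 3 is far from all of them.
dist = [
    [0.0, 0.1, 0.1, 0.9],
    [0.1, 0.0, 0.1, 0.9],
    [0.1, 0.1, 0.0, 0.9],
    [0.9, 0.9, 0.9, 0.0],
]

density = knn_density(dist, k=2)
```

Averaging these values over the features in each node gives a density coloring of the network, and the same values could be used as the divisor when normalizing features.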
Ultimately, there are many ways in which topological network models can deliver usable information. We have outlined three here:
- Unsupervised analysis/taxonomy generation
- Hotspot outcome-based analysis
- Geometry-based feature selection
Each approach will deliver slightly different insights, but all of them will deliver useful information about complex data sets.
Feel free to add a comment with some additional thoughts on other methods.