BY Gunnar Carlsson
When we analyze data with a particular goal in mind, we often think of situations where the data is labelled with the outcome we are interested in, i.e. that there is some ground truth in the data.
For example, if one is studying a data set in which the desired outcome is to identify fraud, then one would very much like to have the data points (which include a lot of information about persons or entities) to also include information about whether or not the particular person or entity committed fraud. When the label fraud/not fraud is given, one can apply various methods of supervised data analysis to obtain a predictor or classifier to automate or help to automate the task of fraud identification.
Unfortunately, the real world business problems are rarely that cooperative and quite often we are faced with the challenge of sparse or no ground truth when it comes to fraud. In these situations it is incredibly valuable to be able to accurately determine fraudulent data points. Even in those situations where some of the labels might be available (sparse ground truth), it is often to be expected that there will be new forms of fraud or other behavior that one wants to discover, even if they don’t fit into any previous patterns. This means that they will not be discovered by methods that attempt to match with already occurring phenomena for which there are labels.
Cyber security is a domain where the challenge of limited/no ground truth is particularly prevalent. While it is counterintuitive that one could discover fraud, cyberintrusion, etc. without having any labels indicating what is fraud or cyberintrusion there is an emerging set of approaches that can dramatically improve outcomes.
To see this, let’s consider the problem of discovering money laundering. This is a very complex problem, facing many banks and other financial institutions, and is of great concern to financial regulators.
There are many approaches used by money launderers. For example, one can break up a large amount of cash into much smaller amounts, which can then be deposited in banks or used to purchase money orders or cashier’s checks for the smaller amounts. Alternatively, the funds to be laundered can be merged with funds in cash intensive businesses, such as car washes or parking lots, and then deposited with the legitimate amounts coming from the business, thereby hiding the illegitimate funds. Another option is purchasing expensive objects, such as real estate or jewelry, and then selling it so as to produce a “legitimately” large amount of cash. These are some of the methods available, but there are many more.
As a financial institution discovers laundering methods they will attempt to search for customers who have collections of transactions that match each of the known methods. This requires that the financial institution formalize the methods involved (into rules or scenarios) and then be able to recognize methods that are variants on each of them. The implication of this rule-based approach is that the institution will not find new methods of laundering until a sufficiently long time passes at which point they can recognize the new method. At this point they would update the rules/scenarios.
This approach, often referred to as “hand-coding” rules is quite labor intensive and as we have seen – is generally ineffective.
There is another approach that concerns itself with the characteristics of the data, not their descriptions. It falls under the general heading of anomaly detection, and can automatically generate insights on the data that can be used to increase the likelihood of identifying launders.
Here is how it works in the aforementioned money-laundering example.
The financial institution first considers ALL of its customers using their behaviors as measured by their transactions with the banks, and perhaps with others outside the bank, to the extent such data is available. Using various analytic methods (including topological data analysis) the bank can determine which of the customers are anomalous, i.e. are not similar to large groups of other customers. Those anomalous customers will have an increased likelihood of being launderers.
Methods using TDA for performing this kind of anomaly detection have proved to be very effective.
Additional possibilities involve segmentation analysis of one’s customers into groups, and identifying those groups for which there has been an elevated number of investigations carried out. In this case, one is using the proxy of number of human investigations initiated for the actual ground truth.
Indeed, TDA delivered a 25%+ reduction in false positives while simultaneously identifying undetected launderers for one of the world’s largest banks using just this approach. This was, conservatively, worth $50M+ per year to the bank and was a function of TDA’s ability to handle unlabeled data.
Another situation where one might have unlabeled data is the following.
Considering measurements from an airplane flying across the Atlantic. There are numerous measurements that might be taken from the recorders in the airplane. For example, one can measure the thrust of the engines, the altitude, the angles made by various control surfaces (rudder, elevator, ailerons, flaps, spoilers). Suppose that we have each of these measurements (without time stamps) from a trip, and with no labeling of any kind concerning the state of the airplane at the time of the given measurement. We might want to construct a labeling or ground truth concerning each of the data points. A TDA analysis used on such measurements might look something like this.
The topological model is characterized as having a core and three flares, each of which correspond to a group in the data set. In order to understand the meaning of this structure, we can create the groups by selecting them, and then attempt to find the explanatory variables that best reflect the distinction between each of the flares and the core and the total data set.
For example, we might find that the variables characterizing flare (1) are the spoiler use, flap angle, and rate of change of altitude. From information about the behavior of the airplane, one could then conclude that flare 1 consists of data points obtained during landing.
Similarly, one might find that flare (2) is characterized by elevated change in altitude, and increased change in the use of the control surfaces, for example rudder, elevator, and ailerons. This would lead one to recognize this group as the data points occurring while flying at altitude in turbulent conditions.
Similarly, the core might have explanatory variables that included decreased change in the control surfaces, and the position of the flaps being at zero angle. This could lead one to conclude that they correspond to points collected while flying at altitude in non-turbulent conditions.
Finally, we might find that flare (3) is the take off group, which is characterized by significant change in altitude and a fixed angle of the flaps.
The topological model can be used now to create a labeling or ground truth, by assigning each data point to one of the four states we have just discussed. Of course, this is a very simple and trivial example, but it illustrates the process of label creation from topological models.
It is worth noting this approach has led to breakthroughs in healthcare as well. TDA was used to study Type 2 diabetes and asthma, to obtain a labeling consisting of various subtypes of the diseases. The understanding of these subtypes is crucial to the development of precision medicine.
For more information on how we can apply our technology to your unlabeled challenges drop us a note at firstname.lastname@example.org.