With heavy regulatory requirements and the pressure to cut costs, banks are increasingly turning to technology to increase efficiency and effectiveness.

A fine-grained understanding of customers and their activities is at the heart of several global banking functions.

For instance, Private Banking teams are trying to market the right financial product to the right customer at the right time. Risk Management professionals are attempting to identify fraud and money laundering cases before it’s too late. Consumer banking teams are trying to identify those clients that are likely to churn.

The analytic process in the majority of these use-cases begins with systematically segmenting customers along with their transaction activities to create subpopulations. Once segmented, models are built for each of the groups so that predictions can be made on out-of-sample transactions.

Current approaches make heavy use of static taxonomies and use simple rule-based approaches for modeling. This often results in coarsely defined segments and associated models which are neither effective (not enough true positives) nor efficient (too many false positives).

This problem is further exacerbated if the outcome of interest accounts for only a small percentage of the dataset. A prime example is fraud or money laundering, where occurrences are particularly infrequent.

To make this clearer, let’s walk through an example using a publicly available dataset from the University of California at Irvine, which can be found here. The dataset contains an anonymized sample of a European bank’s customer-level information, sometimes referred to as “know your customer” (KYC) data.

Each row is a unique customer, and the columns represent customer demographic, account, transaction-level and market condition attributes. This allows a customer to be compared with their peer group as well as with their own baseline metrics. In this dataset, the outcome variable is a binary label of whether the customer subscribed to a term deposit product.

In total, there are approximately 40K rows and over 60 columns after categorical transformations.
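The post does not say which categorical transformations were applied. One common choice that widens a table in this way is one-hot encoding, where each categorical value becomes its own 0/1 column. The following is a minimal sketch of that idea in plain Python; the field names and values are hypothetical and are not taken from the dataset's actual schema.

```python
def one_hot(rows, categorical):
    """Expand each categorical column into 0/1 indicator columns."""
    # Collect the distinct values observed for each categorical column.
    levels = {col: sorted({row[col] for row in rows}) for col in categorical}
    encoded = []
    for row in rows:
        # Keep non-categorical columns unchanged.
        out = {k: v for k, v in row.items() if k not in categorical}
        for col in categorical:
            for level in levels[col]:
                out[f"{col}={level}"] = 1 if row[col] == level else 0
        encoded.append(out)
    return encoded


# Hypothetical customers; field names are illustrative only.
customers = [
    {"age": 41, "contact": "mobile", "marital": "married"},
    {"age": 29, "contact": "landline", "marital": "single"},
    {"age": 53, "contact": "mobile", "marital": "divorced"},
]
encoded = one_hot(customers, categorical=["contact", "marital"])
print(sorted(encoded[0]))
```

Three original columns become six here (one numeric, two for `contact`, three for `marital`), which is how a modest number of raw attributes can grow well past the raw column count.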

Using Ayasdi, we’ll walk through the process of detecting related customer segments, surfacing an interpretable set of features that differentiate each segment and building simple predictive models using classifiers. The topological summary below represents the first step in the process and is a way of representing the relationships between entities in a transparent way.


Each node in this map represents a set of related customers and an edge represents a common customer between the respective nodes.
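To make the node-and-edge rule concrete, here is a small sketch that assumes the nodes have already been computed as sets of customer IDs. It only shows how edges follow from shared membership; it does not reproduce how Ayasdi forms the nodes, and the node names and IDs are hypothetical.

```python
from itertools import combinations


def build_edges(nodes):
    """Connect two nodes whenever they share at least one customer."""
    edges = []
    for (a, members_a), (b, members_b) in combinations(sorted(nodes.items()), 2):
        shared = members_a & members_b
        if shared:
            # Record the pair and how many customers they have in common.
            edges.append((a, b, len(shared)))
    return edges


# Hypothetical nodes: node id -> set of customer ids.
nodes = {
    "n1": {101, 102, 103},
    "n2": {103, 104},
    "n3": {104, 105, 106},
    "n4": {201, 202},  # shares no customers with the others
}
edges = build_edges(nodes)
print(edges)
```

In this toy input, `n1`, `n2` and `n3` chain together through shared customers, while `n4` has no edges and would appear as a separate component of the map.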

As you can see, there are distinct groups within the map, both in the main body (defined by flares) and in the distinct “islands” that surround it.

Color represents a value or label of interest, such as whether a customer has subscribed to a financial product or whether a customer has a particular cash balance.

In this case, the map is colored by customers who have subscribed to a term deposit. The areas in red represent particularly high concentrations of those customers, often referred to as “hotspots.” The flare region highlighted by the white circle is the area we will focus on. It is differentiated from the rest of the points by the following key categorical and continuous features.


Specifically, the customers in this group tended not to have subscribed to any products in the previous campaign and were contacted primarily by mobile phone at a time when the three-month Euribor rate and the consumer confidence index were significantly lower than historical levels.
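The post does not describe which statistics the software uses to surface these features. As a rough illustration of the idea, one simple option is to rank continuous features by how far the group's mean sits from everyone else's, scaled by the overall spread. The sketch below assumes that approach; the feature names and values are hypothetical.

```python
from statistics import mean, pstdev


def rank_features(rows, in_group, features):
    """Rank features by standardized mean difference: group vs. rest."""
    scores = {}
    for f in features:
        inside = [r[f] for r, g in zip(rows, in_group) if g]
        outside = [r[f] for r, g in zip(rows, in_group) if not g]
        spread = pstdev([r[f] for r in rows]) or 1.0  # avoid dividing by zero
        scores[f] = (mean(inside) - mean(outside)) / spread
    # Largest absolute difference first.
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Hypothetical data: the first two customers belong to the selected group.
rows = [
    {"euribor3m": 1.0, "age": 40},
    {"euribor3m": 1.2, "age": 35},
    {"euribor3m": 4.8, "age": 38},
    {"euribor3m": 4.9, "age": 37},
]
in_group = [True, True, False, False]
ranked = rank_features(rows, in_group, ["euribor3m", "age"])
print(ranked)
```

A strongly negative score for the rate feature means the group saw much lower values than the rest, while a score near zero for age means it does not distinguish the group.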

While this insight may not strike you as groundbreaking, it is a complex and nuanced finding. Furthermore, it took less than one minute to generate. There were no SQL queries and no iterative rounds of hypothesis generation to find the right combination of factors; that work is done by the software. There was no need to consider factors individually or to apply various filtering and sorting criteria.

At this point, a simple classifier model can be built using the identified groups and the features that define them. We built ours on a training set comprising 80% of the data, chosen by stratified random sampling. The remaining 20% of the data was held out for testing.
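Stratified sampling means the split is drawn separately within each outcome class, so the training and test sets keep the same share of subscribers as the full dataset. A minimal sketch in plain Python follows; the function name, seed and label counts are assumptions made for illustration, not details from the original analysis.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.8, seed=0):
    """Return (train_idx, test_idx), preserving label proportions."""
    by_label = defaultdict(list)
    for i, y in enumerate(labels):
        by_label[y].append(i)
    rng = random.Random(seed)
    train, test = [], []
    for idx in by_label.values():
        rng.shuffle(idx)  # randomize within each class
        cut = round(len(idx) * train_fraction)
        train.extend(idx[:cut])
        test.extend(idx[cut:])
    return sorted(train), sorted(test)


# Hypothetical labels: 1 = subscribed, 0 = did not (10% positive).
labels = [1] * 100 + [0] * 900
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

With a rare outcome, a purely random split can leave the test set with very few positive cases; splitting per class avoids that.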


The chart above shows a principled tradeoff between the resulting true positive and false positive rate based on tuning classification thresholds.
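For reference, the true positive rate is TP / (TP + FN) and the false positive rate is FP / (FP + TN), each computed for a given score threshold. The sketch below shows how moving the threshold trades one against the other; the scores and labels are made up for illustration and are not results from the model described here.

```python
def tpr_fpr(scores, labels, threshold):
    """True/false positive rates when flagging scores >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    positives = sum(labels)
    negatives = len(labels) - positives
    return tp / positives, fp / negatives


# Hypothetical classifier scores and true labels (1 = subscribed).
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

for t in (0.7, 0.5, 0.2):
    tpr, fpr = tpr_fpr(scores, labels, t)
    print(f"threshold={t}: TPR={tpr:.2f}, FPR={fpr:.2f}")
```

Lowering the threshold catches more true positives but also flags more customers who did not subscribe, which is the tradeoff a chart of this kind makes visible.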

While we did this for a term deposit, the same methods would work for use cases such as fraud or money laundering, where transactions or activities are constantly monitored and the models continually tuned.

The speed, comprehensiveness and accuracy of this method are a function of Ayasdi’s mastery of Topological Data Analysis (TDA). Acting as a framework for machine learning algorithms, TDA is particularly effective at problems involving segmentation. Without a high-quality segmentation model, a bank flags either too few or too many positive cases, and this has material implications in terms of heightened risk or heightened costs (frictional and employee).

Once developed, these models are deployed operationally. In an operational setting, the topological model is periodically retrained and used to classify new incoming customer transactions. The results could be fed into an existing system for flagging alerts or escalating transactions.

In future posts, we’ll show you examples of how existing models or rules engines can be used as additional features to drive the classification model for various segments. In addition, we’ll show how the outputs of such models can be consumed through a case management framework.