Exponential Change and Unsupervised Learning

Brian Hopkins of Forrester Research recently penned an excellent blog post about why companies are getting disrupted and why they realize it so late. The post draws from Ray Kurzweil’s Law of Accelerating Returns and speaks to the fact that the human brain doesn’t do well with exponential growth.

From Brian’s post:

“First, we assume change is linear and gradual; then, we adjust our projections as things accelerate; finally, in one doubling cycle, we go from OK to in trouble. What we don’t get is that the system has been exponential all along, quietly doubling or being cut in half. Then… wham. Disruption.”

We talk about this phenomenon at Ayasdi in terms of the number of possible questions that your data holds.

For small, low-dimensional datasets we feel in control; for larger or higher-dimensional datasets, however, we quickly find ourselves in the “exponential scramble” that Hopkins describes.

In truth, when it comes to data, the situation is actually far worse – which speaks directly to why so many CEOs feel they are not getting enough out of their data.

Here is why.

In almost every organization, data is growing at an exponential rate. This is only part of the problem. Just as important is the fact that the number of questions you can ask of the data grows at 2^n, where n is the number of variables in the data.

That means that as n itself grows exponentially, the number of questions/hypotheses in your data isn’t growing at an exponential rate but at a double-exponential rate.
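A quick back-of-the-envelope calculation makes the scale vivid. The sketch below is purely illustrative – it assumes a “question” corresponds to choosing a subset of variables to examine together, and that the variable count doubles each period:

    # Illustrative only: if the number of variables n doubles each period,
    # the number of variable subsets (candidate questions) is 2^n,
    # which grows double-exponentially.
    import math

    for period in range(5):
        n = 10 * 2 ** period             # variables double every period
        digits = int(n * math.log10(2))  # 2^n has roughly this many digits
        print(f"period {period}: {n:>4} variables -> about 10^{digits} candidate questions")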

We don’t make data scientists at an exponential rate, much less a double-exponential rate (which is good for the planet). As a result, we produce more data and get less and less from it – despite huge technical advances over the past decade. The response of throwing people at the problem (see anti-money laundering and other regulatory responses in financial services) is also insufficient – even with better tools.

The implication is simple: for enterprises to maximize what they get out of their data, they have to change the way they analyze it.

This is difficult to hear for many executives. They have invested heavily in technology and analytics over the past several years to become better utilizers of data. Those investments have made a difference – businesses do have greater data acuity. The problem is that these investments are all geared to deliver better questions:

  1. By allowing the enterprise to ask questions more rapidly (better performing databases)
  2. By allowing the enterprise to ask more questions (through SQL applications)
  3. By allowing more people in the enterprise to ask questions (through BI tools)

This approach, asking more questions faster, is fundamentally flawed in the face of the modern dataset.

The reason is this – the era of the question is dead.

Asking questions of a large dataset is a futile exercise. A subject matter expert with the best tools has a .00000000000000000000000000000000000000000000000001% chance of asking a question that generates a new insight. Even when their question does return something of value, it is, by definition, confirmatory. That is, they already knew what they were asking about (otherwise they could not have formed the question).

This process – developing the question, asking it, waiting on the response, interpreting the result – is time-consuming, specialized, and riddled with opportunities for bias. All the while, more data flows in, complicating matters further.

So what replaces the era of the question? Interestingly enough, answers.

Unsupervised learning is a field of artificial intelligence that dispenses with the inefficiencies of the traditional query. Unsupervised learning finds patterns, structure or anomalies in the data without the need for labels or even a preconceived notion of what the data holds.

Unsupervised learning lets the data tell you what secrets it holds.
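To make “no labels, no preconceived question” concrete, here is a minimal sketch using k-means clustering from scikit-learn. The data, variable count, and choice of k are illustrative assumptions, not a description of any particular product:

    # Minimal sketch: surface structure in unlabeled data with no hypothesis.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10_000, 20))  # 10,000 unlabeled records, 20 variables

    # No labels and no question are supplied; the algorithm groups similar
    # records on its own.
    model = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

    # Each cluster is a candidate "answer": a segment the expert can now
    # interrogate, rather than a hypothesis they had to invent first.
    for label in range(model.n_clusters):
        size = int((model.labels_ == label).sum())
        print(f"cluster {label}: {size} records")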

The result is that the first time a subject matter expert interacts with the data, they already have answers about what is meaningful in it. These answers are the starting point for further investigation – so there are still questions, but their starting point is far more principled than the mega-millions-lottery odds of the query-first approach so many organizations take.

The business implications for enterprises are profound:

  1. Their best assets (subject matter experts) are far more efficient
  2. They find the unknown unknowns in their data that (often biased) questions never uncover
  3. They can extract the maximum value from their data assets

Unsupervised learning is more accessible than most companies think and, frankly, undervalued in a world where the most valuable information often comes without labels – confounding most of what passes for machine learning.

While the field of unsupervised learning has only a few serious players (Google, Amazon, Facebook, Microsoft, ourselves), the movement toward application-based AI has put this capability into the hands of everyone from the analyst to the CEO.

With applications as the container for these capabilities, discovery happens automagically for the business user (while all of the semi-supervised and supervised capabilities remain available to data science teams). Furthermore, unsupervised does not mean opaque. Using technologies like topological data analysis, these powerful unsupervised methods can be made entirely transparent and justifiable.
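For readers curious what that transparency looks like mechanically, below is a heavily simplified sketch of the Mapper construction that underlies much of topological data analysis. The lens, interval count, overlap, and use of DBSCAN are all illustrative assumptions:

    # Simplified Mapper sketch: project data through a lens, cover the lens
    # range with overlapping intervals, cluster within each interval, and
    # connect clusters that share points. The resulting graph is a small,
    # inspectable summary of the dataset's shape.
    import numpy as np
    from sklearn.cluster import DBSCAN

    def mapper_sketch(X, n_intervals=8, overlap=0.25):
        lens = X.mean(axis=1)               # toy lens: mean of each row
        lo, hi = lens.min(), lens.max()
        width = (hi - lo) / n_intervals
        nodes = []                          # each node: a set of row indices
        for i in range(n_intervals):
            a = lo + i * width - overlap * width
            b = lo + (i + 1) * width + overlap * width
            idx = np.where((lens >= a) & (lens <= b))[0]
            if len(idx) == 0:
                continue
            labels = DBSCAN(eps=1.0, min_samples=3).fit_predict(X[idx])
            for lab in set(labels) - {-1}:  # -1 marks DBSCAN noise
                nodes.append(set(idx[labels == lab]))
        edges = [(i, j) for i in range(len(nodes))
                 for j in range(i + 1, len(nodes)) if nodes[i] & nodes[j]]
        return nodes, edges

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 0.5, (200, 3)), rng.normal(3, 0.5, (200, 3))])
    nodes, edges = mapper_sketch(X)
    print(f"{len(nodes)} nodes, {len(edges)} edges")

Because every node is just a named set of records, an expert – or a regulator – can trace exactly why any group in the summary exists.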

The results are already in the market:

  1. A financial institution used unsupervised learning to distill the complexity of a massive, global capital markets organization into something a regulator could understand, in a fraction of the time and manpower required by traditional methods.
  2. A leading hospital analyzed billions of datapoints to impute the optimal way to perform dozens of surgical procedures – reaping tens of millions of dollars in savings while delivering superior care to its patients.
  3. A global bank reduced the false positives associated with its anti-money laundering (AML) operation by 26% – resulting in more than $50M in savings.
  4. A massive government contractor used unsupervised learning to identify programs that would go “red” up to six months in advance – allowing it to remedy situations proactively rather than reactively.

There are many dozens of such examples. Once these organizations moved past the requirement that they ask the questions, they found answers – answers to questions they would never have thought to ask, but which held huge value for the business.

The future of enterprise-class AI depends on a number of elements, but the ability to discover insight from data without explicit instructions is paramount. In a world of exponential data growth and double-exponential hypothesis growth, this capability will define competitiveness in the petabyte economy.