The Future of Machine Learning is an incredible topic worthy of thick books. The challenge is that, because the area is moving so quickly, information becomes obsolete in the blink of an eye. O’Reilly Media solved that problem by publishing an excellent ebook on the subject featuring interviews with leading practitioners. At 77 pages, it is airplane reading. To see the whole book, hit the jump here: screaming value in exchange for your email address.

We did want to post Gurjeet Singh’s interview and have included it below. His enthusiasm for the current state of play is palpable and is reflected in his other recent posts, particularly his piece on the Mainstreaming of Machine Learning.

Without further ado…

Let’s get started by talking about your background and how you got to where you are today.

I am a mathematician and a computer scientist, originally from India. I got my start in the field at Texas Instruments, building integrated software and performing digital design. While at TI, I got to work on a project using clusters of specialized chips called digital signal processors (DSPs) to solve computationally hard math problems.

As an engineer by training, I had a visceral fear of advanced math. I didn’t want to be found out as a fake, so I enrolled in the Computational Math program at Stanford. There, I was able to apply some of my DSP work to solving partial differential equations and demonstrate that a fluid dynamics researcher need not buy a supercomputer anymore. They could just employ a cluster of DSPs to run the system. I then spent some time in mechanical engineering building similar GPU-based partial differential equation solvers for mechanical systems. Finally, I worked in Andrew Ng’s lab at Stanford, building a quadruped robot and programming it to learn to walk by itself. Then one day I saw a note from my advisor, Gunnar Carlsson, describing how he was applying topology to explain real data sets.

He explained how topology could be applied equally well to four or five very distinct and interesting problem areas. That was really exciting, and I started working with him on the topic. The project was an academic success and DARPA (the Defense Advanced Research Projects Agency) asked us to commercialize our research and start a company. That’s how we started Ayasdi.

Can you tell us about the evolution of topology, broadly speaking, and share some insights as to why it is so useful for unifying disparate areas in machine intelligence?

Topology is a very old branch of mathematics. It was developed in the 1700s by mathematicians like Euler, originally to quantify the qualitative aspects of algebraic equations. For example, if you have the equation for a circle, topology is the area of math that allows you to say, “Oh, a circle is a single connected thing; it divides the plane into an inside and an outside; and it has a simple connectivity structure.” Over the course of its development over the last 300 years, it has become the study of mapping one space into another.

For example, there are two large classes of machine learning algorithms. There are supervised machine learning algorithms and the unsupervised ones. Furthermore, within supervised algorithms, there are two types: algorithms that take an input vector to predict a number, and algorithms that take a vector to produce a class label.

On the unsupervised side, there are two distinct methods. What unifies these four distinct functions is that they all produce functional mappings from an input space to an output space. The built-in formalism of topology allows you to learn across different types of functions. So if you want to combine the results of these various learning algorithms, topology allows you to do that while still maintaining guarantees about the underlying shape or distributions.

That’s the first critical insight.

The second insight is that, basically, all machine learning algorithms solve optimization problems. The machine learning algorithm assumes a particular shape of the data for each problem. Then the optimization procedure finds the best parameters that make the data look like that shape. Topology does the reverse. Topology, even though it utilizes all these machine learning algorithms under the covers, allows you to discover the underlying shape of the data so that you don’t have to assume it.

What are some of the key concepts around the application of topology to machine learning?

It’s very simple. There is only one key idea: data has shape, and shape has meaning. In standard machine learning, the shape of the data is usually an afterthought. Topology puts the shape front and center, i.e., as being the most important aspect of your data.

What are the real world applications of this technology? Why is this important?

Today, we’re awash in data. Machine learning algorithms were developed as a methodology to extract value from increasingly large and complex datasets. However, there are now many algorithms from which to choose. The incomplete or incorrect application of machine learning algorithms can lead to missing or even erroneous conclusions.

Topology addresses this issue of increasing data complexity by comprehensively investigating your dataset with any algorithm or combination of algorithms, and it presents an objective result (i.e., no information loss).

Using a topological approach, what does a typical investigation look like?

One huge benefit of using topology is that you don’t have to presuppose a library of shapes. You don’t have to say, “Okay, I know what a circle looks like. A circle is our prototype now.” Topology represents your underlying data in a combinatorial form. It constructs a network in which every node in said network contains a subset of your data, and two nodes are connected to each other if they share some data.

If you think about it from a tabular perspective, you feed it your table, and the output is this graph representation in which every node is a subset of the rows. But a row can appear in more than one node, and whenever that happens you connect them. This very simple structure has two huge advantages. The first is that irrespective of the underlying machine learning algorithms that have been combined in a particular investigation, the output will always look like this graph. The second is that this network form is very computable; you can easily build things on top of it, like recommender systems, piecewise linear models, gradient operators, and so on.
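The network described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Ayasdi’s implementation: the function name build_overlap_graph, the node labels, and the row subsets are all made up for the example.

```python
from itertools import combinations


def build_overlap_graph(nodes):
    """Connect two nodes whenever they share at least one row.

    `nodes` maps a node label to the set of row indices it contains.
    Returns a list of (label_a, label_b) edges in sorted label order.
    """
    edges = []
    for a, b in combinations(sorted(nodes), 2):
        if nodes[a] & nodes[b]:  # non-empty intersection means shared rows
            edges.append((a, b))
    return edges


# Hypothetical subsets of a six-row table (rows 0..5).
# Row 2 appears in A and B; row 4 appears in B and C.
nodes = {
    "A": {0, 1, 2},
    "B": {2, 3, 4},
    "C": {4, 5},
}
edges = build_overlap_graph(nodes)
print(edges)
```

Whatever algorithms produced the subsets, the output has the same form: a set of nodes and a list of edges, which is what makes it easy to build on.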

Can you generalize that to another example, where the shape is not necessarily a circle?

Imagine that you had the letter Y on graph paper, and you’re sampling data from it. Clustering the raw data doesn’t make sense, because you’ll recover a single cluster—if you’re lucky. If you want to build a regression on it, that’s also wrong, because the data is nonlinear.

Imagine you use the centrality function to reduce the dimensions. So for every point on the Y, you measure the sum of its distance to every other point on the Y. The value of the function at the joining point in the middle of the Y will be low, because all those points are central. The tips of the Y will be high, because they’re far from everything else. Now, if you merge your dimensionality reduction function with clustering, then in the low range you get a single cluster, because it’s the middle of the Y. As you go out of that middle range, you start seeing three clusters, because those are the spokes of the Y.
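Here is a minimal Python sketch of the Y example, under stated assumptions: a symmetric Y with three unit-length arms, a lens range cut into four overlapping intervals, and a simple "closer than eps" grouping standing in for the clustering step. The function names and all parameter values (points_per_arm, n_intervals, overlap, eps) are choices made for this illustration, not part of any particular product.

```python
import math
from itertools import combinations


def sample_y(points_per_arm=20):
    """Junction point plus evenly spaced points on three unit-length arms."""
    directions = [(0.0, -1.0), (math.sqrt(3) / 2, 0.5), (-math.sqrt(3) / 2, 0.5)]
    points = [(0.0, 0.0)]
    for dx, dy in directions:
        for k in range(1, points_per_arm + 1):
            t = k / points_per_arm
            points.append((t * dx, t * dy))
    return points


def centrality_lens(points):
    """Sum of distances to every other point, rescaled to the range [0, 1]."""
    raw = [sum(math.dist(p, q) for q in points) for p in points]
    lo, hi = min(raw), max(raw)
    return [(v - lo) / (hi - lo) for v in raw]


def link_clusters(indices, points, eps):
    """Group indices into connected components of the 'closer than eps' graph."""
    remaining = set(indices)
    clusters = []
    while remaining:
        seed = remaining.pop()
        cluster, stack = {seed}, [seed]
        while stack:
            i = stack.pop()
            near = {j for j in remaining if math.dist(points[i], points[j]) < eps}
            remaining -= near
            cluster |= near
            stack.extend(near)
        clusters.append(frozenset(cluster))
    return clusters


def mapper(points, lens, n_intervals=4, overlap=0.3, eps=0.08):
    """Cluster within overlapping lens intervals; link clusters sharing points."""
    width = 1.0 / n_intervals
    nodes = []  # each node is (interval index, frozenset of point indices)
    for i in range(n_intervals):
        lo = i * width - overlap * width
        hi = (i + 1) * width + overlap * width
        members = [idx for idx, v in enumerate(lens) if lo <= v <= hi]
        for cluster in link_clusters(members, points, eps):
            nodes.append((i, cluster))
    edges = [
        (a, b)
        for a, b in combinations(range(len(nodes)), 2)
        if nodes[a][1] & nodes[b][1]
    ]
    return nodes, edges


points = sample_y()
lens = centrality_lens(points)
nodes, edges = mapper(points, lens)
clusters_per_interval = [sum(1 for i, _ in nodes if i == k) for k in range(4)]
print(clusters_per_interval)
```

With these settings, the lowest lens range should give a single node (the middle of the Y) and each higher range should give three (the spokes), so the resulting network is itself Y-shaped.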

Is it fair to generalize that when performing a topological investigation, the first order of business is using some form of dimensionality reduction algorithm?

That is correct. Once you reduce the data, compact it, and get the benefit of being cognizant of the topology, you’re able to maintain the shape while uncovering relationships in and among the data.

Basically, different dimensionality reduction algorithms will illuminate different aspects of the shape.

Is it just a matter of throwing all of these algorithms up against the wall to see what’s interesting?

Yes, essentially, you throw as many of these functions at the data as you can. These are usually computationally expensive functions, e.g., Isomap. Topology allows you to compare across them very smoothly. You can discover statistically significant differences in your data in an algorithmic way.

The machinery allows you to do that very beautifully.

Given you can map the same data into different views/representations, is there a single view that’s analytically superior towards the goal of understanding any particular problem?

You have to be careful. You don’t necessarily want a single view. There is no single right answer, because different combinations of these algorithms will produce different types of insights in your data. They are all equally valid if you can prove their statistical validity.

You don’t want to somehow confine yourself to a single right answer. You want to pull out statistically significant ideas from all of these.

Do the insights or results across different views of the same data ever contradict each other?

In fact, one of the things that’s beneficial in our approach is that these algorithms are correlated with each other. In many cases, you find the evidence for the same phenomena over and over again across multiple maps. Wherever that happens, you can be more confident about what you found.

So in other words, features that persist across different views or mappings are more significant?

One of the areas of topology that is especially interesting to this discussion is homology. Persistent homology essentially talks about the stability of features that you discover using topological methods.

You can imagine in many machine learning settings, you have these algorithms that are parameterized in various ways. You somehow have to say, “Okay, this is the set of parameters that I’m going to choose.” You can imagine in all of those settings, it’s very helpful to have tools that tell you the stability range of these parameters, i.e., that across this or that range they are going to be stable.

Imagine if you stare at a circle from a long distance, from far enough away, you might conclude that a circle is just a dot. So you have to ask, “Over what range of distances do I call a circle a circle?” And this generalizes to other shapes and the various resolutions in which they can be viewed. There’s a really interesting body of research around this. In fact, some parts of this work are also used at Ayasdi (in our code base), but we don’t expose it.
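The stability question can be illustrated with a deliberately simplified Python sketch. It only counts connected groups of points on a line at each scale (the simplest, zero-dimensional case), not loops such as a circle, and it is not the code used at Ayasdi. The data, the scale grid, and the function names are assumptions made for the example.

```python
def component_count(points, scale):
    """Number of groups when 1-D points no farther apart than `scale` are linked."""
    ordered = sorted(points)
    gaps = (b - a for a, b in zip(ordered, ordered[1:]))
    return 1 + sum(1 for gap in gaps if gap > scale)


def stable_runs(points, scales):
    """Collapse consecutive scales with the same count into (count, first, last)."""
    runs = []
    for scale in scales:
        count = component_count(points, scale)
        if runs and runs[-1][0] == count:
            runs[-1] = (count, runs[-1][1], scale)
        else:
            runs.append((count, scale, scale))
    return runs


# Hypothetical data: two tight groups of three points, far apart.
points = [0, 1, 2, 50, 51, 52]
scales = [s + 0.5 for s in range(60)]  # 0.5, 1.5, ..., 59.5

runs = stable_runs(points, scales)
most_stable = max(runs, key=lambda run: run[2] - run[1])
print(runs)
print(most_stable)
```

The answer "two groups" survives across by far the widest range of scales, which is the sense in which a feature that persists is more trustworthy than one that appears only at a single parameter setting.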

Looking ahead, what would you consider the most exciting developments in machine intelligence? Is persistent homology the kind of thing you would tell folks to look at, either inside or outside of topology?

This is the golden age of machine learning. There is so much interesting work going on.

We’ve turned a corner; in the past, people working in the field tended to be married to a particular method.

Now, all of a sudden, people are open to new things. For example, all through the 1980s there was a focus on logistic regression, and nobody wanted to do anything else. By the 2000s, the focus had shifted to support vector machines (SVMs) and, again, nobody wanted to do anything else. These days, the whole field seems to have matured.

Everybody is open to different points of view.

I think there’s a lot of interesting work going on in feature engineering. That’s interesting, because on the one hand, we have this whole deep learning core process. So some will tell you that we don’t need feature engineering. But on the other hand, everybody who does feature engineering with deep learning produces much better results.

In topology more specifically, the exciting news is that we now have a few things that work. And we are on the cusp of attaining a theoretical understanding of why that happens; that is, why the things that work—work. When we understand that, we can begin to evolve it.

These are indeed exciting times!