As we kick off 2016 and prepare for the Presidential stretch run, we thought it prudent to examine how Topological Data Analysis can find patterns and insights in survey data. In particular, we are focused on extracting subtle relationships that don’t present themselves willingly, but are valuable to understand.
In this installment, we are going to look at how Topological Data Analysis of the expansive World Values Survey may provide insight into Donald Trump’s surprising and sustained lead among the Republican primary contenders. We will be proposing scenarios about how candidate preferences would correlate with our findings from the World Values Survey data, but since the survey has no information about candidate preferences, we cannot say conclusively that our theory is correct. It would be extremely interesting to carry out a survey that combines a small number of questions from the World Values Survey with standard political polling to get a firmer picture of the relationship.
The World Values Survey Association (http://www.worldvaluessurvey.org/wvs.jsp) is an organization of social scientists who have developed and carried out a series of surveys concerning values over the last 30 years. The surveys have been carried out in a collection of six “waves”, the most recent one occurring in the time period 2010-2014.
The surveys consist of 250 questions, and are typically carried out in person, although for some remote participants they are done by telephone. A detailed discussion of the methodology is available at the website.
The questions vary in the nature of the response. Answers are given as integers in a fixed range, but that range may be between 1 and 2, between 1 and 4, and in some cases between 1 and 10. In this post, we will be discussing a family of questions regarding the trust a participant holds for various societal institutions. The answer is given in terms of an integer between 1 and 4, with 1 indicating a great deal of trust and 4 indicating no trust at all.
In this analysis, we consider 11 questions, concerning the following institutions:
- armed forces
- the press
- labor unions
- the police
- the courts
- federal government
- political parties
- the civil service
The data set can therefore be viewed as a spreadsheet with one column per question and an entry of 1, 2, 3, or 4 for each row in each column. We are considering the results from participants in the United States, of which there are slightly more than 2,000.
This is a small data set by any measure.
We will see, however, that the data set has an interesting and complex structure. Complexity is an informal measure on a data set, not determined by either the number of columns or the number of rows. To communicate a feel for this notion, let’s look at a data set which has a small number of both columns and rows.
This data set has 31 rows (each row corresponds to one of the red points above) and 2 columns, since each point is defined by its x- and y-coordinates. This data set is complex because (1) algebraic methods will find it very difficult to approximate it by sets of equations and formulas and (2) clustering methods cannot approximate the data set well, since it is an approximation of a connected object.
It is data sets such as this one that are the motivation for the development of topological modeling. Topological networks can capture this shape readily.
The World Values dataset has similar complexity, which does not readily reveal its secrets to regression or clustering, much less a pivot table.
The analysis was performed using variance normalized Euclidean distance, and the (two) lenses were the first two coordinates in a Principal Component Analysis. For both lenses, the resolution is 30 and the gain is 3.0.
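To make these choices concrete, here is a minimal sketch in Python of the three ingredients named above: variance normalization, the two PCA lenses, and an overlapping cover of each lens. It is an illustration under stated assumptions, not the software used for this analysis. The function names are hypothetical, the data is a random stand-in for the survey answers, and the gain convention (each interval is `gain` times as wide as the spacing between interval centers) is an assumption that may differ from the one used by a given tool.

```python
import numpy as np

def variance_normalize(X):
    """Scale each column to unit variance, so that plain Euclidean distance
    on the result is variance normalized Euclidean distance on X."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled
    return X / std

def pca_lenses(Z, n_lenses=2):
    """Project the rows of Z onto the first principal components."""
    centered = Z - Z.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_lenses].T

def cover_intervals(values, resolution=30, gain=3.0):
    """Cover the range of one lens with `resolution` overlapping intervals.
    Assumed convention: interval width = gain * spacing between centers."""
    lo, hi = values.min(), values.max()
    step = (hi - lo) / resolution
    centers = lo + step * (np.arange(resolution) + 0.5)
    half_width = gain * step / 2.0
    return np.column_stack([centers - half_width, centers + half_width])

# Synthetic stand-in for the survey: 2,000 respondents, answers in 1..4.
rng = np.random.default_rng(0)
answers = rng.integers(1, 5, size=(2000, 11))

lenses = pca_lenses(variance_normalize(answers))
covers = [cover_intervals(lenses[:, k], resolution=30, gain=3.0) for k in range(2)]
print(lenses.shape, covers[0].shape)
```

Under this convention, a gain of 3.0 means that neighboring intervals share two thirds of their length. A full topological model would then cluster the respondents falling in each pair of intervals and connect clusters that share respondents; that step is omitted here.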
The analysis should be viewed as a study of personalities based on their confidence response to various institutions.
Here is a picture of the topological model based on these choices.
It is apparent from a visual interrogation of the model that there are distinct groups, generally positioned at the corners. Thus, one way to utilize this network is to select groups around the corners, and find the explanatory variable for each of these groups. This will clarify the nature of the mapping of the data.
In the diagram below, we assign to each of the four groups above a node (recognizing, of course, that each node is, in turn, represented by a dozen or so smaller nodes containing hundreds of respondents), and connect them as they are connected above.
We have included in each node three of the institutions that are most characteristic of the corresponding population. The text is colored green if the approval for the institution is elevated, and red if it is lower than the average value over the data set.
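One simple way to pick out characteristic institutions and assign the green and red labels is sketched below. This is an assumption about how such a ranking could be done, not the method used to produce the diagram: the function name and the scoring (shift of the group mean from the overall mean, in units of the overall standard deviation) are hypothetical, and the data is a toy example. Because 1 denotes a great deal of trust and 4 denotes none, a group mean below the overall mean counts as elevated confidence.

```python
import numpy as np

def characteristic_features(X, in_group, names, top=3):
    """Rank columns by how far the group mean sits from the overall mean,
    in units of the overall standard deviation (hypothetical scoring)."""
    X = np.asarray(X, dtype=float)
    overall_mean = X.mean(axis=0)
    overall_std = X.std(axis=0)
    overall_std[overall_std == 0] = 1.0  # avoid division by zero
    shift = (X[in_group].mean(axis=0) - overall_mean) / overall_std
    order = np.argsort(-np.abs(shift))[:top]
    # Lower answers mean more trust (1 = a great deal, 4 = none at all),
    # so a negative shift is elevated confidence ("green").
    return [(names[i], "green" if shift[i] < 0 else "red") for i in order]

# Toy data: six respondents, three institutions; first two rows form the group.
names = ["armed forces", "the press", "the police"]
X = np.array([[1, 4, 1],
              [1, 4, 2],
              [3, 2, 3],
              [4, 1, 4],
              [3, 2, 3],
              [4, 2, 4]])
in_group = np.array([True, True, False, False, False, False])
labels = characteristic_features(X, in_group, names)
print(labels)
```

In the toy data, the selected group trusts the armed forces and the police more than average and the press less, so the first and third come out green and the second red.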
For each of the populations, we have indicated a small number of the institutions that effectively distinguish the population from the remaining respondents. However, in examining the entire list of features (i.e. institutions) that differentiate the nodes, there is more to say.
- Group A has little trust in any institution. While it would be easy to call this liberal, it is more than that. This is the extreme left of the political spectrum, characterized by a 1960s-esque distrust of institutions. This group does not even have high confidence in institutions that are often supported by the left, such as environmental organizations or the United Nations. In all cases, their average confidence in the institution is below the overall average. While one would likely see these as Nader types, it is not a leap to find these distrusting voters in the Trump camp – and indeed, one does seem to find these outsiders in the background of his rallies.
- Group B has a high degree of trust in most institutions. It is weaker for the police, the armed forces, and the church, but the confidence is generally high. Group B is also in the left wing, but has broad confidence in institutions, including large companies and banks. They would include the so-called “limousine liberals”, but this group also counts members of labor unions among its ranks. For the purposes of discussion, we are going to label this group as L3 for limousine liberal & labor. Trump is probably not pulling heavily from the limousine liberal segment of this group, but there are elements of his platform that do appeal to labor – namely his positions on trade policy and free trade agreements.
- Group C has a high degree of confidence in the police, the armed forces, and the churches, and a lower degree of confidence in other institutions than our L3 types. These are true “law and order” types and as such that is what we will call them. Trump pulls from this group, but not disproportionately – he has plenty of competition from the rest of the field for these “core” Republicans.
- Group D has a low degree of confidence in most institutions, but has elevated confidence in churches. While the church is the primary institution in which they have confidence, they are not as cynical as the Naderites. This group would be defined by the libertarian/evangelical wing of the Republican party. Here again, Trump draws, but not disproportionately.
What is evident, however, is that Trump can pull from all four of these camps – despite where they might fall on the “traditional” political spectrum.
Let’s look a little closer at some of the delineation lines in the data.
In general, movement from the upper right to the lower left corresponds to one’s overall degree of confidence in societal institutions. Let’s look at our re-labeled diagram:
We can become more precise in our characterization of political leanings with regard to placement in this taxonomy. While the survey provides relatively little in terms of political identification, it does have one question that asks the respondent to give their placement in a left/right spectrum, by selecting a number between 1 and 10. Small numbers indicate leftward leanings. Here is the coloring of the same network using the response to this question.
It is apparent that there is a movement from blue in the upper left to red in the lower right, corresponding to the movement from left to right on the political spectrum. Trump is not constrained to the lower right; he draws disaffected voters from across the spectrum. This could speak to why he is doing well in national polls.
Of course, if one had more detailed political information, such as the respondent’s choice of candidate in a particular election, one might very well expect to find more granular information, particularly if there are more than two candidates. Another very interesting kind of data would be geographical location of the residence of the respondent, which could also be used with this model.
The structure of the model suggests an additional direction, namely from the upper right to the lower left, which can be characterized as the degree to which someone is part of the establishment.
A proxy for that notion would be the aggregate confidence in institutions. A quantitative version of this would be the sum of the confidence ratings for all the institutions in the list. Here is the coloring of the topological model by the average value of that quantity:
Note that this quantity varies with upper right to lower left motion, i.e. that the confidence in the establishment increases as we move from upper right to lower left. Here again, we find that Trump has the capacity to draw from a larger group than his competitors. Red, Orange and even Yellow are his playground.
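The aggregate score and its per-node average can be sketched in a few lines of Python. This is an illustration only: the function names and the toy node memberships are hypothetical, and in a real model the memberships would come from the topological network itself. Note that with the survey coding (1 = a great deal of trust, 4 = none at all), a larger sum means lower confidence.

```python
import numpy as np

def aggregate_confidence(X):
    """Sum of the confidence answers per respondent.
    With 1 = a great deal of trust and 4 = none, larger sums mean less trust."""
    return np.asarray(X).sum(axis=1)

def node_averages(values, nodes):
    """Average a per-respondent quantity over each node.
    `nodes` maps a node id to the row indices it contains; nodes may overlap."""
    return {node: float(np.mean(values[idx])) for node, idx in nodes.items()}

# Toy data: four respondents, three institutions.
X = np.array([[1, 1, 2],
              [2, 2, 2],
              [4, 4, 3],
              [4, 3, 4]])
score = aggregate_confidence(X)

# Hypothetical node memberships; neighboring nodes share respondents.
nodes = {"a": [0, 1], "b": [1, 2], "c": [2, 3]}
colors = node_averages(score, nodes)
print(colors)
```

The same `node_averages` helper applies to any other survey answer, which is all that "coloring the model" by a question requires.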
A similar perspective on this concept of the establishment is to color the network by the confidence in large companies, which is another question posed in the survey.
Unsurprisingly, this coloring is very similar to the one given by the aggregate confidence score in institutions given above.
This analysis can become incredibly detailed. What is worth noting is that I have less than an hour invested to this point in the analysis.
That is the power of topological data analysis: the main structure of the data surfaces within seconds, and we can then drill into the lines of inquiry that make sense, as opposed to crafting a hypothesis, testing it, re-crafting it, and so on. This is particularly valuable in a campaign environment where speed, insight and comprehensiveness are of the essence.
To give an indication about how one can further refine the analysis, we note that there is another interesting group, labeled below.
We’ll call this Group E. We can ask how E is distinguished from the Law + Order and L3 groups previously identified.
- Members of Group E have less confidence in government than members of either of the other establishment types (L3 and Law + Order).
- Group E has less confidence in churches, police, and armed forces than our Law + Order friends, but more than L3.
- Group E has less confidence in police than Law + Order, but roughly the same as L3.
One could summarize this behavior by saying that members of group E are less authoritarian than members of our Law + Order group, but are more cynical than members of L3.
The survey contains a multitude of other questions, all of which reveal interesting patterns in the data.
For example, one of the questions asks to what extent the respondent feels that he/she is in control of his/her life, where a low score reflects little control over one’s life and a high one represents a great deal of control. One can now color the model by the average value, and the resulting network is shown below.
Note that the coloring corresponds very closely with the vertical coordinate. Respondents near the bottom feel they have a high degree of control over their lives, and ones near the top have little control. Trump likely draws from all but the Red and Orange in this group, again, a larger group than his competitors and consistent with his populist message. What is interesting is that the answer to this question is not precisely synchronized with either the right/left value or the establishment/non-establishment value (since those correspond to diagonal lines) but rather with a combination of those two, which together create a vertical line.
As a reminder, those at the top of the network do not have confidence in institutions whereas those at the bottom generally do.
The takeaway here is that those who believe in institutions also believe that they have control over their lives. While this seems intuitive on some level, the effort to elicit this from the data would have been particularly time-consuming. To do it visually from the topological model takes only seconds.
Another question concerns immigrants. The survey asks respondents to react to the following statement: “when jobs are scarce, employers should give preference to native born people over immigrants.” There are three choices, with 1 denoting agreement, 3 denoting disagreement, and 2 denoting neither. The coloring of the model by the average value of this quantity over a node looks as follows.
In this case, the coloring appears to move roughly from left to right, with the left end of the “establishment” side disagreeing strongly with the premise, and the right end of the “non-establishment” group agreeing strongly.
This too is expected. The “Limousine Liberals” do not agree that jobs should go to Americans first, whereas the “evangelicals” believe they should.
It is interesting to compare the outcome of this question with the degree of confidence in the United Nations. The coloring by that confidence looks like this.
The coloring is almost reversed, which makes sense since a high number for the United Nations question corresponds to a low confidence level.
Thus the establishment left (L3) hold the UN in high regard whereas the “evangelicals” do not. Here again, Trump has sway, but as noted earlier, has to compete with other candidates on this issue.
Again, given a relatively small but complex dataset such as this, we can continue to dig deeper but will stop here for now. What we think we have done, however, is identify sections of the electorate that Trump is drawing from with his message – and it would appear that he has the capacity to draw more broadly than his peers, which may help explain his success. Further, with Trump, there is a bit of a self-fulfilling prophecy as well – the more coverage he gets, the more likely he is to find those disaffected voters from pockets that are outside of the mainstream political process.
It will be interesting to see how that translates to a primary process that generally only includes a small portion of the electorate – in this case the bottom right of the network.