Archetypal Analysis: Similarity Defined by Distances from Contrasting Ideals

Joel Cadwell

7 years ago

[This article was first published on Engaging Market Research, and kindly contributed to R-bloggers].

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Carl Jung was at least partially correct. We do tend to think in terms of the extremes as shown in this archetypal wheel with rulers versus outlaws and heroes versus caregivers at different ends of bipolar dimensions. Happily, we are not required to accept Jung’s collective unconscious to explain this tendency. Metaphorical thinking works just fine. For example, why not separate all political players into two camps based on our earliest experiences with powerful others: liberals as caregivers (supportive mothers) and conservatives as heroes (demanding and punishing fathers)?

Political ideology was selected as my example because of its universality and because R offers so many ways of analyzing such data. Probably the quickest introduction is through the voteview blog, which relies on a dimensional representation of our liberal and conservative archetypes (such as the following figure showing the polarization in the U.S. Congress).

Two points define a line, and it is seldom difficult to imagine a continuum between any two bipolar types, in this case between liberals and conservatives. Do we have a dimension or categories? It depends on any separation within the density distribution. Obviously, the distributions in both the House (light blue Democrats and light red Republicans) and the Senate (dark blue and red) are at least bimodal. Thus, we are free to represent the same data as points along the liberal-conservative dimension or as ratios of mixture coefficients for the two clusters (i.e., the odds ratio of membership likelihood in the red or blue clusters).
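To make the equivalence between the two representations concrete, here is a minimal sketch in Python rather than R. It assumes a two-component Gaussian mixture along the liberal-conservative dimension; the weights, means and standard deviations are hypothetical values chosen for illustration, not estimates from the voteview data, and the function names are my own.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def membership_odds(x, blue, red):
    """Posterior membership in each cluster for a score x.

    blue and red are (mixing weight, mean, sd) tuples.
    Returns (p_blue, p_red, odds of red versus blue).
    """
    blue_part = blue[0] * normal_pdf(x, blue[1], blue[2])
    red_part = red[0] * normal_pdf(x, red[1], red[2])
    total = blue_part + red_part
    return blue_part / total, red_part / total, red_part / blue_part

# Hypothetical mixture: two equally weighted humps on a -1 to +1 scale.
BLUE = (0.5, -0.4, 0.2)
RED = (0.5, 0.4, 0.2)

for score in (-0.4, -0.1, 0.0, 0.1, 0.4):
    p_blue, p_red, odds = membership_odds(score, BLUE, RED)
    print(f"score {score:+.1f}: P(blue)={p_blue:.3f} P(red)={p_red:.3f} odds red:blue={odds:.3g}")
```

Under these assumptions the odds rise steadily as the score moves from the blue hump to the red one, so a position on the dimension and a ratio of membership likelihoods carry the same information.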

The mclust R code and a more complete discussion can be found in an earlier post using likelihood to recommend as the dimension and promoters versus detractors as the clusters. To avoid any misunderstanding, the liberal-conservative continuum is a latent variable derived from a series of votes on a range of issues with the R package basicspace. Recommendation, on the other hand, is an observed likelihood rating along an 11-point scale from 0 to 10. In both cases, we are looking for evidence of separation as if we had a mixture of different generative models.

Given the above figure, liberal and conservative archetypes would be located toward the end points of this scale. That is, instead of describing the two clusters using their centroids positioned near the “humps” in the two curves, archetypal analysis attempts to describe political ideology in terms of idealized liberals and conservatives. These are not necessarily the most extreme points, as the archetypes R package makes clear with displays such as the following showing both the convex hull of the most extreme data points in gray and archetypes as the vertices of the internal red triangle. Three archetypes are necessary to locate any data point in the two-dimensional space.
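What it means for three archetypes to locate a point in two dimensions can be sketched with barycentric coordinates. The Python function below is an illustration under my own naming, not the estimation routine of the archetypes package: it takes three already-known vertices and returns the mixture weights that reproduce a given point. Archetypal analysis additionally requires the weights to be non-negative, so a point outside the triangle (where this formula returns a negative weight) would be approximated by a point on the triangle instead.

```python
def mixture_weights(point, a1, a2, a3):
    """Weights (w1, w2, w3), summing to 1, such that
    point = w1*a1 + w2*a2 + w3*a3 for 2-D vertices a1, a2, a3."""
    x, y = point
    x1, y1 = a1
    x2, y2 = a2
    x3, y3 = a3
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    if det == 0:
        raise ValueError("the three archetypes lie on one line")
    w1 = ((y2 - y3) * (x - x3) + (x3 - x2) * (y - y3)) / det
    w2 = ((y3 - y1) * (x - x3) + (x1 - x3) * (y - y3)) / det
    w3 = 1.0 - w1 - w2
    return w1, w2, w3

# Hypothetical archetypes at the corners of a right triangle.
A1, A2, A3 = (0.0, 0.0), (1.0, 0.0), (0.0, 1.0)

print(mixture_weights((0.25, 0.25), A1, A2, A3))  # a blend of all three
print(mixture_weights((1.0, 0.0), A1, A2, A3))    # sits exactly on A2
print(mixture_weights((1.0, 1.0), A1, A2, A3))    # outside: one weight is negative
```

With only two archetypes the weights could reproduce points on a single line, which is why a third vertex is needed before every point inside a two-dimensional region has a set of weights.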

Before continuing, we ought to review a few examples so that we understand what we mean by an archetype. If you live in a region that receives snow or just watch a lot of Christmas movies and I told you that it was a perfect winter day, that picture you just imagined is an ideal or archetype. All winter days can be described in terms of their distance from that ideal. The same can be said of spring, fall and summer days. If you are familiar with smoggy days, as was Leo Breiman when he introduced archetypal analysis to describe ozone levels in Los Angeles, then you know what a smog alert feels like. We use the shorthand provided by shared archetypes to summarize succinctly a large amount of information.

As you may have noticed, I have interchanged the words “ideal” and “archetype” in my writing. This was deliberate since archetypes tend to be seen when describing the ideal instantiation of a category rather than the average category member. Thus, when asked to tell you about a specific athlete, such as a basketball center, you are not likely to describe the average center nor the greatest center that ever played the game. Instead, one thinks about the role that the center plays in the game, lists those defensive and offensive contributions, and distinguishes this position from the other players on the team. Manuel Eugster demonstrates how the R package archetypes would uncover such archetypal athletes.

Of course, there is no requirement that forces us to retrieve goal-derived categories and their associated ideals from memory. We could evaluate “on a curve” and think about the average basketball center, as we might if asked to guess the average height of an NBA center. Yet, the center in basketball serves a purpose within a team of other players with other purposes. Not unlike the archetypal wheel that introduced this post, the center is defined in contrast to the other positions on the team. The rules of the game play a role in the clustering of players, with similarity measured not by distance from the average but by distance from the ideal. Therefore, two centers are similar because they play similar roles in the game, that is, both are close to the ideal center. Moreover, they are seen as even more alike when guards are added into the mix. Similarity is shaped by the context of competing archetypes or ideals.

In one of my first posts, I demonstrated how the R package archetypes would identify feature usage types. Repeatedly, we find that usage intensity has the greatest impact, differentiating the light from the heavy user. I have reproduced a figure from that previous post showing both the k-means clusters (the K’s) and the position of the archetypes (the A’s).

The data are 10 feature usage ratings that are projected onto the plane formed by the first two principal components. The points are respondents and the arrows represent the features. All the arrows point to the right indicating that the first principal component reflects usage intensity with heavier usage toward the right in the direction that all the arrows point. As you know, the angles between arrows reflect their correlations, so that the two bundles of arrows suggest a two-factor solution. We can call such a pattern a bifactor solution: a general factor separating light and heavy users and two specific factors distinguishing between those more involved with each of the two bundles of feature sets. It is worth your time to become familiar with this factor structure because it reappears frequently with usage data, as well as preference and satisfaction.

Do you see clusters of data points in the above scatterplot? The three centroids from a K-means clustering follow the path of the first principal component with a low usage (K2), a medium usage (K1) and a high usage (K3) segment. Personally, I find it difficult to separate out clusters in this data cloud. I see a fan-spread distribution with the amount of variation on the second dimension dependent on the value of the first dimension, that is, little or no feature usage among light users and increasing separation of the two feature bundles for heavier users. The archetypes reveal this pattern by forming a triangle with vertices at no usage (A3), bundle A1 usage and bundle A2 usage. K-means yields a restatement of usage intensity along the first dimension, while archetypal analysis summarizes the data as contained within the triangle formed by three usage types.
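The fan-spread pattern can be mimicked with a small simulation. This is a sketch in Python on made-up data, not the survey data behind the figure: each hypothetical respondent gets a usage intensity on the first dimension and a bundle preference whose effect on the second dimension scales with that intensity.

```python
import random
import statistics

random.seed(1)  # fixed seed so the simulated cloud is reproducible

def simulate_respondent():
    """Return (dim1, dim2) for one hypothetical respondent."""
    intensity = random.random()        # 0 = no usage, 1 = heaviest usage
    split = random.uniform(-1.0, 1.0)  # leaning toward one bundle or the other
    return intensity, intensity * split

points = [simulate_respondent() for _ in range(1000)]

def spread(lo, hi):
    """Standard deviation on dim2 for respondents with lo <= dim1 < hi."""
    return statistics.pstdev(d2 for d1, d2 in points if lo <= d1 < hi)

light_spread = spread(0.0, 1 / 3)
heavy_spread = spread(2 / 3, 1.0)

# Every simulated point lies inside the triangle with vertices
# (0, 0) for no usage, (1, 1) for one bundle and (1, -1) for the other.
inside_triangle = all(abs(d2) <= d1 for d1, d2 in points)

print(f"spread among light users: {light_spread:.3f}")
print(f"spread among heavy users: {heavy_spread:.3f}")
print(f"all points inside the triangle: {inside_triangle}")
```

In data built this way there are no gaps for a clustering routine to find along the first dimension, yet the cloud is fully described by its three corners, which is the kind of structure the triangle of archetypes picks up.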
