[This article was first published on Engaging Market Research, and kindly contributed to R-bloggers].
Dichotomies come easily to us, especially when they are caricatures as shown in this cartoon. These personality types do seem real, and without much difficulty, we can anticipate how they might react in different situations. For example, if we were to give our Type A and Type B vacationers a checklist to indicate what activities they would like to do on their next trip, we would expect to observe two different patterns. Type A would select the more adventurous and challenging activities, while Type B would pick the opposite. That is, if we were to array the activities from relaxing to active, our Type A would be marking the more active items with the relaxing portion of the scale being checked by our Type B respondent. Although our example is hypothetical, market segmentation in tourism is an ongoing research area as you will see if you follow the link to an article by Sara Dolnicar, whose name is associated with several R packages.
Yet, personality type does not explain all the heterogeneity we observe. We would expect a different pattern of check marks for Type A and B, but we would not be surprised if “type” were also a matter of degree with respondents more or less reflecting their type. The more “devout” Type A checks only the most active items and rejects the less active along with all the passive activities. Similarly, the more “pure” Type B is likely to want only the most relaxing activities. Thus, we might need both personality type (categorical) and intensity of respective type (continuous) in order to explain all the observed heterogeneity.
Should we think of this dimension as graded membership in a personality type so that we need to represent personality type by a probability rather than all or none types? I would argue that vacation preference can be described more productively by two latent variables: a categorical framing or orientation (Type A vs. Type B) and a continuous latent trait controlling the expression of the type (perhaps a combination of prior experience plus risk aversion and public visibility). It’s a two-step process. One picks a theme and then decides how far to go.
Of course, some customers might be “compromisers” searching for the ideal vacation that balances active and relaxing, that is, the “just right” vacation. In such a case we would need an ideal point item response model (e.g., 2013 US Senate ideal points and R code for the analysis). However, to keep the presentation simple, we will assume that our vacationers want only a short trip with a single theme: either a relaxing break or an active getaway. To clarify, voting for a U.S. Senator is a compromise because I select (vote for) a single individual who is closest to my positions on a large assortment of issues. Alternatively, a focused purchase such as a short vacation can seek to accomplish only one goal (e.g., most exciting, most relaxing, most educational, or most romantic).
In a previous post I showed how brand perceptions could be modeled using item response theory. Individuals do see brands differently, but those differences raise or lower all the attributes together rather than changing the rank ordering of the items. For instance, regardless of your like or dislike for BMWs, everyone would tend to see the attribute “well-engineered” as more associated with the car maker than “reasonably priced.” Brand perceptions are constrained by an associative network connecting the attributes so that “a rising tide lifts all boats”. As we have seen in our Type A-Type B example, this is not the case with preferences, which can often be ordered in reverse directions.
Where’s the Latent Variable Mixture?
Heterogeneity among our respondents is explained by two latent variables. We cannot observe personality type, but we believe that it takes one of two possible values: Type A or Type B. If I were to select a respondent at random, they would be either Type A or Type B. In the case of our vacationers, being Type A or Type B would mean that they would see their vacation as an opportunity for something challenging or as a chance to relax. Given their personality type frame, our respondents need to decide next the intensity of their commitment. Because intensity is a continuous latent variable, we have a latent variable mixture.
Let’s look at some R code and see if some concrete data will help clarify the discussion. We can start with a checklist containing 8 items ranging from relaxing to active, and we will need two groups of respondents for our Type A and Type B personalities. The sim.rasch() function from the psych package will work.
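    library(psych)
    set.seed(16566)

    # Type A: relaxing items (V1...) are hard to check, active items (...V8) are easy
    TypeA<-sim.rasch(nvar=8, n=100, d=c(+2.0, +1.5, +1.0, +0.5, -0.5, -1.0, -1.5, -2.0))
    # Type B: the same difficulties in reverse order
    TypeB<-sim.rasch(nvar=8, n=100, d=c(-2.0, -1.5, -1.0, -0.5, +0.5, +1.0, +1.5, +2.0))

    # Stack the 0/1 response matrices and record the (normally unobserved) type
    more<-rbind(TypeA$items,TypeB$items)
    segment<-c(rep(1,100),rep(2,100))

    # Counts of 0s and 1s per item: pooled sample, then each type separately
    apply(more,2,table)
    apply(TypeA$items, 2, table)
    apply(TypeB$items, 2, table)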
The sim.rasch() function begins with a series of default values. By default, our n=100 Type A and Type B vacationers come from a latent trait that is normally distributed with mean 0 and standard deviation 1. This latent trait can be thought of as intensity, as we will soon see. So far the two types are the same, that is, two random samples from the same normal latent distribution. Their only difference is in d, which stands for difficulty. The term “difficulty” comes to us from educational testing where the latent trait is ability. A student has a certain amount of latent ability, and each item has a difficulty that “tests” the student’s latent ability. Because latent ability and item difficulty are measured on the same scale, a student with average ability (mean=0) has a 50-50 chance of correctly answering an item of average difficulty (d=0). If d is a negative value, then the item is easier and our average student has a better than 50-50 chance of getting it right. On the other hand, items with positive d are more difficult and pose a greater challenge for our average student.
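The 50-50 claim follows directly from the usual Rasch response function. As a sketch, assuming sim.rasch() uses the standard form with discrimination fixed at 1 (the symbols below are notation introduced here, not taken from the package), the probability that respondent i checks item j is:

$$
P(x_{ij} = 1 \mid \theta_i, d_j) = \frac{1}{1 + e^{-(\theta_i - d_j)}}
$$

When the latent trait equals the difficulty, the exponent is zero and the probability is exactly 1/2. A negative d makes the exponent negative for the average respondent and pushes the probability above 1/2, while a positive d does the reverse.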
In our example, the eight activities flow from more relaxing to more active. Let’s take a look at how an average Type A and Type B would respond to the checklist. Our average Type A has a latent intensity of 0, so the first item is a severe test with d=+2, and they are not likely to check it. The opposite is true for our average Type B respondent since d=-2 for their personality type. Checking relaxing items is easy for Type B and hard for Type A. And this pattern continues with difficulty moving in opposite directions for our two types until we reach item 8, which you will recall is the most active activity. It is an easy item for Type A (d=-2) because they like active. It is a difficult item for Type B (d=+2) because they dislike active. As a reminder, if our vacationers were seeking balance or if our items were too extreme (i.e., more challenging than Type As wanted or more relaxing than sought by our Type Bs), we would be fitting an ideal point model.
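To make these "easy" and "hard" items concrete, here is a minimal sketch that evaluates the check probabilities for a few latent intensities. It assumes the standard Rasch form (the logistic of intensity minus difficulty), and p_check() is a helper name made up for this illustration.

```r
# Item difficulties copied from the simulation code in this post
d_A <- c(+2.0, +1.5, +1.0, +0.5, -0.5, -1.0, -1.5, -2.0)  # Type A
d_B <- rev(d_A)                                           # Type B: same values, reversed

# Hypothetical helper: Rasch probability of checking an item,
# assuming P(check) = logistic(theta - d)
p_check <- function(theta, d) plogis(theta - d)

# Average respondent of each type (latent intensity theta = 0)
p_A <- p_check(0, d_A)
p_B <- p_check(0, d_B)
round(rbind(TypeA = p_A, TypeB = p_B), 2)

# One standard deviation above average: every probability for that type rises together
round(p_check(1, d_A), 2)
```

For the average Type A the probabilities climb from about 0.12 on the most relaxing item to about 0.88 on the most active one, and the Type B profile is the mirror image. Changing intensity moves a type's whole profile up or down without reversing its order.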
The sim.rasch() function stores its results in a list so that you need to access $items in order to retrieve the response patterns of zeros and ones for the two groups. If you run the apply functions to get your tables (see below), you will see that the frequency of checks (response=1) increases across the 8 items for Type A and decreases for Type B, as one might have expected. Clearly, with real data we know none of this and all we have is a mixture of unobserved types of unobserved intensity.
Clustering and Other Separation Mechanisms
Unaware that our sample is a mixture of two separate personality types, we would be misled looking at the aggregate descriptive statistics. The total column suggests that all the activities are almost equally appealing when clearly that is not the case.
Number Checking Each Item

| Item | Total | Type A | Type B |
|---|---|---|---|
| V1 | 100 | 14 | 86 |
| V2 | 103 | 22 | 81 |
| V3 | 96 | 28 | 68 |
| V4 | 104 | 43 | 61 |
| V5 | 94 | 60 | 34 |
| V6 | 114 | 79 | 35 |
| V7 | 94 | 75 | 19 |
| V8 | 105 | 87 | 18 |
| n | 200 | 100 | 100 |
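A table like this one can be assembled from column sums of the 0/1 response matrices. The sketch below is self-contained and uses a base R stand-in for sim.rasch() (the helper sim_checks() and the seed are assumptions made for illustration), so its counts will not match the figures above.

```r
set.seed(1)  # arbitrary seed; counts will differ from the table in this post

# Hypothetical base R stand-in for sim.rasch():
# theta ~ N(0, 1) and P(check) = logistic(theta - d)
sim_checks <- function(n, d) {
  theta <- rnorm(n)
  p <- plogis(outer(theta, d, "-"))  # n x length(d) matrix of check probabilities
  matrix(rbinom(length(p), 1, p), nrow = n,
         dimnames = list(NULL, paste0("V", seq_along(d))))
}

d_A <- c(+2.0, +1.5, +1.0, +0.5, -0.5, -1.0, -1.5, -2.0)
A <- sim_checks(100, d_A)        # Type A responses
B <- sim_checks(100, rev(d_A))   # Type B responses (reversed difficulties)
more <- rbind(A, B)

# Number checking each item: pooled sample and each type separately
counts <- cbind(Total = colSums(more), TypeA = colSums(A), TypeB = colSums(B))
counts
```

With the objects from the psych simulation, the same cbind() call should work after swapping in TypeA$items and TypeB$items for A and B.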
To get a better picture of the mixture of these two groups, we can look at the plot of all the respondents in the two-dimensional space formed by the first two principal components. This is fairly easy to do in R using prcomp() to calculate the principal component scores and plotting the unobserved personality type (which we only know because the data are simulated) along with arrows representing the projection of the 8 items onto this space.
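    # Principal component scores for all 200 respondents
    pc<-prcomp(more)

    # Empty plot of the first two components, then label each respondent by type
    plot(pc$x[,1:2],type="n")
    text(pc$x[,1:2],col=segment,labels=segment)

    # Arrows from the origin show how the 8 items project onto this space
    arrows(0,0,pc$rotation[,1],pc$rotation[,2], length=.1)
    text(pc$rotation[,1:2],labels=rownames(pc$rotation),cex=.75)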
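As a numerical companion to the plot, one can summarize the component scores by type. This is only a sketch: it uses a base R stand-in for sim.rasch() (the helper sim_checks() and the seed are assumptions), and it leans on the known type labels, which are available only because the data are simulated.

```r
set.seed(1)  # arbitrary seed, not the one used in this post

# Hypothetical base R stand-in for sim.rasch(): P(check) = logistic(theta - d)
sim_checks <- function(n, d) {
  theta <- rnorm(n)
  p <- plogis(outer(theta, d, "-"))
  matrix(rbinom(length(p), 1, p), nrow = n,
         dimnames = list(NULL, paste0("V", seq_along(d))))
}

d_A <- c(+2.0, +1.5, +1.0, +0.5, -0.5, -1.0, -1.5, -2.0)
more <- rbind(sim_checks(100, d_A), sim_checks(100, rev(d_A)))
segment <- rep(1:2, each = 100)

pc <- prcomp(more)

# Mean score on the first two components for each type
pc_means <- aggregate(pc$x[, 1:2], by = list(segment = segment), FUN = mean)
pc_means

# Cross-tabulate type against the sign of the first component score
table(segment, PC1_positive = pc$x[, 1] > 0)
```

If the two types separate along one of these components, their mean scores on it will sit well apart, and the cross-tabulation will show how cleanly a simple split on the first component recovers the types.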