
Latent Variable Mixture Modeling: When Heterogeneity Requires Both Categories and Dimensions

Dichotomies come easily to us, especially when they are caricatures, such as the classic Type A and Type B personalities. These personality types do seem real, and without much difficulty, we can anticipate how they might react in different situations. For example, if we were to give our Type A and Type B vacationers a checklist to indicate what activities they would like to do on their next trip, we would expect to observe two different patterns. Type A would select the more adventurous and challenging activities, while Type B would pick the opposite. That is, if we were to array the activities from relaxing to active, our Type A would be marking the more active items, while the relaxing portion of the scale would be checked by our Type B respondent. Although our example is hypothetical, market segmentation in tourism is an ongoing research area, as you will see in articles by Sara Dolnicar, whose name is associated with several R packages.

Yet, personality type does not explain all the heterogeneity we observe.  We would expect a different pattern of check marks for Type A and B, but we would not be surprised if “type” were also a matter of degree with respondents more or less reflecting their type.  The more “devout” Type A checks only the most active items and rejects the less active along with all the passive activities.  Similarly, the more “pure” Type B is likely to want only the most relaxing activities.  Thus, we might need both personality type (categorical) and intensity of respective type (continuous) in order to explain all the observed heterogeneity.

Should we think of this dimension as graded membership in a personality type so that we need to represent personality type by a probability rather than all or none types?  I would argue that vacation preference can be described more productively by two latent variables:  a categorical framing or orientation (Type A vs. Type B) and a continuous latent trait controlling the expression of the type (perhaps a combination of prior experience plus risk aversion and public visibility).  It’s a two-step process.  One picks a theme and then decides how far to go.

Of course, some customers might be "compromisers" searching for the ideal vacation that balances active and relaxing, that is, the "just right" vacation. In just such a case we would need an ideal point item response model (e.g., 2013 US Senate ideal points and R code for the analysis). However, to keep the presentation simple, we will assume that our vacationers want only a short trip with a single theme: either a relaxing break or an active getaway. To clarify, voting for a U.S. Senator is a compromise because I select (vote for) a single individual who is closest to my positions on a large assortment of issues. Alternatively, a focused purchase such as a short vacation can seek to accomplish only one goal (e.g., most exciting, most relaxing, most educational, or most romantic).

In a previous post I showed how brand perceptions could be modeled using item response theory.  Individuals do see brands differently, but those differences raise or lower all the attributes together rather than changing the rank ordering of the items.  For instance, regardless of your like or dislike for BMWs, everyone would tend to see the attribute “well-engineered” as more associated with the car maker than “reasonably priced.”  Brand perceptions are constrained by an associative network connecting the attributes so that “a rising tide lifts all boats”.  As we have seen in our Type A-Type B example, this is not the case with preferences, which can often be ordered in reverse directions.

Where’s the Latent Variable Mixture?

Heterogeneity among our respondents is explained by two latent variables.  We cannot observe personality type, but we believe that it takes one of two possible values:  Type A or Type B.  If I were to select a respondent at random, they would be either Type A or Type B.  In the case of our vacationers, being Type A or Type B would mean that they would see their vacation as an opportunity for something challenging or as a chance to relax.  Given their personality type frame, our respondents need to decide next the intensity of their commitment.  Because intensity is a continuous latent variable, we have a latent variable mixture.

Let’s look at some R code and see if some concrete data will help clarify the discussion.  We can start with a checklist containing 8 items ranging from relaxing to active, and we will need two groups of respondents for our Type A and Type B personalities.  The sim.rasch() function from the psych package will work.

library(psych)
set.seed(16566)

# Type A: active items (V5-V8) are easy to endorse (negative d),
# relaxing items (V1-V4) are hard (positive d)
TypeA <- sim.rasch(nvar = 8, n = 100,
  d = c(+2.0, +1.5, +1.0, +0.5, -0.5, -1.0, -1.5, -2.0))

# Type B: the mirror image, with relaxing items easy and active items hard
TypeB <- sim.rasch(nvar = 8, n = 100,
  d = c(-2.0, -1.5, -1.0, -0.5, +0.5, +1.0, +1.5, +2.0))

# stack the two groups into one 200 x 8 response matrix
# and keep a vector recording the true segment membership
more <- rbind(TypeA$items, TypeB$items)
segment <- c(rep(1, 100), rep(2, 100))

# frequencies of 0s and 1s for each item, overall and by type
apply(more, 2, table)
apply(TypeA$items, 2, table)
apply(TypeB$items, 2, table)


The sim.rasch() function begins with a series of default values.  By default, our n=100 Type A and Type B vacationers come from a latent trait that is normally distributed with mean 0 and standard deviation 1.  This latent trait can be thought of as intensity, as we will soon see.  So far the two types are the same, that is, two random samples from the same normal latent distribution.  Their only difference is in d, which stands for difficulty.  The term "difficulty" comes to us from educational testing where the latent trait is ability.  A student has a certain amount of latent ability, and each item has a difficulty that "tests" the student's latent ability.  Because latent ability and item difficulty are measured on the same scale, a student with average ability (mean=0) has a 50-50 chance of correctly answering an item of average difficulty (d=0).  If d is negative, then the item is easier and our average student has a better than 50-50 chance of getting it right.  On the other hand, items with positive d are more difficult and pose a greater challenge for our average student.
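To make that scaling concrete, the Rasch model sets the probability of a check to a logistic function of the gap between latent intensity and item difficulty. The short sketch below is not part of the original analysis (the helper p_check() is made up for illustration); it simply evaluates that curve for an average respondent against the two difficulty vectors used above.

# Rasch item response function: P(check) = 1 / (1 + exp(d - theta))
p_check <- function(theta, d) plogis(theta - d)

theta <- 0                                                 # an average respondent
d_A <- c(+2.0, +1.5, +1.0, +0.5, -0.5, -1.0, -1.5, -2.0)   # Type A difficulties
d_B <- rev(d_A)                                            # Type B mirrors Type A

round(p_check(theta, d_A), 2)   # low for relaxing items, high for active items
round(p_check(theta, d_B), 2)   # the reverse pattern for an average Type B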

In our example, the eight activities flow from more relaxing to more active.  Let’s take a look at how an average Type A and Type B would respond to the checklist.  Our average Type A has a latent intensity of 0, so the first item is a severe test with d=+2, and they are not likely to check it.  The opposite is true for our average Type B respondent since d=-2 for their personality type.  Checking relaxing items is easy for Type B and hard for Type A.  And this pattern continues with difficulty moving in opposite directions for our two types until we reach item 8, which you will recall is the most active activity.  It is an easy item for Type A (d=-2) because they like active.  It is a difficult item for Type B (d=+2) because they dislike active.  As a reminder, if our vacationers were seeking balance or if our items were too extreme (i.e., more challenging than Type As wanted or more relaxing than sought by our Type Bs), we would be fitting an ideal point model.

The sim.rasch() function stores its results in a list, so you need to access $items in order to retrieve the response patterns of zeros and ones for the two groups.  If you run the apply functions to get your tables (see below), you will see that the frequency of checks (response=1) increases across the 8 items for Type A and decreases for Type B, as one might have expected.  Clearly, with real data we know none of this; all we have is a mixture of unobserved types with unobserved intensity.

Clustering and Other Separation Mechanisms

Unaware that our sample is a mixture of two separate personality types, we would be misled by the aggregate descriptive statistics.  The Total column in the table below suggests that all the activities are almost equally appealing, when clearly that is not the case.

Number Checking Each Item

Item   Total   Type A   Type B
V1       100       14       86
V2       103       22       81
V3        96       28       68
V4       104       43       61
V5        94       60       34
V6       114       79       35
V7        94       75       19
V8       105       87       18
n        200      100      100
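The counts above are just the column sums of ones in each response matrix. A one-line reproduction from the objects created earlier (not code shown in the original post) is:

# number of respondents checking each item, overall and by simulated type
cbind(Total = colSums(more),
      "Type A" = colSums(TypeA$items),
      "Type B" = colSums(TypeB$items))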

To get a better picture of the mixture of these two groups, we can look at the plot of all the respondents in the two-dimensional space formed by the first two principal components.  This is fairly easy to do in R using prcomp() to calculate the principal component scores and plotting the unobserved personality type (which we only know because the data are simulated) along with arrows representing the projection of the 8 items onto this space.

# principal components of the 8-item response matrix
pc <- prcomp(more)
# plot respondents, labeled and colored by their true (simulated) segment
plot(pc$x[, 1:2], type = "n")
text(pc$x[, 1:2], col = segment, labels = segment)
# overlay the item loadings as arrows and label them
arrows(0, 0, pc$rotation[, 1], pc$rotation[, 2], length = .1)
text(pc$rotation[, 1:2], labels = rownames(pc$rotation), cex = .75)


The resulting plot shows the separation between our two personality types (labeled 1 and 2 for A and B, respectively) and the distortion in the covariance structure that splits the 8 items into two factors (the first 4 items vs. the last 4 items).


Obviously, we need to "unmix" our data, and as you might have guessed from the well-defined separation in the above plot, any cluster analysis ought to be able to recover the two segments successfully (had we known that the number of clusters was two).  K-means works fine, correctly classifying 94% of the respondents.
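The k-means run itself was not shown. A minimal sketch, assuming two clusters are requested from base R's kmeans() and agreement is scored against the known segment labels (the seed below is arbitrary, so your agreement rate may differ from the 94% reported above), could look like this:

# k-means on the raw 0/1 response matrix, asking for two clusters
set.seed(4321)                        # arbitrary seed, not from the original post
km <- kmeans(more, centers = 2, nstart = 25)

# cross-tabulate recovered clusters against the true segments;
# cluster labels are arbitrary, so take the better of the two matchings
tab <- table(km$cluster, segment)
max(sum(diag(tab)), tab[1, 2] + tab[2, 1]) / sum(tab)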

Had we stopped with a single categorical latent variable, however, we would have lost the ordering of our items.  This is the essence of item response theory.  Saying “It’s the ordering of the items, stupid” might be a bit strong but may be required to focus attention on the need for an item response model.  In addition to a categorical type, our data require a dimension or continuous latent variable that uses the same scale to differentiate simultaneously among items and individuals.  Categories alone are not sufficient to describe fully the heterogeneity in our data.

The R package psychomix

The R package psychomix provides an accessible introduction to latent variable mixture modeling without requiring you to master all the details.  However, Muthen provides a readable overview for those wanting a more comprehensive summary.  Searching his chapter for "IRT" should help one see where psychomix fits into the larger framework of latent variable mixture modeling.

We will be using the raschmix() function to test for 1 to 3 mixture components with different difficulty parameters.  Obviously, we never know the respondents' personality types with real data.  In fact, we may not know whether we have a mixture of different types at all.  All we have is the response pattern of check marks across the 8 items.  The function raschmix() must help us decide how many latent categories there are, if any, and estimate the item difficulty parameters within each category.  Fortunately, it all becomes clear with an example, so here is the R code to run the analysis.


library(psychomix)

# fit Rasch mixture models with 1, 2, and 3 latent classes
mixture <- raschmix(more, k = 1:3)

## inspect results
mixture
plot(mixture)

## select the model with the best BIC
BIC(mixture)
best <- getModel(mixture, which = "BIC")
summary(best)

# compare the recovered classes with the true segments, dropping the
# respondents with all-0 or all-8 response patterns that raschmix() removes
group <- clusters(best)
person <- apply(more, 1, sum)
table(group, segment[person > 0 & person < 8])


At a minimum, the function raschmix() needs a data matrix [not a data frame, so use as.matrix()] and the number of mixtures to be tested.  We have set k=1:3, so that we can compare the BIC for 1, 2, and 3 mixtures.  The results have been stored in a list called mixture, and one extracts information from the list using methods.  For example, typing “mixture” (the name of the list object holding the results) will produce the summary fit statistics.
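If your own checklist responses arrive as a data frame, the conversion mentioned above is a single call. The object checklist_df below is a hypothetical stand-in for real survey data, used only to illustrate the step:

# raschmix() wants a numeric 0/1 matrix, not a data frame
checklist_df <- as.data.frame(more)   # stand-in: pretend this came from a survey
resp <- as.matrix(checklist_df)
mix <- raschmix(resp, k = 1:3)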

   iter  converged  k  k0  logLik   AIC   BIC   ICL
1     2       TRUE  1   1   -1096  2223  2272  2222
2    10       TRUE  2   2    -949  1956  2051  2011
3    76       TRUE  3   3    -938  1963  2103  2083

Although one should use indexes such as the BIC cautiously, these statistics suggest that there are two mixture components.  Because raschmix() relies on an expectation-maximization (EM) algorithm, you ought not be surprised if you get somewhat different results when you run this code.  In fact, the solution for the 3-mixture model may not converge under the default 100-iteration limit.  We use the getModel() method to extract the two-mixture model with the best (lowest) BIC and print out the solution with summary().
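Because the EM solution depends on its random starting values, it can help to fix the seed and raise the iteration cap when the 3-component model struggles. A hedged sketch, assuming the control list is passed through to the underlying flexmix engine (check ?raschmix for the options available in your version):

# fix the EM starting values and allow more iterations than the default 100
set.seed(20130615)                     # arbitrary seed for reproducibility
mixture <- raschmix(more, k = 1:3,
                    control = list(iter.max = 200))
BIC(mixture)                           # compare fit across 1, 2, and 3 components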

         prior  size  post>0  ratio
Comp.1   0.503    97     178  0.545
Comp.2   0.497    97     169  0.574

Item Parameters:
      Comp.1  Comp.2
V1      2.35   -2.24
V2      1.64   -1.72
V3      1.21   -0.83
V4      0.38   -0.41
V5     -0.35    0.79
V6     -1.58    0.80
V7     -1.17    1.65
V8     -2.47    1.97

We started with 200 respondents, but six were removed because three checked none of the items and three checked all of the items.  That is why the sizes for the two mixture components do not sum to 200.  The item parameters are the recovered estimates of the item difficulties that we specified with the d argument in sim.rasch() when we randomly generated the data.  The first component looks like our Type A personality, with the easiest-to-check activities (those with negative difficulty parameters) toward the end of the list.  Type B is the opposite: the more passive activities at the beginning of the list are the easiest to check because they are the most preferred by this segment.
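As a quick check (not in the original post), the row sums already computed as person in the code above identify who gets dropped: a raw score of 0 or 8 carries no information about the relative difficulty of the items.

# respondents checking none (score 0) or all (score 8) of the items
person <- rowSums(more)
table(person)                      # distribution of raw scores
which(person == 0 | person == 8)   # indices of the respondents raschmix() drops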

Finally, the last three lines of R code identify the cluster membership for each retained respondent using the psychomix method clusters() and then verify its accuracy with table().  As we saw with k-means earlier, we are able to correctly identify almost all the personality types when the two segments are well-separated by the reverse ordering of their difficulty parameters.

Of course, real data can be considerably messier than our simulation with sim.rasch(), requiring us to think hard before we start the analysis. In particular, items must be carefully selected since we are attempting to separate respondents using different response generation processes based solely on their pattern of checked boxes.  Fortunately, markets have an underlying structure that helps us understand how consumer heterogeneity is formed and maintained.

