Warning: Sawtooth’s MaxDiff Is Nothing More Than a Technique for Rank Ordering Features!


Sawtooth Software has created a good deal of confusion with its latest sales video published on YouTube.  I was contacted last week by one of my clients who had seen the video and wanted to know why I was not using such a powerful technique for measuring attribute importance.  “It gives you a ratio scale,”  he kept repeating.  And that is what Sawtooth claims. At about nine minutes into the video, we are told that Maximum Difference Scaling yields a ratio scale where a feature with a score of ten is twice as important as a feature with a score of five.

Where should I begin?  Perhaps the best approach is simply to look at the example that Sawtooth provides in the video.  Sawtooth begins with a list of  the following 10 features that might be important to customers when selecting a fast food restaurant:

1. Clean eating areas (floors, tables, and chairs),
2. Clean bathrooms,
3. Has health food items on the menu,
4. Typical wait time is about 5 minutes in line,
5. Prominently shows calorie information on menu,
6. Prices are very reasonable,
7. Your order is always completed correctly,
8. Has a play area for children,
9. Food tastes wonderful, and
10. Restaurant gives generously to charities.

Sawtooth argues in the video that it becomes impractical for a respondent to rank order more than about seven items.  Although that might be true for a phone interview, MaxDiff surveys are administered on the internet.  How hard is it to present all 10 features and ask a respondent which is the most important?  Let’s pretend we are respondents with children.  That was easy; “has a play area for children” is the most important.  Now the screen is refreshed with only the remaining nine features, and the respondent is again asked to select the most important feature.  This continues until all the features have been rank ordered.

What if there were 20 features?  The task gets longer, and respondents might require some incentive, but it does not become more difficult.  Picking the best from a list becomes more time consuming as the list gets longer; however, the cognitive demands of the task remain the same.  Respondents work their way down the list, comparing each new feature to whichever feature they last judged most important.  For example, our hypothetical respondent has selected play area for children as the most important feature.  If another feature were added to the list, they would compare the new feature to play area for children and decide whether to keep play area or replace it with the new feature.  (A small R sketch of this procedure follows.)
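
The following R sketch mimics this procedure: repeatedly scan the remaining features for the current best, record it, and refresh the list.  The feature names come from Sawtooth's example; the preference scores are hypothetical values invented purely for illustration.

# Hypothetical illustration: a full ranking is just repeated "pick the best."
features <- c("clean eating areas", "clean bathrooms", "health food items",
              "5 minute wait", "calorie information", "reasonable prices",
              "order always correct", "play area for children",
              "food tastes wonderful", "gives to charities")
preference <- c(7, 6, 2, 5, 1, 8, 9, 10, 9.5, 0.5)  # made-up latent preferences

ranking <- character(0)
remaining <- seq_along(features)
while (length(remaining) > 0) {
  best <- remaining[which.max(preference[remaining])]  # one pass down the list
  ranking <- c(ranking, features[best])                # record the current best
  remaining <- setdiff(remaining, best)                # refresh the screen without it
}
ranking  # the complete rank order after 10 easy questions

No matter how long the list grows, each step remains the same simple comparison.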

Sawtooth argues that such a rank ordering is impractical and substitutes a series of best and worst choices from a reduced set of features.  For example, the first four features might be presented to a respondent who is asked to select the most and least important among only those four features.  Since Sawtooth explains their data collection and analysis procedures in detail on their website, I will simply provide the link here and make a couple of points.  First, one needs a lot of these best-worst selections from sets of four features in order to make individual estimates (think incomplete block designs).  Second, it is not the most realistic or interesting task (if you do not believe me, go to the previous link and take the web survey example).  Consequently, only a limited number of best-worst sets are presented to any one respondent, and individual estimates are calculated using hierarchical Bayesian estimation.
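
To give a feel for the design problem, here is a rough base R sketch that divides 10 features into best-worst sets of four, with each feature appearing the same number of times.  This is only a toy version of the idea, not Sawtooth's design algorithm; a proper incomplete block design would also balance how often pairs of features appear together.

# Toy design: 10 features, sets of 4, each feature shown twice (5 sets total).
set.seed(42)
n_items  <- 10
set_size <- 4
reps     <- 2
repeat {
  pool <- sample(rep(seq_len(n_items), reps))          # shuffle 20 slots
  sets <- matrix(pool, ncol = set_size, byrow = TRUE)  # chop into sets of 4
  if (all(apply(sets, 1, anyDuplicated) == 0)) break   # no feature twice in a set
}
sets  # each row is one best-worst question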

This is where most readers who are not statisticians get lost.  The video claims that hierarchical Bayes yields ratio-scale estimates that sum to a constant.  Obviously, this cannot be correct, not for ratio or interval scaling: when scores are forced to sum to a constant, raising one score necessarily lowers the others, so the scores are relative by construction.  The ratio-scale claim refers to the finding that one feature might be selected twice as often as another from the list.  But that “ratio” depends on what else is in the list; it is not a characteristic of the feature alone.  If you change the list, or change the wording of the items in the list, you will get a different result.  For example, what if the wording for the price feature were changed from “very reasonable” to just “reasonable,” without the adverb?  How much does the ranking depend on the superlatives used to modify the features?  Everything is relative.  All the scores from Sawtooth’s MaxDiff are relative to the features included in the set and to the way they are described (e.g., vividness and ease of affective imagery will also have an impact).
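
A quick simulation makes the point.  Assume a simple logit choice rule (my assumption for illustration, not Sawtooth's exact estimation model) and made-up utilities; the share of times a feature is picked as best changes as soon as the competing list changes.

# How often "price" wins depends entirely on what it competes with.
softmax <- function(u) exp(u) / sum(exp(u))

u1 <- c(price = 2.0, taste = 2.5, play_area = 1.0)  # hypothetical utilities
round(softmax(u1)["price"], 2)                      # price's share in this set

u2 <- c(u1, clean = 2.2, wait = 1.8)                # add two strong competitors
round(softmax(u2)["price"], 2)                      # same feature, smaller "ratio"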

To make it clear that MaxDiff is nothing more than a rank ordering of the features, consider the following thought experiment.  Suppose that you went through the feature list and rank ordered the 10 features.  Now you are given a set of four features, but I will use your rankings to describe the features, where 1 = most important and 10 = least important.  If the set included the features ranked 3rd, 5th, 7th, and 10th, then you would select feature 3 as the most important and feature 10 as the least important.  We could do this forever, because selecting the best and worst depends only on the rank ordering of the features.  Moreover, it does not matter how close or far away the features are from each other; only their rankings matter.
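
The invariance is easy to demonstrate in R.  Below, the scores are hypothetical and ordered so that index equals rank; applying any monotone transformation of the scores leaves every best and worst selection unchanged.

# Best-worst picks depend only on rank order, not on the score values.
scores <- c(10, 8, 6, 5, 3, 2.5, 2, 1.5, 1, 0.1)   # index 1 = most important
chosen_set <- c(3, 5, 7, 10)                       # features ranked 3, 5, 7, 10

pick <- function(s, idx) c(best  = idx[which.max(s[idx])],
                           worst = idx[which.min(s[idx])])

pick(scores, chosen_set)       # best = 3, worst = 10
pick(log(scores), chosen_set)  # identical choices after a monotone transform
pick(scores^3, chosen_set)     # identical again: only the ranks matter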

Actually, Sawtooth has recognized this fact for some time.  In a 2009 technical report, which suggested a possible “fix” called the dual response, they admitted that “MaxDiff measures only relative desirability among the items.”  This supposed “adjustment” was in response to an article by Lynd Bacon and others pointing out that there is nothing in MaxDiff scoring to indicate whether any of the features is important enough to impact purchase behavior.  All we know is the rank ordering of the features, which we will obtain even if no feature is sufficiently important in the marketplace to change intention or behavior.  Such research has become commonplace with credit card reward features.  It is easy to imagine rank ordering a list of 10 credit card reward features, none of which would provide any incentive to apply for a new card.  It is a product design technique that creates the best product that no one will buy.  [The effort to “fix” MaxDiff continues, as you can see in the proceedings of the 2012 Sawtooth Conference.]

The R package compositions

As Karl Pearson noted some 115 years ago, the constraint that a set of variables sum to a constant value has consequences.  Simply put, if the scores for the 10 features sum to 100, then I have only nine degrees of freedom, because I can calculate the value of any one feature once I know the values of the other nine.  As Pearson showed in 1897, this linear dependency creates a spurious negative correlation among the variables.  Too often it is simply ignored, and the data are analyzed as if there were no dependency.  This is an unwise choice, as you can see from this link to The Factor Analysis of Ipsative Measures.
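
Pearson's point is easy to reproduce with a small simulation: generate independent feature scores, force each respondent's scores to sum to 100, and watch negative correlations appear out of nowhere.

# Spurious negative correlation from the sum-to-100 constraint.
set.seed(123)
raw <- matrix(rexp(1000 * 10), nrow = 1000, ncol = 10)  # 10 independent scores
mean(cor(raw)[lower.tri(cor(raw))])                     # near 0, as it should be

shares <- 100 * raw / rowSums(raw)                      # each row now sums to 100
mean(cor(shares)[lower.tri(cor(shares))])               # about -1/9: pure artifact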

In the social sciences we call this type of data ipsative.  In geology it is called compositional data (e.g., the percentage contributions of the basic minerals in a rock sum to 100%).  R has a package called compositions that provides a comprehensive treatment of such data.  However, please be advised that the analysis of ipsative or compositional data can be quite demanding, even for those familiar with simplex geometry.  Still, it is an area that has been studied recently by Michael Greenacre (Biplots of Compositional Data) and by Anna Brown (Item Response Modeling of Forced-Choice Questionnaires).
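
As a minimal sketch of how one might start with the compositions package, using the simulated shares from the Pearson example above: close each row to a set of proportions and apply a centered log-ratio (clr) transform before attempting any correlation or factor analysis.

# install.packages("compositions")  # if not already installed
library(compositions)
x <- acomp(shares)           # treat each row as a composition (closed to sum 1)
z <- clr(x)                  # centered log-ratio: log(share / geometric mean)
round(unclass(z)[1:3, ], 2)  # transformed scores, free of the sum constraint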

Forced choice or ranking is appealing because it requires respondents to make trade-offs.  This is useful when we believe that respondents are simply rating everything high or low because they are reluctant to tell us everything they know.  However, infrequent users do tend to rate everything as less important because they do not use the product that often and most of the features are not important to them.  On the other hand, heavy users find lots of features to be important since they use the product all the time and for lots of different purposes.

Finally, we need to remember that these importance measures are self-reports, and self-reports do not have a good track record.  Respondents often do not know what is important to them.  For example, how much do people know about what contributes to their purchase of wine?  Can they tell us whether the label on the wine bottle is important?  Mueller and Lockshin compared Best-Worst Scaling (another name for MaxDiff) with a choice modeling task.  MaxDiff said that the wine label was not important, yet the label had a strong impact on which wine was selected in the choice study.  We should never forget the very real limitations of self-stated importance.
