
Moving Beyond CTR: Better Recommendations Through Human Evaluation


Imagine you’re building a recommendation algorithm for your new online site. How do you measure its quality, to make sure that it’s sending users relevant and personalized content? Click-through rate may be your initial hope…but after a bit of thought, it’s not clear that it’s the best metric after all.

Take Google’s search engine. In many cases, improving the quality of search results will decrease CTR! For example, the ideal scenario for queries like When was Barack Obama born? is that users never have to click, since the question should be answered on the page itself.

Or take Twitter, who one day might want to recommend you interesting tweets. Metrics like CTR, or even number of favorites and retweets, will probably optimize for showing quick one-liners and pictures of funny cats. But is a Reddit-like site what Twitter really wants to be? Twitter, for many people, started out as a news site, so users may prefer seeing links to deeper and more interesting content, even if they’re less likely to click on each individual suggestion overall.

Or take eBay, who wants to help you find the products you want to buy. Is CTR a good measure? Perhaps not: more clicks may be an indication that you’re having trouble finding what you’re looking for. What about revenue? That might not be ideal either: from the user perspective, you want to make your purchases at the lowest possible price, and by optimizing for revenue, eBay may be turning you towards more expensive products that make you a less frequent customer in the long run.

And so on.

So on many online sites, it’s unclear how to measure the quality of personalization and recommendations using metrics like CTR, or revenue, or dwell time, or whatever. What’s an engineer to do?

Well, consider the fact that many of these are relevance algorithms. Google wants to show you relevant search results. Twitter wants to show you relevant tweets and ads. Netflix wants to recommend you relevant movies. LinkedIn wants to find you relevant people to follow. So why do we so rarely try to measure the relevance of our models directly?

I’m a big fan of man-in-the-machine techniques, so to get around this problem, I’m going to talk about a human evaluation approach to measuring the performance of personalization and discovery products. In particular, I’ll use the example of related book suggestions on Amazon as I walk through the rest of this post.

Amazon, and Moving Beyond Log-Based Metrics

(Let’s keep motivating why log-based metrics are often imperfect measures of relevance and quality, since this is an important but subtle point.)

So take Amazon’s Customers Who Bought This Item Also Bought feature, which tries to show you related books.

To measure its effectiveness, the standard approach is to run a live experiment and measure the change in metrics like revenue or CTR.

But imagine we suddenly replace all of Amazon’s book suggestions with pornography. What’s going to happen? CTR’s likely to shoot through the roof!

Or suppose that we replace all of Amazon’s related books with shinier, more expensive items. Again, CTR and revenue are likely to increase, as the flashier content draws eyeballs. But is this anything but a short-term boost? Perhaps the change decreases total sales in the long run, as customers start to find Amazon too expensive for their tastes and move to other marketplaces.

Scenarios like these are the machine learning analogue of turning ads into blinking marquees. While they might increase clicks and views initially, they’re probably not optimizing user happiness or site quality for the future. So how can we avoid them, and ensure that the quality of our suggestions remains consistently high? This is a related books algorithm, after all – so why, by sticking to a live experiment and metrics like CTR, do we never actually inspect the relatedness of our recommendations?

Human Evaluation

Solution: let’s inject humans into the process. Computers can’t measure relatedness (if they could, we’d be done), but people of course can.

For example, in the screenshot below, I asked a worker (on a crowdsourcing platform I built myself) to rate the first three Customers Who Bought This Also Bought suggestions for a book on barnesandnoble.com.

[Screenshot: the worker’s ratings and explanations for the three suggested books.]

The recommendations are decent, but already we see a couple ways to improve them:

CTR and revenue certainly wouldn’t give us this level of information, and it’s not clear that they could even tell us our algorithms are producing irrelevant suggestions in the first place. Nowhere does the related scroll panel make it clear that two of the books are part of a series, so the CTR on those two books would be just as high as if they were indeed the series introductions. And if revenue is low, it’s not clear whether it’s because the suggestions are bad or because, separately, our pricing algorithms need improvement.

So in general, here’s one way to understand the quality of a Do This Next algorithm:

  1. Take a bunch of items (e.g., books if we’re Amazon or Barnes & Noble), and generate their related brethren.
  2. Send these <item, related items> pairs off to a bunch of judges (say, by using a crowdsourcing platform like Mechanical Turk), and ask them to rate their relevance.
  3. Analyze the data that comes back (a minimal sketch of this step follows the list).
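To make step 3 slightly more concrete, here’s a minimal Python sketch of aggregating the returned judgments. The (item, suggestion, verdict) format and the verdict labels are assumptions for illustration, not the actual output format of Mechanical Turk or any other platform.

```python
from collections import Counter

# Hypothetical judgments from the crowdsourcing task.
# Each tuple: (source item, suggested item, rater's verdict).
judgments = [
    ("Book A", "Suggestion 1", "would definitely buy"),
    ("Book A", "Suggestion 2", "might buy"),
    ("Book A", "Suggestion 3", "not interested"),
    ("Book B", "Suggestion 1", "not interested"),
]

def rating_breakdown(judgments):
    """Share of each verdict across all judged suggestions."""
    counts = Counter(verdict for _, _, verdict in judgments)
    total = sum(counts.values())
    return {verdict: count / total for verdict, count in counts.items()}

print(rating_breakdown(judgments))
# e.g. {'would definitely buy': 0.25, 'might buy': 0.25, 'not interested': 0.5}
```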

Algorithmic Relevance on Amazon, Barnes & Noble, and Google

Let’s make this process concrete. Pretend I’m a newly minted VP of What Customers Also Buy at Amazon, and I want to understand the product’s strengths and flaws.

I started by going to Mechanical Turk and asking a couple hundred workers to take a book they enjoyed in the past year, and to find it on Amazon. They’d then take the first three related book suggestions from a different author, rate them on the following scale, and explain their ratings.

(Note: I usually prefer a three-point or five-point Likert scale with a “neutral” option, but I was feeling a little wild.)

For example, here’s how a rater reviewed the related books for Anne Rice’s The Wolves of Midwinter.

So how good are Amazon’s recommendations? Quite good, in fact: 47% of raters said they’d definitely buy the first related book, another 29% said it was good enough that they might buy it, and only 24% of raters disliked the suggestion.

The second and third book suggestions, while a bit worse, seem to perform pretty well too: around 65% of raters rated them positively.

What can we learn from the bad ratings? I ran a follow-up task that asked workers to categorize the bad related books, and here’s the breakdown.

So to improve its recommendations, Amazon could try improving its topic models, adding age-based features to its books, distinguishing between textbooks and novels, and investing in series detectors. (Of course, for all I know, they do all this already.)

Competitive Analysis

We now have a general grasp of Amazon’s related book suggestions and how they could be improved, and just like we could quote a metric like a CTR of 6.2% or whatnot, we can also now quote a relevance score of 0.62 (or whatever). So let’s turn to the question of how Amazon compares to other online booksellers like Barnes & Noble and Google Play.
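As a minimal sketch of how such a relevance score might be computed: map each verdict to a number and average. The verdict-to-number mapping below is an arbitrary assumption; what matters is applying the same mapping consistently across sites and experiments.

```python
# Assumed mapping from verdicts to numbers; the exact values are arbitrary.
VERDICT_SCORES = {"would definitely buy": 1.0, "might buy": 0.5, "not interested": 0.0}

def relevance_score(judgments):
    """Mean relevance over all judged (item, suggestion, verdict) tuples."""
    scores = [VERDICT_SCORES[verdict] for _, _, verdict in judgments]
    return sum(scores) / len(scores)

# Computed separately on the judgments gathered for each site, this yields
# directly comparable numbers (e.g. 0.62 for one site vs. 0.35 for another).
amazon_judgments = [("Book A", "Book X", "would definitely buy"),
                    ("Book A", "Book Y", "not interested")]
print(relevance_score(amazon_judgments))  # 0.5
```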

I took the same task I used above, but this time asked raters to review the related suggestions on those two sites as well.

In short,

Why are the Play Store’s suggestions so bad? Let’s look at a couple examples.

Here’s the Play Store page for John Green’s The Fault in Our Stars, a critics-loved-it book about cancer and romance (and now also a movie).

Two of the suggestions are completely random: a poorly-rated Excel manual and a poorly-reviewed textbook on sexual health. The others are completely unrelated cowboy books, by a different John Green.

Here’s the page for The Strain. In this case, all the suggestions are in a different language! And there are only four of them.

Once again asking raters to categorize all of the Play Store’s bad recommendations…

So despite Google’s state-of-the-art machine learning elsewhere, its Play Store suggestions couldn’t really get much worse.

Side-by-Sides

Let’s step back a bit. So far I’ve been focusing on an absolute judgments paradigm, in which judges rate how relevant a book is to the original on an absolute scale. This model is great for understanding the overall quality of Amazon’s related book algorithm.

In many cases, though, we want to use human evaluation to compare experiments. For example, it’s common at many companies to:

  1. Launch a “human evaluation A/B test” before a live experiment, both to avoid accidentally sending incredibly buggy experiments to users and to avoid the long wait that live tests require.
  2. Use a human-generated relevance score as a supplement to live experiment metrics when making launch decisions.

For these kinds of tasks, what’s preferable is often a side-by-side model, wherein judges are given two items and asked which one is better. After all, comparative judgments are often much easier to make than absolute ones, and we might want to detect differences at a finer level than what’s available on an absolute scale.

The idea is that we can assign a score to each rating (negative, say, if the rater prefers the control item; positive if the rater prefers the experiment), and aggregate these into an overall score for the side-by-side. Then, in much the same way that a drop in CTR may block an experiment launch, a negative human evaluation score should also be cause for concern.
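Here’s a minimal sketch of that aggregation, assuming a simple +1/0/-1 scoring; the labels and the scoring itself are illustrative assumptions, not a standard.

```python
# Hypothetical side-by-side judgments: for each item, which of the two
# suggestion sets (control vs. experiment) the rater preferred.
preferences = ["experiment", "control", "experiment", "no preference", "experiment"]

# Assumed scoring: +1 favors the experiment, -1 favors the control.
SCORE = {"experiment": 1, "control": -1, "no preference": 0}

def side_by_side_score(preferences):
    """Mean preference; positive means raters lean toward the experiment."""
    return sum(SCORE[p] for p in preferences) / len(preferences)

print(side_by_side_score(preferences))  # 0.4: raters mildly prefer the experiment
```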

Unfortunately, I don’t have an easy way to generate data for a side-by-side (though I could perform a side-by-side on Amazon vs. Barnes & Noble), so I’ll omit an example, but the idea should be pretty clear.

Personalization

Here’s another subtlety. In my examples above, I asked raters to pick a starting book themselves (one that they read and loved in the past year), and then rate whether they personally would want to read the related suggestions.

Another approach is to pick the starting books for the raters, and then have them rate the related suggestions more objectively, by trying to put themselves in the shoes of someone who’d be interested in the starting item.

Which approach is better? As you can probably guess, there’s no clear answer – it depends on the task and goals at hand.

Pros of the first approach:

Pros of the second approach:

Recap

Let’s review what I’ve discussed so far.

  1. Online, log-based metrics like CTR and revenue aren’t necessarily the best measures of a discovery algorithm’s quality. Items with high CTR, for example, may simply be racy and flashy, not the most relevant.
  2. So instead of relying on these proxies, let’s directly measure the relevance of our recommendations by asking a pool of human raters.
  3. There are a couple of different approaches for choosing which items to judge. We can let the raters choose the items (an approach that is often necessary for personalized algorithms), or we can generate the items ourselves (often useful for more objective tasks like search; this also has the benefit of making it easier to derive popularity-weighted or uniformly-weighted metrics of relevance). We can also take an absolute judgment approach, or use side-by-sides.
  4. By analyzing the data from our evaluations, we can make better launch decisions, discover examples where our algorithms do very well or very poorly, and find patterns for improvement.

What are some of the benefits and applications?

  1. As mentioned, log-based metrics like CTR and revenue don’t always capture the signals we want, so human-generated scores of relevance (or other dimensions) are useful complements.
  2. Human evaluations can make iteration quicker and easier. An algorithm change might normally require weeks of live testing before we gather enough data to know how it affects users, but we can easily have a task judged by humans in a couple of hours.
  3. Imagine we’re an advertising company, and we choose which ads to show based on a combination of CTR and revenue. Once we gather hundreds of thousands of relevance judgments from our evaluations, we can build a relevance-based machine learning model to add to the mix, thereby injecting a more direct measure of quality into the system (see the sketch after this list).
  4. How can we decide what improvements need to be made to our models? Evaluations give us very concrete feedback and specific examples about what’s working and what’s wrong.
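To illustrate point 3, here’s a minimal sketch, assuming scikit-learn and made-up pair features, of how judged relevance could be folded back in as a model. It’s an illustration of the idea, not anyone’s production system.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical features for <item, suggested item> pairs, e.g.
# topic similarity, same-author flag, price ratio (all made up here).
X = np.array([
    [0.9, 1, 1.0],
    [0.2, 0, 2.5],
    [0.7, 0, 1.1],
    [0.1, 0, 0.4],
])
# Labels from the human evaluation: 1 = judged relevant, 0 = judged irrelevant.
y = np.array([1, 0, 1, 0])

relevance_model = LogisticRegression().fit(X, y)

# At serving time, the predicted probability of relevance becomes one more
# signal to blend with CTR- and revenue-based scores when ranking ads.
print(relevance_model.predict_proba(X)[:, 1])
```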

In the spirit of item-item similarities, here are some other posts readers of this post might also want to read.

And finally, I’ll end with a call for information. Do you run any evaluations, or use crowdsourcing platforms like Mechanical Turk or Crowdflower (whether for evaluation purposes or not)? Or do you want to? I’d love to talk to you and learn more about what you’re doing, so please feel free to send me an email!
