A/B testing pitfalls and lessons learned (exp-platform.com)
15 points by samps on Jan 31, 2011 | hide | past | favorite | 6 comments


The advice is mixed in quality.

The worst piece of advice is to only use one metric, which is some complicated mix of other metrics. The basic reason they want this is to give a clear go/no go signal that everyone agrees on. Perhaps if you have to deal with the politics of a larger organization, that's a good idea. But if you're a small company, the extra detail you get about how your product is used from tracking multiple metrics is very good for helping clarify what you're trying to do, and how you want to do it.

Furthermore, creating a complex weighted measure just moves the argument elsewhere: now you have to fight over the weights instead of the metrics. And when you're still trying to figure out how your site is actually performing, you don't have the context to know what measure to use. You also won't be able to use the obvious chi-square test (or its better relative, the g-test). There is no need to over-complicate the statistics.
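For a plain two-variant conversion test, the g-test is just a few lines of arithmetic (a minimal sketch; the visitor and conversion counts are made up for illustration, and it assumes every cell of the table is nonzero):

```python
from math import erfc, log, sqrt

def g_test_2x2(conv_a, total_a, conv_b, total_b):
    """G-test of independence on a 2x2 conversions table.

    Returns (G, p). With 1 degree of freedom the p-value is
    erfc(sqrt(G / 2)). Assumes every cell is nonzero.
    """
    observed = [conv_a, total_a - conv_a, conv_b, total_b - conv_b]
    conv = conv_a + conv_b
    total = total_a + total_b
    expected = [
        conv * total_a / total, (total - conv) * total_a / total,
        conv * total_b / total, (total - conv) * total_b / total,
    ]
    g = 2 * sum(o * log(o / e) for o, e in zip(observed, expected))
    p = erfc(sqrt(g / 2))
    return g, p

# 50/1000 conversions on A vs 70/1000 on B:
g, p = g_test_2x2(50, 1000, 70, 1000)  # p comes out roughly 0.06
```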

The idea of having a hashing function to do test assignment is one that I had not considered. I've always suggested the obvious rand() at assignment time approach, which accomplishes the same thing but with more overhead at run time. I'd caution people who try the hashing approach to use a standard library, because it would be really, really easy to have the website think that assignment is done one way while your analysis assumes that it is done in another.

The minimum duration point is interesting...and somewhat useless. When I was preparing my presentation a few years ago I found out that, even if you know exactly how much better A is than B, you can't predict to within an order of magnitude how quickly your experiment will show it. My attitude is the much simpler, "The test takes however long it will take, and you can't really know how long that will be in advance." After you've done a few tests, people will have a good enough idea for a back of the envelope estimate.
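For that back-of-the-envelope step, one common fixed-horizon rule of thumb (not from the comment; this is Lehr's approximation for roughly 80% power at a 5% two-sided significance level, and it still requires you to guess the lift you hope to detect):

```python
def lehr_sample_size(baseline_rate: float, absolute_lift: float) -> int:
    """Lehr's rule of thumb: n per arm ~= 16 * p * (1 - p) / delta^2,
    for ~80% power at alpha = 0.05 (two-sided)."""
    p = baseline_rate
    return round(16 * p * (1 - p) / absolute_lift ** 2)

# Detecting a 5% -> 6% conversion lift needs on the order of
# eight thousand visitors per arm:
n = lehr_sample_size(0.05, 0.01)
```

The guess at the lift is exactly the part you can't know in advance, which is why actual run times vary so widely around estimates like this.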

The other advice seemed good, and mostly was obvious to me. But I have more experience with A/B testing than most do.


Hashing has one major advantage over random assignment: reproducibility. If you have some way of IDing Mary Smith, you can consistently serve her the same tests on all machines, and expose the tests she is participating in to bug investigators, without having to actually store a users => chosen test alternatives map anywhere. Those can get fairly sizable and the access patterns suck.

A/Bingo does it by taking MD5(user_id . test_name). You have to store user_id, but you're doing that anyhow.
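That recipe is only a couple of lines (a sketch of the idea in Python rather than A/Bingo's actual Ruby; the function and argument names are mine). It also illustrates the grandparent's caution about using a standard library: Python's built-in hash(), for instance, is salted per process, so the web tier and the analysis job would silently disagree, whereas an MD5 digest is stable everywhere:

```python
import hashlib

def alternative_for(user_id: str, test_name: str, alternatives: list):
    """Reproducible assignment: MD5(user_id . test_name) mod the
    number of alternatives. No user => alternative map is needed;
    any machine that knows user_id can recompute the choice."""
    digest = hashlib.md5((user_id + test_name).encode("utf-8")).hexdigest()
    return alternatives[int(digest, 16) % len(alternatives)]

# Mary Smith sees the same alternative on every machine, every visit:
choice = alternative_for("mary.smith", "signup_button", ["red", "green"])
```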


If you do random assignment, and save the random assignments somewhere, then you can also get reproducibility. If you do random assignment, and don't save the random assignments anywhere, then you have no idea how many people were in your test versions. Which is not a good idea. (You can estimate this data. I've done it. But doing it properly is surprisingly tricky. It is very, very easy to do it wrong.)

There are several benefits to this approach.

The first benefit is that if you're testing a particular page, you can easily make your test only include people who have hit that page. This will cause results to converge more quickly than if you don't know which people on your site actually hit that page.

The second, and sometimes critical, benefit is that you know exactly when someone entered the test. Multiple times when testing things with a longer sales cycle I've encountered the situation where a particular test version causes the sales cycle to become compressed, but may or may not provide a long-term improvement in conversion. Access to data about when people entered your test allows you to examine A/B test results only for people who have been in the test long enough to be likely to have completed either version.
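That filter is easy to express once entry timestamps are stored (a sketch; the record layout and the 30-day window are made-up assumptions, not from the comment):

```python
from datetime import datetime, timedelta

def mature_cohort(assignments, now, min_age=timedelta(days=30)):
    """Keep only users who entered the test long enough ago to have
    plausibly completed the sales cycle in either version."""
    return [a for a in assignments if now - a["entered_at"] >= min_age]

now = datetime(2011, 1, 31)
assignments = [
    {"user": "a", "version": "A", "entered_at": datetime(2010, 12, 1)},
    {"user": "b", "version": "B", "entered_at": datetime(2011, 1, 25)},
]
mature = mature_cohort(assignments, now)  # only user "a" qualifies
```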

A third benefit is that if you're testing multiple versions, then you can just continue the test and drop poorly performing versions as you prove that they are suboptimal.

The downside, as you say, is that the map of who is in what version can get very large. In my experience, though, the access patterns are not that bad. Particularly not if you are already using sessions, and can cache that information in the current session, so that most page hits don't have to fetch the A/B test version. Furthermore this is not data that you need to join to anything else on your live website. Therefore it is a perfect candidate to move somewhere like Redis.
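The caching pattern described above might look like this (a sketch with plain dicts standing in for both the session and the shared Redis store; none of the names come from a real library):

```python
class AssignmentCache:
    """Look up a user's test version from the session first, falling
    back to the shared store (e.g. Redis) only on a miss, so most
    page hits never touch the store."""

    def __init__(self, store):
        self.store = store   # shared map: (user_id, test) -> version
        self.lookups = 0     # count of shared-store fetches

    def version(self, session, user_id, test):
        key = f"ab:{test}"
        if key not in session:
            self.lookups += 1
            session[key] = self.store[(user_id, test)]
        return session[key]

store = {("mary", "signup"): "B"}
cache = AssignmentCache(store)
session = {}  # per-user session dict
cache.version(session, "mary", "signup")
cache.version(session, "mary", "signup")
# cache.lookups is 1: the second hit was served from the session
```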


What were you testing, if I may ask?


I've tested a lot of things. See http://elem.com/~btilly/effective-ab-testing/ for a tutorial I taught on the topic.


The MS site (http://exp-platform.com/) has more of Kohavi's papers on how MS uses experimentation across their systems. Worth a perusal if you want more depth than this overview document.



