It might not be a big deal but the difference isn't significant at 95% confidenc...

tansey · on May 30, 2010

Out of curiosity, why?

To me, there is a difference between theoretical significance and practical significance. A 90% likelihood that the second version is actually better than the first version is enough for me to switch.

What is the downside of switching? About 10% of the time you'll be making a change that is no better than the old version. Unless you REALLY love green buttons, I think it's worth the risk. :)

paraschopra · on May 30, 2010

There isn't any downside. But if the test costs are negligible and you can afford to run the test for a week longer, it is always worth doing so. I have seen too many tests where the confidence level, after touching >95%, fell back to 70% or so once the test was extended.

An even better way is to run a follow-up A/B test in which both variations are red (in effect an A/A test). If you see enough variance in that test, then I don't think you should take the results seriously.
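To make the identical-variations idea concrete, here is a rough simulation sketch. Everything in it is an assumption for illustration: the 5% conversion rate, the 1,000 visitors per arm, the pooled two-proportion z-test, and the function names. It runs many tests where both arms are truly identical and counts how often the result still looks "significant" at 95%.

```python
import math
import random

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Pooled z statistic for the difference between two conversion rates."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0  # no conversions (or all conversions) in both arms
    return (conv_b / n_b - conv_a / n_a) / se

def simulate_aa_tests(rate=0.05, visitors=1000, runs=500, z_crit=1.96, seed=0):
    """Fraction of identical-arm tests that still cross the 95% threshold."""
    rng = random.Random(seed)
    false_alarms = 0
    for _ in range(runs):
        # Both arms share the same true conversion rate (hypothetical value).
        conv_a = sum(rng.random() < rate for _ in range(visitors))
        conv_b = sum(rng.random() < rate for _ in range(visitors))
        z = two_proportion_z(conv_a, visitors, conv_b, visitors)
        if abs(z) > z_crit:
            false_alarms += 1
    return false_alarms / runs

if __name__ == "__main__":
    print(f"share of A/A tests flagged significant: {simulate_aa_tests():.3f}")
```

If the test is evaluated only once, at the end, roughly 1 in 20 identical-arm tests should be flagged. A real A/A test that shows far more disagreement than that suggests a problem with the setup or the sample size.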

When you are testing it is always better to try proving a hypothesis wrong rather than trying to prove it right.

EDIT: clarified some parts.

jules · on May 30, 2010

Perhaps multi-armed bandit algorithms can help here. They automatically balance testing which version is better against using the best version as much as possible.

The multi-armed bandit problem imagines a gambler with a number of levers at his disposal. He chooses which lever to pull and then receives a reward. In this case lever 1 is "show page version A" and lever 2 is "show page version B". The algorithms balance discovering which lever is best against pulling the best lever.

Here's an example of a very simple algorithm. Record the average profit for page A and page B in two variables. Now with probability p (for example p=95%) choose the page with the highest average profit so far. With probability 1-p pick one at random. A more advanced algorithm could vary p over time so that it starts at 0% and increases towards 100%.
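A minimal sketch of the simple algorithm described above (often called epsilon-greedy, with epsilon = 1-p). The class name, the `simulate` helper, and the conversion rates are made up for illustration, and profit is simplified to 1 per conversion.

```python
import random

class GreedyPageChooser:
    """With probability p show the best page so far, otherwise a random one."""

    def __init__(self, pages, p=0.95, seed=None):
        self.pages = list(pages)
        self.p = p
        self.rng = random.Random(seed)
        self.total = {page: 0.0 for page in self.pages}  # summed profit
        self.shows = {page: 0 for page in self.pages}    # times shown

    def average(self, page):
        # Pages that were never shown count as 0 average profit.
        return self.total[page] / self.shows[page] if self.shows[page] else 0.0

    def choose(self):
        if self.rng.random() < self.p:
            return max(self.pages, key=self.average)  # exploit
        return self.rng.choice(self.pages)            # explore

    def record(self, page, profit):
        self.total[page] += profit
        self.shows[page] += 1

def simulate(true_rates, visitors=5000, p=0.95, seed=0):
    """Run the chooser against hypothetical true conversion rates."""
    chooser = GreedyPageChooser(true_rates.keys(), p=p, seed=seed)
    world = random.Random(seed + 1)  # separate RNG for visitor behaviour
    for _ in range(visitors):
        page = chooser.choose()
        profit = 1.0 if world.random() < true_rates[page] else 0.0
        chooser.record(page, profit)
    return chooser.shows

if __name__ == "__main__":
    print(simulate({"A": 0.05, "B": 0.10}))
```

Varying p over time, as suggested, would mean replacing the fixed `self.p` with a value that grows with the total number of pages shown.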

http://en.wikipedia.org/wiki/Multi-armed_bandit

nagrom · on May 30, 2010

I suspect that it's because 95% and 99% are familiar numbers from a statistics class, corresponding to roughly 2 and 2.6 standard deviations in a normal pdf respectively (3 standard deviations covers about 99.7%).
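For anyone who wants to check the mapping between "sigmas" and confidence levels, here is a quick sketch using only Python's standard library (the helper names are mine, and it assumes a two-sided interval on a normal distribution):

```python
import math
from statistics import NormalDist

def coverage_within(k):
    """Probability that a normal variable falls within k standard deviations."""
    return math.erf(k / math.sqrt(2))

def sigmas_for(confidence):
    """Number of standard deviations for a two-sided confidence level."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)

if __name__ == "__main__":
    for k in (1, 2, 3):
        print(f"{k} sigma covers {coverage_within(k):.4f}")
    for level in (0.90, 0.95, 0.99):
        print(f"{level:.0%} needs {sigmas_for(level):.3f} sigma")
```

Two sigma covers about 95.4% and three sigma about 99.7%, while exactly 95% and 99% sit at about 1.96 and 2.58 sigma.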

In statistical tests, most answers have something to do with standard deviations of the normal distribution, whether it be what 'significant' results are, the error bars on a histogram or the choice of error range on a maximum likelihood fit that has no obvious correlation with normal distributions. (All of these are prevalent in the high energy physics community.)

Statistics are very often used to support 'gut' instincts like that, without the person using them necessarily understanding the underlying meaning of the mathematics. Happily, it is often the case that approximating everything to a normal and using 'sigmas' is enough to get by.

(May not be the case here, but I deal with it every day at work... </rant>)