Hi, I'm one of the creators of http://www.abtests.com. The issue of statistical significance has come up over and over, so I'll try to explain our view of it.
We ask people to input their raw data...both trials and conversions. If they do this honestly (anybody can fake data about anything) then in our view the results speak for themselves. We've had folks upload data that was obviously not statistically significant, and we've had people write blog posts denouncing those results. We've also had folks upload test data that was statistically significant and people say they're learning a lot.
So we've had both solid and suspect data uploaded to the site with good discussion around it. This is exactly what we hoped for...I think in the future as more tests get uploaded the wheat will be separated from the chaff, so to speak, and those tests with significant data will get lots more attention than those that don't. In fact, we're already seeing this in the traffic logs.
And, as several folks have mentioned, many tools do the hard stats math for you, telling you when your data is statistically significant. This helps people know when they can be confident in sharing their data with others.
Doing the math here. A/B tests with conversions are modeled as binomial variables, so the standard error of the conversion rate is sqrt(p(1-p)/n), where p is the conversion rate and n is the number of hits (p(1-p) is the variance of a single Bernoulli trial). Calculating the standard error for both of your versions: sqrt(0.002*(1-0.002)/2834) = 0.0008 for one, and for the other the SE is 0.0017. Since there is a large number of trials, you can model the difference of the two binomial proportions as a normal distribution whose standard deviation is sqrt(se_1^2 + se_2^2) = 0.0019.
Significance is then checked with a one-tailed z-test (we are testing whether the difference between the two rates is greater than zero). The z-score here is (p_1 - p_2)/std, that is (0.008 - 0.002)/0.0019 = 3.16, which is well above the critical value of 1.65 (which corresponds to 95% confidence for a one-tailed test).
So, the difference is indeed statistically significant by this test. A note of caution: the usual rule of thumb says you shouldn't approximate a binomial with a normal distribution until you have at least 10 successes and 10 failures, which is not the case here given how few conversions each version has.
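To make that arithmetic easy to reproduce, here's a minimal Python sketch of the same two-proportion z-test. The A figures (p = 0.002, n = 2834) are the ones quoted above; the B sample size isn't given, so n2 = 2750 is an assumption backed out from the quoted 0.0017 standard error.

    from math import sqrt

    def one_tailed_z(p1, n1, p2, n2):
        """One-tailed z statistic for p2 > p1 using unpooled standard errors."""
        se1 = sqrt(p1 * (1 - p1) / n1)       # SE of version A's rate
        se2 = sqrt(p2 * (1 - p2) / n2)       # SE of version B's rate
        se_diff = sqrt(se1 ** 2 + se2 ** 2)  # SE of the difference of the rates
        return (p2 - p1) / se_diff

    # p1/n1 are quoted in the comment above; n2 is an assumption
    # implied by the quoted 0.0017 standard error for version B.
    z = one_tailed_z(p1=0.002, n1=2834, p2=0.008, n2=2750)
    print(round(z, 2))  # roughly 3.17, in line with the ~3.16 worked out above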
See my reply lower in the thread - I worked out the numbers using Bayesian inference to find the exact probability that B is better than A, subject to a number of assumptions. The benefit of this approach is that it's exact so you don't need a certain number of samples to properly approximate a normal distribution. The answer is that B is almost certainly better than A. Here's the calculation I plugged into Wolfram Alpha:
I'm extremely rusty on my statistics, and I upvoted this because I find A/B tests interesting, but... are these numbers statistically significant? For the population of internet users, are they actually practically significant? It just seems like the sample sizes and differences aren't really big enough to draw solid conclusions from. It doesn't say how long each test lasted for, either--what if the second test was done during peak hours?
This is exactly why I don't read other people's A/B experiment results. I haven't seen a single A/B experiment that listed its statistical significance (and how that was calculated). I fear that bad data is worse than no data at all.
Google Website Optimizer will give you a plus/minus range for your estimated conversion rate, which at least helps you estimate the significance/confidence.
For example, one of the pages in one of my current tests shows up as "Est. Conversion Rate: 17.0% +/- 1.4%".
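In case it helps anyone interpret those +/- figures: they're typically a normal-approximation interval around the observed rate. I don't know GWO's exact formula, so this is just the textbook version, with made-up visitor counts:

    from math import sqrt

    def margin_of_error(p, n, z=1.96):
        """Normal-approximation margin for an observed rate p over n trials
        (z = 1.96 for a 95% interval); GWO's exact method may differ."""
        return z * sqrt(p * (1 - p) / n)

    # Hypothetical: a 17% observed rate over ~2,700 visitors gives
    # roughly the +/- 1.4% quoted above.
    print(round(margin_of_error(0.17, 2700), 3))  # ~0.014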
2 vs 6 orders doesn't seem like enough. There are people in affiliate programs who don't change anything and yet have 20 orders one day and 0 the next. It looks like something is broken, but it's back to normal the next day - just a statistical fluctuation. I really wouldn't consider 2 vs 6 orders a significant sample.
Even if this has a G-test significance of 99.86%, that doesn't mean it is valid. You need AT LEAST 10 results for the formulas to work correctly. Also, claiming a 300% improvement in conversion is madness. I'm not even fully statistically convinced that the challenger is at ALL better, let alone certain that it's 300% better.
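To illustrate the small-counts point, here's a quick G-test sketch (scipy's chi2_contingency with lambda_="log-likelihood") on a hypothetical 2x2 table. The counts are made up for illustration, not the poster's data; with only single-digit conversions per arm, the asymptotic p-value shouldn't be trusted:

    import numpy as np
    from scipy.stats import chi2_contingency

    # Hypothetical conversions vs. non-conversions; not the data from this thread.
    table = np.array([[2, 998],    # version A: 2 conversions out of 1000
                      [6, 994]])   # version B: 6 conversions out of 1000

    # lambda_="log-likelihood" makes this a G-test rather than a chi-square test.
    g, p_value, dof, expected = chi2_contingency(table, lambda_="log-likelihood")
    print(g, p_value)
    # Expected conversions here are only ~4 per arm, well under the
    # 10-results rule of thumb mentioned above, so the p-value is shaky.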
If you model the two alternatives as Bernoulli processes with unknown success rates, and assume that the only difference between the two is what is specified on the page, and that they don't interact, and you assume a uniform prior on both parameters, B's conversion rate is higher than A's with probability 0.999572.
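For anyone who wants to reproduce that kind of figure, here's a short Monte Carlo sketch of the same idea: under uniform Beta(1,1) priors, each posterior is Beta(conversions + 1, non-conversions + 1), and P(B > A) is estimated by sampling both posteriors. The counts below are back-calculated from the 0.002/2834 and 0.008 rates quoted earlier; they're assumptions, not the poster's raw numbers, so the printed probability won't exactly match 0.999572.

    import numpy as np

    rng = np.random.default_rng(0)

    def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=1_000_000):
        """Estimate P(p_B > p_A) under independent uniform Beta(1,1) priors
        by sampling from the two Beta posteriors."""
        post_a = rng.beta(conv_a + 1, n_a - conv_a + 1, draws)
        post_b = rng.beta(conv_b + 1, n_b - conv_b + 1, draws)
        return (post_b > post_a).mean()

    # Assumed counts consistent with the rates quoted above, not the raw data.
    print(prob_b_beats_a(conv_a=6, n_a=2834, conv_b=22, n_b=2750))

Sampling keeps the sketch short; the exact answer comes from integrating the product of the two Beta densities over the region where p_B > p_A, which is what produces a number like 0.999572.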