That thread is seriously wrong.

Here is an excerpt from an email discussion I had recently that touched on http://www.evanmiller.org/how-not-to-run-an-ab-test.html.

Evan Miller has a point, but not as good a one as he thinks.

It is true that multiple peeks mean that eventually any test will find significance at any level you want. However, in A/B tests the peeks are not independent: each peek looks at all of the data from the previous peek plus a little more. This greatly weakens the effect he is talking about.

Section 7 of my presentation, starting at http://elem.com/~btilly/effective-ab-testing/#slide59, is about the question of how long it takes for a test to complete. For that I ran numerical experiments with constant peeking: literally every time you add one to A and one to B, you peek again. You can see graphs of how many errors there were, and how long it took to get an answer.
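For anyone who wants to try this kind of experiment themselves, here is a minimal sketch in Python. It is not the code behind the slides. It assumes a plain two-proportion z-test as the significance check, and the function names, the default 10% conversion rate, and the cutoffs are all made up for illustration.

```python
import math
import random


def z_statistic(conv_a, conv_b, n):
    """Two-proportion z statistic for two arms of n visitors each."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled == 0.0 or pooled == 1.0:
        return 0.0  # no variation yet, so no evidence either way
    std_err = math.sqrt(2 * pooled * (1 - pooled) / n)
    return (conv_a / n - conv_b / n) / std_err


def run_test(rate_a, rate_b, max_pairs, z_cutoff, min_pairs, rng):
    """Add one visitor to A and one to B, then peek; repeat.

    Peeks before min_pairs are skipped. Returns (stopped_early, pairs_used).
    """
    conv_a = conv_b = 0
    for n in range(1, max_pairs + 1):
        conv_a += rng.random() < rate_a
        conv_b += rng.random() < rate_b
        if n >= min_pairs and abs(z_statistic(conv_a, conv_b, n)) >= z_cutoff:
            return True, n
    return False, max_pairs


def false_positive_rate(trials, max_pairs, z_cutoff, min_pairs=1, rate=0.1, seed=0):
    """Fraction of A/A tests (no real difference) that declare a winner."""
    rng = random.Random(seed)
    hits = sum(
        run_test(rate, rate, max_pairs, z_cutoff, min_pairs, rng)[0]
        for _ in range(trials)
    )
    return hits / trials


def stopping_times(trials, rate_a, rate_b, max_pairs, z_cutoff, seed=0):
    """Sorted number of pairs needed by the tests that reached significance."""
    rng = random.Random(seed)
    results = (
        run_test(rate_a, rate_b, max_pairs, z_cutoff, 1, rng)
        for _ in range(trials)
    )
    return sorted(n for stopped, n in results if stopped)
```

Calling `false_positive_rate(300, 300, 1.96)` peeks after every pair, while `false_positive_rate(300, 300, 1.96, min_pairs=300)` looks only once at the end, so comparing the two shows how much constant peeking costs under these assumptions. Raising `z_cutoff` or `min_pairs` is one way to see how the error rate responds, and `stopping_times` with two different rates gives a feel for how widely the time to an answer varies even when the true improvement is known.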

Here are key points:

- Be suspicious of tests that end quickly. Run them a bit longer on general principle. (In general I'd call 500 people a very small test.)

- Nobody can predict how long a test will take. Even if you know the actual improvement, you still can't predict time to within an order of magnitude.

- If a test has been running for a long time, you know the true difference is small, so there is no harm in accepting whatever answer it gives.