Here is an excerpt from an email discussion I had recently that touched on http://www.evanmiller.org/how-not-to-run-an-ab-test.html.
Evan Miller has a point, but not as good a one as he thinks.
It is true that multiple peeks mean that eventually any test will find significance at any level you want. However, in A/B tests the peeks are not independent: each peek looks at almost exactly the same data as the one before it. This greatly weakens the effect he is talking about.
Section 7 of my presentation, starting at http://elem.com/~btilly/effective-ab-testing/#slide59, is about the question of how long it takes for a test to complete. For that I ran numerical experiments with constant peeking: literally every time you add one to A and one to B, you peek again. You can see graphs of how many errors there were, and how long it takes to get an answer.
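To make the constant-peeking setup concrete, here is a minimal sketch of that kind of experiment. It is not the code behind the slides. It assumes a plain two-proportion z-test, a 10% conversion rate in both arms (so any "winner" is an error), and a cutoff of 1.96; the function names are made up for illustration. It also computes what the error rate would be if every peek were an independent test, so the two can be compared.

```python
import math
import random

def z_statistic(conv_a, conv_b, n):
    """Two-proportion z statistic with n visitors in each arm (assumed test)."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled <= 0.0 or pooled >= 1.0:
        return 0.0  # no variation yet, so no evidence either way
    std_err = math.sqrt(pooled * (1 - pooled) * 2 / n)
    return (conv_a - conv_b) / n / std_err

def peeking_false_positive_rate(rate, max_pairs, z_cutoff, trials, seed=0):
    """Fraction of A/A tests (no real difference) that ever cross z_cutoff
    when we peek after every pair of visitors."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(trials):
        conv_a = conv_b = 0
        for n in range(1, max_pairs + 1):
            # Add one visitor to A and one to B, then peek.
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            if abs(z_statistic(conv_a, conv_b, n)) >= z_cutoff:
                false_positives += 1
                break
    return false_positives / trials

def independent_peeks_rate(alpha, peeks):
    """Error rate if every peek were a fresh, independent test."""
    return 1 - (1 - alpha) ** peeks

if __name__ == "__main__":
    peeks = 1000
    simulated = peeking_false_positive_rate(
        rate=0.1, max_pairs=peeks, z_cutoff=1.96, trials=500
    )
    print(f"simulated, correlated peeks: {simulated:.2f}")
    print(f"if peeks were independent:   {independent_peeks_rate(0.05, peeks):.2f}")
```

Changing `max_pairs`, `z_cutoff`, or `rate` shows how the error rate under constant peeking moves with test length and confidence level.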
Here are key points:
- Be suspicious of tests that end quickly. Run them a bit longer on general principle. (In general I'd call 500 people a very small test.)
- Nobody can predict how long a test will take. Even if you know the actual improvement, you still can't predict the time to within an order of magnitude.
- If a test has been running for a long time, you know the true difference is small, so there is no harm in accepting whatever answer it gives.
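The first two points can be explored with a small simulation. This is a sketch under stated assumptions, not the method from the presentation: it uses a two-proportion z-test, peeks after every pair of visitors, and refuses to stop before `min_pairs` (a hypothetical parameter standing in for "run it a bit longer"). The true conversion rates are fixed and known, and the script reports how widely the stopping times vary anyway.

```python
import math
import random
import statistics

def pairs_until_decision(rate_a, rate_b, z_cutoff, min_pairs, max_pairs, rng):
    """Peek after every pair; return the number of pairs at the first
    |z| >= z_cutoff at or after min_pairs, or None if max_pairs is reached."""
    conv_a = conv_b = 0
    for n in range(1, max_pairs + 1):
        conv_a += rng.random() < rate_a
        conv_b += rng.random() < rate_b
        if n < min_pairs:
            continue  # do not trust answers from very short tests
        pooled = (conv_a + conv_b) / (2 * n)
        if pooled <= 0.0 or pooled >= 1.0:
            continue  # no variation yet, nothing to test
        z = (conv_a - conv_b) / n / math.sqrt(pooled * (1 - pooled) * 2 / n)
        if abs(z) >= z_cutoff:
            return n
    return None

def stopping_time_spread(rate_a, rate_b, trials, seed=0, **kwargs):
    """Run many tests with the same true rates.

    Returns (sorted stopping times of finished tests, count of unfinished tests).
    """
    rng = random.Random(seed)
    times = [
        pairs_until_decision(rate_a, rate_b, rng=rng, **kwargs)
        for _ in range(trials)
    ]
    finished = sorted(t for t in times if t is not None)
    return finished, len(times) - len(finished)

if __name__ == "__main__":
    finished, unfinished = stopping_time_spread(
        0.10, 0.12, trials=200, z_cutoff=2.58, min_pairs=100, max_pairs=20000
    )
    deciles = statistics.quantiles(finished, n=10)  # 9 cut points
    print(f"finished: {len(finished)}, unfinished: {unfinished}")
    print(f"10th percentile: {deciles[0]:.0f} pairs")
    print(f"90th percentile: {deciles[-1]:.0f} pairs")
```

The gap between the 10th and 90th percentiles gives a feel for how unpredictable the running time is even when the true improvement is known in advance.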