Leaked UI A/B Tests from Major Websites

londons_explore · on Sept 10, 2020

In many of these cases, the "A/B test" may have been accidental.

Software rollouts are frequently done slowly, datacenter by datacenter, and during that time some people might see one version while others see another.

From the user's perspective it looks the same as an A/B test, but the difference is that nobody was looking at the results...

SahAssar · on Sept 10, 2020

Almost any user-noticeable change (that is not a bugfix) is run as an A/B test for a few weeks in my team to verify that it does have the intended impact. I'd be surprised if there aren't teams at Google, Amazon, Netflix and similar organizations that work similarly.
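
As a rough illustration of what "verify that it does have the intended impact" can look like, here is a minimal sketch of a two-proportion z-test on conversion counts. The function name and the numbers in the usage note are made up for illustration; nothing in the comment above says which statistical test the team actually uses.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates.

    conv_a / n_a: conversions and visitors in the control arm (hypothetical inputs).
    conv_b / n_b: conversions and visitors in the variant arm.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both arms convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

For example, `two_proportion_z_test(100, 1000, 150, 1000)` compares a 10% control rate against a 15% variant rate; a small p-value suggests the difference is unlikely to be noise, assuming the sample size was fixed in advance.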

cm2187 · on Sept 10, 2020

I don’t think that in every case it is necessarily down to user rejections. I can’t believe that in the Netflix case users actually prefer to have to log in with two postbacks, one for the login, one for the password.

dchest · on Sept 10, 2020

“I can’t believe...” - this is why people run A/B tests.

dogma1138 · on Sept 10, 2020

It wouldn't surprise me if people with active Netflix accounts tried to log in on that page, which is why they changed how it works. Also, as with any A/B test, it's not always clear what user behaviour they were seeking to change or reinforce, and it's usually more than one.

For example, the "buy now" vs. "add to cart only" test on the Amazon one might have been looking at more than just how many products are sold. They might also have been trying to see if they can, say, reduce impulse buys that result in returns without lowering purchases that do not. In fact, the reason they've kept the buy now button might be that it actually reduced the return rate, as people interacted less with the site and didn't buy additional items that they returned later.

kevin_thibedeau · on Sept 10, 2020

This is a necessary pattern for increased security.

MattGaiser · on Sept 10, 2020

What’s the security difference with two instead of one?

kevin_thibedeau · on Sept 11, 2020

You can have more sophisticated algorithms like SRP that don't send password hashes.

https://en.wikipedia.org/wiki/Secure_Remote_Password_protoco...

rbinv · on Sept 10, 2020

That's actually not a login but a signup form. More people will probably enter anything at all with only one field showing. And since they've already interacted with the site, they might feel inclined to complete the process even after they find out it's multi-step. Some will of course bounce because of it, but it should be net positive overall.

MattGaiser · on Sept 10, 2020

People are probably not registering right away that they are signing up for an account.

ericpauley · on Sept 10, 2020

This appears to be the registration flow, which makes more sense.

anotheryou · on Sept 10, 2020

Would be great for a quiz to guess the result!

gwbas1c · on Sept 10, 2020

"Leaked..."

How are these leaked? Did someone hack into something?

SahAssar · on Sept 10, 2020

They probably mean "detected", as in users noticed something behaving differently between devices or after clearing cookies.

They could also be looking for possible A/B flags in cookies or localStorage.
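
A minimal sketch of what looking for A/B flags in cookies might involve: scan the cookie names in a `Cookie` header for experiment-like tokens. The token list and the function name `find_ab_flags` are assumptions for illustration; real sites name their experiment cookies however they like, so this heuristic will miss some and misfire on others.

```python
import re

# Hypothetical heuristic: a name token such as "ab", "exp" or "variant",
# delimited by the start/end of the name or by "_", "-" or ".".
SUSPECT_NAME = re.compile(
    r"(^|[_\-.])(ab|exp|experiment|variant|bucket|test)([_\-.]|$)",
    re.IGNORECASE,
)


def find_ab_flags(cookie_header):
    """Return {name: value} for cookies whose names look like experiment flags."""
    flags = {}
    for part in cookie_header.split(";"):
        name, sep, value = part.strip().partition("=")
        # Skip fragments without "=" and names that do not match the heuristic.
        if sep and SUSPECT_NAME.search(name):
            flags[name] = value
    return flags
```

Calling `find_ab_flags("session_id=abc123; ab_test_checkout=variant_b; theme=dark")` would pick out only the `ab_test_checkout` cookie. Comparing the flagged values across devices or fresh sessions is then one way to notice that different users sit in different buckets.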

valuearb · on Sept 10, 2020

Where is the data on results? I looked at an AirBnB “experiment”, they moved an action button above the fold (duh). But no details on how much more effective the move was.

I am all for A/B testing, but the devil is in the details. You can get more users tapping the purchase button by moving it to where users are more prone to tap it accidentally. That doesn’t mean you get more purchases, or that the move was a positive change.
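
To make the taps-versus-purchases point concrete, here is a small sketch with entirely made-up numbers. The function name and the figures are hypothetical; they only show how a variant can win on tap rate while losing on completed purchases.

```python
def funnel_rates(visitors, button_taps, completed_purchases):
    """Compute per-visitor tap and purchase rates for one test arm."""
    return {
        "tap_rate": button_taps / visitors,
        "purchase_rate": completed_purchases / visitors,
        # Share of taps that end in a purchase; guards against zero taps.
        "tap_to_purchase": completed_purchases / button_taps if button_taps else 0.0,
    }


# Illustrative numbers only, not data from any real experiment.
control = funnel_rates(visitors=10_000, button_taps=500, completed_purchases=300)
variant = funnel_rates(visitors=10_000, button_taps=900, completed_purchases=290)

for name, arm in (("control", control), ("variant", variant)):
    print(name, {k: round(v, 4) for k, v in arm.items()})
```

In this invented case the variant nearly doubles the tap rate, yet its purchase rate is slightly lower, which is why the metric being optimized matters as much as the test itself.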

gwern · on Sept 10, 2020

I don't think they have results. It looks like they are regularly scraping sites and looking for diffs across users. So you can say what the test was, how long it ran for roughly, and whether they kept or rejected it, but you have no way of knowing what the quantitative results are (aside from whatever inferences you can make from estimating the tested _n_ and then assuming they are using an efficient testing procedure with optional stopping / bandits plus the final choice to infer upper/lower bounds on the effect size).
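
A rough sketch of the kind of inference described above, under a much simpler assumption than optional stopping or bandits: a fixed-sample two-arm test. Given an estimated per-arm _n_ and a guessed baseline conversion rate (both assumptions an outside observer would have to supply), the normal-approximation formula below gives the smallest absolute lift such a test could reliably detect. The function name and defaults are illustrative.

```python
import math
from statistics import NormalDist


def minimum_detectable_effect(n_per_arm, baseline_rate, alpha=0.05, power=0.8):
    """Approximate absolute lift detectable by a fixed-sample two-arm test.

    Uses the normal approximation with equal arm sizes and assumes both arms
    have roughly the baseline variance. This is NOT a sequential or bandit
    analysis, only a coarse fixed-n bound.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    se = math.sqrt(2 * baseline_rate * (1 - baseline_rate) / n_per_arm)
    return (z_alpha + z_beta) * se
```

With `n_per_arm=10_000` and `baseline_rate=0.10`, this comes out to roughly 1.2 percentage points. If a change was kept after a test of about that size, an effect well below that threshold would have been hard to confirm; if it was rejected, a large positive effect is unlikely.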