These are great! Thank you Evan. Your sample size calculator is wonderful and beats the hell out of the 90s tool I've been using which cautions me that my browser must support "JavaScript" to use it :)
As an alternative to the Chi-squared calculator, people might want to check out ABBA, a tool I wrote here at Thumbtack: http://www.thumbtack.com/labs/abba/
It shares the visual component and the linkability, two great features you've nailed. It lacks the live updating and the slider, which is really cool and something I've wanted to add to ABBA for a long time. On the other hand, it supports multiple groups compared against the baseline simultaneously and incorporates a correction for multiple testing into its p-values and confidence intervals, which can be handy. It also uses different mathematics under the hood, but that's not going to be a concern for most users.
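For anyone curious what a multiple-testing correction looks like in practice, here's a minimal sketch of the simplest one, Bonferroni. (This illustrates the idea only; I'm not claiming it's the exact correction ABBA applies, and the function name is my own.)

```python
def bonferroni(p_values, alpha=0.05):
    """Bonferroni correction: inflate each p-value by the number of
    comparisons (capped at 1.0), then test against the original alpha."""
    m = len(p_values)
    adjusted = [min(1.0, p * m) for p in p_values]
    return adjusted, [p <= alpha for p in adjusted]

# Two variants compared against a baseline, raw p-values 0.01 and 0.04:
adjusted, reject = bonferroni([0.01, 0.04])  # ([0.02, 0.08], [True, False])
```

With two comparisons, a raw p-value of 0.04 no longer clears the 0.05 bar, which is exactly the kind of protection you want when testing several groups against the same baseline at once.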
Glad to see another step towards a more statistically-aware world!
You'll be interested to hear that the digital analytics team at the Obama campaign sometimes made use of your tool for sharable A/B calculations.
I even built a bookmarklet for quickly grabbing numbers off of a page (usually for Google Analytics data) and passing them to your calculator. https://github.com/yahelc/ABBA-bookmarklet
Small follow-up question: I'm curious how your sample size calculator chooses its value. Traditional power analysis asks for two proportions rather than for a proportion and an effect size, so the sample size for a positive effect differs from that for a negative effect (from the same baseline). I'd imagine your tool would conservatively present the larger of the two. However, it seems to present something near the midpoint of the two. Is that the intention? Or is some other statistic being used here?
(For example, with an 8% baseline conversion rate, a 1% absolute detectable difference, 85% power and 10% significance level, your tool says 10,583 per branch. R's `power.prop.test` gives sample sizes of 11,182 for a positive change (8% vs 9%) and 9,974 for a negative change (8% vs 7%). The exact midpoint is 10,578.)
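For reference, here's a Python sketch of the standard normal-approximation formula for this computation. It reproduces the `power.prop.test` numbers quoted above; whether Evan's tool uses the same formula is of course the open question.

```python
import math
from statistics import NormalDist

def sample_size_two_proportions(p1, p2, alpha=0.10, power=0.85):
    """Per-group sample size for detecting p1 vs. p2 with a
    two-sided z-test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.645 for alpha = 0.10
    z_beta = NormalDist().inv_cdf(power)           # ~1.036 for power = 0.85
    p_bar = (p1 + p2) / 2
    sd0 = math.sqrt(2 * p_bar * (1 - p_bar))               # pooled sd under H0
    sd1 = math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))         # unpooled sd under H1
    n = ((z_alpha * sd0 + z_beta * sd1) / (p1 - p2)) ** 2
    return math.ceil(n)

print(sample_size_two_proportions(0.08, 0.09))  # positive change: 11182
print(sample_size_two_proportions(0.08, 0.07))  # negative change: 9974
```

The asymmetry comes from the variance term: a 9% proportion has more variance than a 7% one, so detecting the upward move needs more data.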
I tried and liked the simplicity of yours! I also learned that as a small ecommerce site (5k visitors/day) you may never have enough visitors to effectively run reliable tests :(
That's also why experimental psychologists have mixed feelings about running power analyses to figure out how many participants they'll need in a study to yield statistically meaningful results--it's almost always a humblingly high figure.
Depends: if you're small, then you might have low-hanging fruit and bigger conversion increases might be possible. If that happens to be true then measuring things is tractable.
As a math/stats/data person who doesn't dabble much in web optimization -- can someone explain to me what's awesome about this?
Not to belittle this nice package -- it looks like a basic stats calculator for computing sample sizes and confidence levels with friendly visualization, and I'm just trying to understand what's being valued in the market/industry right now. Is it that current A/B testing software doesn't provide these basic calculations? Or is it that it's well presented and visualized for a lay crowd?
Like all of A/B testing, it's applying a _very_ old statistical method (Chi-Square was one of the first modern statistical techniques--by that I mean it's 113 years old) to an area where statistics has not commonly been used. This makes it seem wonderful and novel as countless people suddenly realize that statistics can be applied to fields that were previously untouched by quantitative analyses.
The statistics being used in the A/B testing world is stuff you would've learned in your very first statistics class, it seems. Judging from the success of Optimizely and VWO, the focus is definitely more on the viz and presentation than on using any cutting-edge techniques.
Gotcha, and thanks -- it just seemed trivial and I was under the (false) assumption that confidence levels and selecting appropriate sample size should be common knowledge, given how much polls are used in day-to-day life.
Good to know there's plenty of opportunity to bring better stats to high tech. Of course, I understand a lot of the value comes from making those things applicable and meaningful to the users...
Power analysis and CI's should be elementary, but I would assert that they are actually not commonplace. Most people have a very surface-level understanding of the latter, and little understanding of the former. In my opinion, A/B Testing has actually done a great service to power analysis. I have seen many experiments in the academic world (social sciences are somewhat notorious for this) forgoing the power analysis for various reasons (fear: they would not be able to get the sample size needed for 80% power, inability to control sample size: you take whatever you can get with a convenience sample). As a statistician, I breathe a sigh of relief with the amount of emphasis power analysis receives in the A/B world. It's a step in the right direction (if you're an acolyte to the dark world of Neyman-Pearson).
As for bringing better stats to high tech, I've thought of this as a wonderful challenge. I'd especially like to see more focus on not violating modeling assumptions (more non and semi-parametrics), and using some more modern techniques from the ML and Bayes literature.
Hypothesis testing is so last century :). Would love to discuss it further with some similarly-inclined HN folks.
Sorry for all the parentheticals. You'd think I was a Lisp programmer with the number of parentheses I used.
Evan these are incredible tools, thank you for contributing another brick in the wall for those of us that bleed A/B testing :)
I would love to see Optimizely and VWO embrace similarly non-ambiguous and functional reporting as a default.
EG - just introducing Chi-squared testing into a discussion with clients or teams that think they're A/B testing properly by following Optimizely's graphs usually turns the discussion on its head - "you mean there's a RANGE? well how can we be certain?" etc.
All of the tools used to have this sort of presentation. GWO, in particular, once had a very nice visualization of confidence-range overlaps (the current product blows).
I suspect they removed them for the same reason I tend to avoid the subject when discussing testing with non-technical people: nearly everyone is numerically illiterate and looking for the "easy" answer. They want a used-car salesman, not a mathematics professor (i.e. "no more of this 'confidence' gobbledygook -- give me the bottom line").
Sad, but in my experience, nearly universally true.
The short answer is that they don't provide much depth.
I've seen Optimizely call something a "winner" with 95% confidence after 48hrs.
The triangular method we use w/the off-the-shelfers is something like:
a) Optimizely base stats
b) Convergence point analysis (useful to correct for day-of-week / unique traffic swings)
c) Chi-squared testing, which provides a range so that you can actually assess the risk of a high-confidence test. E.g. look at the example in Evan's tool which shows 8.5%-22.1% and 13.3%-28.9%. This means that Sample 1 could convert as HIGH as 22.1% and Sample 2 as LOW as 13.3%. If this was rated a high-confidence test in favor of Sample 2, you could potentially be risking a significant conversion decrease by going with Sample 2. I.e. get more data, and don't just buy into "this one is better."
"Use a dedicated statistical package from the '80s"
Is your Wizard app not also a dedicated statistical package? Also, I'm being pedantic here, but how many "dedicated statistical packages" are actually from the 80s? The only ones that come to mind are Stata and Statistica.
Is there a way to view the source code or formulas you use on your pages? There's been a strong push in the academic statistics world for reproducible research, which means public data, open source statistical code.
I ask because I'm curious about your two-sample t-test. Does it pool the variances for all values of the two standard deviations? Pooling doesn't make sense when one sd is 50 and the other is 2...
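To illustrate the pooled-vs-unpooled distinction: Welch's unpooled test replaces the single pooled variance with per-group variances and adjusts the degrees of freedom via Welch-Satterthwaite. A sketch with made-up numbers (means, sds, and counts are all hypothetical):

```python
import math

def pooled_df(n1, n2):
    """Degrees of freedom for the classic pooled-variance t-test."""
    return n1 + n2 - 2

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's (unpooled) two-sample t statistic and the
    Welch-Satterthwaite degrees of freedom."""
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# sd = 50 vs. sd = 2, 30 observations each: the effective df
# collapses toward n1 - 1 = 29, far below the pooled value of 58.
t, df = welch_t(100, 50, 30, 90, 2, 30)
```

When one sd is 50 and the other is 2, nearly all the uncertainty comes from the noisy group, so Welch's df lands near 29 rather than the pooled 58 -- which is the concrete reason pooling doesn't make sense in that case.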
That is really awesome. One suggestion though: Make "relative" the default as well as set "1−β" to 90 or 95%.
I assume that most people have a conversion rate of X, say 30%, and want to increase this by Y, say 20% (30% to 36%). If I consider the type of headline many blog posts and reports have, they are like "How we increased sales, trials, whatever by 50%". That's how they think.
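That relative-vs-absolute convention is easy to trip over, so here's a tiny sketch of the conversion (the helper name is my own):

```python
def absolute_from_relative(baseline, relative_lift):
    """Convert a relative lift (0.20 = +20%) into the absolute
    target conversion rate."""
    return baseline * (1 + relative_lift)

# A "20% increase" on a 30% baseline means a 36% target rate,
# i.e. only a 6-percentage-point absolute change:
target = absolute_from_relative(0.30, 0.20)  # 0.36
```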
We met a few times during last year's Gig Tank (I am one of the cofounders of http://banyan.co). Awesome to see you killing it. Are you planning on coming back to Chattanooga anytime soon? Would love to grab beers. My email is in my profile, and I would love to reconnect.
These look great. For the sake of the permalinks, I'd love to see these hosted on another ___domain. Maybe a Wizard app ___domain, to increase brand awareness?