In an earlier post, we talked about the difference between A/B tests and multivariate tests. Let’s look in more detail at a basic testing issue, sample size.
We have a simple power-analysis spreadsheet that calculates the sample size needed for in-market tests. The spreadsheet takes four inputs:
- The expected response rate. This is the portion of those who receive the treatment that we expect to respond, usually estimated for the baseline or control cell.
- The lift we want to measure—how much higher the response rate of a successful test cell, offer, or attribute level will be than that of the baseline or control (say, 25% higher).
- Two measures, alpha and beta, that govern the test's tolerance for false positives and false negatives. Alpha is usually set at 5% (corresponding to 95% confidence that an observed positive result is not due to chance), and beta at 20%, which gives a power of 80% (an 80% chance of detecting an effect if there really is one, or a 20% chance of a false negative).
These four numbers generate a sample size estimate. For an A/B test, this is the sample size needed per cell for the test. If you have a two-cell test (the standard A/B or test/control), you multiply the number produced by the formula by two. With three cells, you multiply by three, and so on.
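As a rough sketch of how such a spreadsheet arrives at its number, here is the standard two-proportion sample-size formula (normal approximation) in Python. The function name, defaults, and example inputs are illustrative assumptions, not the actual spreadsheet:

```python
from math import ceil, sqrt
from statistics import NormalDist  # Python 3.8+

def sample_size_per_cell(p1, lift, alpha=0.05, power=0.80):
    """Per-cell sample size for a two-proportion test (normal approximation).

    p1    -- expected baseline response rate (e.g. 0.02 for 2%)
    lift  -- relative lift to detect (e.g. 0.25 for a 25% higher rate)
    alpha -- two-sided false-positive rate
    power -- chance of detecting the lift if it is real (1 - beta)
    """
    p2 = p1 * (1 + lift)                      # response rate in the test cell
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2                     # pooled rate under the null
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)

# A 2% baseline rate and a 25% lift need roughly 14,000 cases per cell,
# so a two-cell A/B test would need roughly twice that in total.
n = sample_size_per_cell(0.02, 0.25)
```

Note how the required sample grows as the baseline rate falls or the lift to be detected shrinks, which is why rare-response tests get expensive quickly.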
Things get interesting when we apply the calculation to a multivariate test. This type of test is much more efficient, given that each attribute level appears in multiple test cells. When determining the impact of changing an attribute from one level to another, we can combine across cells (controlling statistically for how they differ from each other in other attributes). So our effective sample size becomes much bigger than just two cells. As a result, the sample size multiplier does not depend on the number of cells.
In fact, the sample size is only a function of the number of levels in the attribute with the most levels. If your multivariate test consists of five two-level attributes and one four-level attribute, you multiply the number calculated in the formula by four. You still multiply by four even if you have six four-level attributes in your test. If you have an eight-level attribute, it doesn’t matter if the rest of the attributes are two-level or four-level; you still multiply by eight. This gives you the total sample size.
We take the total sample size and divide by the number of cells in the test to get the sample size per cell. Note that this is much lower than the per-cell sample size from an A/B test. We can still size the control cell like an A/B control, so the test cells are always smaller than the control cell. Hence the gain in efficiency from the multivariate test.
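The multiplier rule and the per-cell division can be sketched in a few lines of Python. The base sample size, attribute mix, and cell count below are made-up numbers for illustration:

```python
from math import ceil

def multivariate_sample_size(base_n, attribute_levels, num_cells):
    """Total and per-cell sample size for a multivariate test.

    base_n           -- sample size from the power-analysis formula
    attribute_levels -- number of levels in each attribute being tested
    num_cells        -- number of cells in the experimental design
    """
    total = base_n * max(attribute_levels)    # only the largest attribute matters
    per_cell = ceil(total / num_cells)
    return total, per_cell

# Hypothetical example: five 2-level attributes plus one 4-level attribute,
# run in a 16-cell design, with a base sample size of 10,000.
total, per_cell = multivariate_sample_size(10_000, [2, 2, 2, 2, 2, 4], 16)
# total = 10,000 * 4 = 40,000; per_cell = 2,500 -- far below the 10,000
# per cell that an A/B test with the same inputs would require.
```

Adding more two-level attributes (and hence more cells) leaves `total` unchanged and only shrinks `per_cell`, which mirrors the efficiency argument above.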
The drivers of sample size do not include factors that we normally expect. We can keep adding attributes and test cells without affecting the total sample size needed. Each test cell just gets smaller. However, if we increase the number of levels in the key attribute, we pay a big price in sample.
There must be a limit to how far one can push this logic. At some point, things will break down. Otherwise, tests could run with 100 four-level attributes and over 300 test cells, each with very few cases in it. In our experience, however, we have yet to reach that limit: practical implementation considerations cap the number of cells in a test well before the power analysis does.
Paul Markowitz is a principal in Bain & Company’s Advanced Analytics practice. He is based in Boston.