Expert Commentary

Successful A/B Tests in Retail Hinge on These Design Considerations

Following a small set of guidelines will result in more meaningful and trustworthy results.

By June Wu

Published on February 19, 2021



At a Glance

While the concept of A/B testing is straightforward, planning and execution can go awry if marketers don’t carefully consider several steps.
Marketers should thoroughly vet the business hypotheses before designing tests, and ensure that the concepts being tested differ significantly.
Other key guidelines include moving beyond randomization to optimize sample selection, and measuring results early in the design phase.
Testing that follows these guidelines gives marketers greater confidence in the insights and avoids false conclusions.

Marketers at retail companies often use A/B tests to optimize media allocations for different locations or marketing channels, new store layouts or promotions, web designs, and other investments. While the concept of A/B testing is straightforward, planning and execution can go awry if marketers don’t carefully consider several steps. This commentary covers the guidelines that are critical to effective A/B testing.

Thoroughly vet the business hypotheses before designing tests

Always start with business objectives―what the company is trying to prove or disprove.

Each objective should be articulated as a set of practical statements with clear, measurable key performance indicators (KPIs) such as these:

Spending $X to market brand A yields higher sales than spending that same amount on brand B, for a given location (the superiority test).
New pricing strategy A yields different sales from existing strategy B (inequality test).
Store layout A leads to the same sales as layout B (equality test).
Web design A leads to the highest web traffic among alternative options B, C, and D (multiple pairwise superiority test).

Business objectives need to be measurable so it’s clear before the test starts what will be measured. They also should be practical so that once the test concludes, executives know exactly which actions to take and how.
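To make the difference between the superiority and inequality statements concrete, here is a minimal sketch that tests the same data both ways. It assumes large enough samples for a normal approximation (a z-test), and the function name and sales figures are hypothetical rather than taken from this commentary. The equality test would additionally need an agreed margin within which the two layouts count as equal, which is not shown here.

```python
from math import sqrt
from statistics import NormalDist


def z_test(mean_a, mean_b, sd_a, sd_b, n_a, n_b):
    """Return the z statistic plus p-values for the superiority and inequality tests."""
    standard_error = sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
    z = (mean_a - mean_b) / standard_error
    normal = NormalDist()
    p_superiority = 1 - normal.cdf(z)            # alternative: A > B (one-sided)
    p_inequality = 2 * (1 - normal.cdf(abs(z)))  # alternative: A != B (two-sided)
    return z, p_superiority, p_inequality


# Hypothetical weekly sales per store (in $ thousands) under options A and B.
z, p_sup, p_ineq = z_test(mean_a=105.0, mean_b=100.0, sd_a=15.0, sd_b=15.0, n_a=50, n_b=50)
print(f"z = {z:.2f}, superiority p = {p_sup:.3f}, inequality p = {p_ineq:.3f}")
```

With these illustrative numbers the one-sided superiority test clears the 5% threshold while the two-sided inequality test does not, which is why the form of the hypothesis should be fixed before the data arrive.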

The same business objective often is associated with multiple KPIs, depending on the channels used, purposes of the tests, and other variables. For example, in tests of spending on television ads, companies may care about viewership and viewing time; for online search ads or social media, they typically measure impressions; for email campaigns, they look at open rates, click-through rates, and conversion rates. The common question is what evidence they should collect in order to claim a winner.

Test concepts that have real differences

For A/B tests to generate meaningful business outcomes, they must create innovative and sometimes fundamentally different offers that will provoke different responses. By contrast, testing variants that are marginally different probably won’t generate meaningful insights.

Move beyond randomization to optimize sample selection

Testers should allocate the available sample into look-alike testing and control groups. Most marketers do this with randomization. As discussed in a previous commentary, randomization is sufficient when the sample is large (at least 10,000 per group), but it’s insufficient when dealing with small numbers, as is common for retail tests looking at markets or stores (typically fewer than 100).

Fortunately, Bain & Company has created an optimization algorithm for this small-sample case. When the sample size is small, we intelligently allocate each subject into testing and control groups so that the groups look as alike as possible. This ensures that all groups are comparable at the baseline, when we don’t do anything to them or we treat all groups equally. Then, if we observe any uplift after a test in which control and test groups are treated differently, we can attribute the effect to our treatment with far more confidence, and worry much less about potential sampling biases.

Such “intelligent sample allocation” matters because the uplifts we observe are small, often sales increases of 1% to 3%. A small sampling bias could easily confound our treatments and lead to a false conclusion.
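The commentary does not describe Bain’s algorithm, so the sketch below is not that method. It illustrates the general idea with a simple, well-known alternative (rerandomization): draw many random splits and keep the one whose groups have the closest baseline averages. The store names, baseline figures, and function name are all hypothetical.

```python
import random
from statistics import mean


def balanced_split(baseline_sales, n_candidates=2000, seed=0):
    """Try many random half/half splits and keep the one with the closest baseline means."""
    rng = random.Random(seed)
    stores = list(baseline_sales)
    half = len(stores) // 2
    best_gap, best_split = float("inf"), None
    for _ in range(n_candidates):
        rng.shuffle(stores)
        test, control = stores[:half], stores[half:]
        gap = abs(
            mean(baseline_sales[s] for s in test)
            - mean(baseline_sales[s] for s in control)
        )
        if gap < best_gap:
            best_gap, best_split = gap, (sorted(test), sorted(control))
    return best_split, best_gap


# Hypothetical baseline weekly sales for 20 stores.
data_rng = random.Random(42)
baseline = {f"store_{i:02d}": data_rng.gauss(100, 20) for i in range(20)}
(test_group, control_group), gap = balanced_split(baseline)
print(f"baseline gap between groups: {gap:.3f}")
```

A real allocation would usually balance several baseline attributes at once (sales, store size, region, seasonality) rather than a single average, but the principle is the same: with only a few dozen stores, one random draw can leave a baseline gap larger than the 1% to 3% uplift being measured.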

Cover all the marketing channels

Traditional tests tend to be either offline or online, with few covering both. Yet throughout the Covid-19 pandemic, consumer behaviors have changed significantly, notably with large boosts to online sales and lower sales at many brick-and-mortar stores.

In this environment, marketers need to design tests to properly measure the effects of the new behaviors, creating strategies to optimize marketing spending and increase overall sales. This raises new challenges, such as sample representativeness and response-frequency biases between online and offline.

Sample selection thus becomes even more important. If all groups are alike, the likelihood of shopping in a store or online should be equal. If any groups have a different propensity to shop online, that could invite certain advertising or marketing tactics. It’s best to start simple and small, running a series of Agile tests to learn something new each time and gradually build confidence.

Measure results early in the design phase

Marketers frequently start considering how to measure results only after the campaign finishes. It’s more useful to make results measurement part of decision making when setting up the test. Because the feasibility of measurement dictates what kinds of tests can be run and what sample size is affordable, it should be considered carefully when designing the tests. Flawed test designs can’t be fixed later.

Spend adequate time on sample size

Sample size and selection are far from trivial considerations, and they deserve the application of a significant body of statistical science.

What does sample size mean? To answer this question, first define the unit of analysis―the number of markets, stores, customers, clicks, or other variables. The answer will depend on testing objectives and how one wants to measure results. For targeting purposes, the number of markets or stores or both might be important; for results measurement, what matters is the level of measuring and comparing results, which might be the number of customers or clicks.

Determining the right sample size requires settling on several elements:

the units of measurement, depending on the KPI types being measured (continuous vs. binary);
the kind of comparisons being run (A/B, A/B/n, where “n” is the number of variants tested, multivariate, and so on);
the desired confidence level (95% or 90%, which corresponds to a significance level of 5% or 10%); and
the size of the effect to detect (for example, a 10% or 20% uplift).

All of these feed into a proper statistical “power analysis.” In practice, you don’t want to overemphasize statistical significance and thus risk losing opportunities to capture meaningful signals. Sometimes, missing upside potential is a bigger risk than playing safe by not running the test or not implementing its results.
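As a rough illustration of how those elements combine, the sketch below applies the standard normal-approximation formula for comparing the means of a continuous KPI between two groups. The function name, the 80% power default, and the sales figures are assumptions made for the example, not values from this commentary. A binary KPI or an A/B/n design would need a different formula.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(baseline_mean, sd, uplift, confidence=0.95, power=0.80):
    """Approximate units per group for a two-sided comparison of two means."""
    alpha = 1 - confidence
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the confidence level
    z_power = NormalDist().inv_cdf(power)          # extra margin needed to reach the power
    delta = baseline_mean * uplift                 # absolute difference to detect
    return ceil(2 * ((z_alpha + z_power) * sd / delta) ** 2)


# Hypothetical KPI: weekly sales per store with mean 100 and standard deviation 20.
for uplift in (0.20, 0.10, 0.02):
    n = sample_size_per_group(baseline_mean=100, sd=20, uplift=uplift)
    print(f"uplift {uplift:.0%}: about {n} units per group")
```

Because the required sample grows with the inverse square of the uplift, halving the effect to detect roughly quadruples the number of units needed.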

The technical experts overseeing the science must work closely with business owners who have the domain knowledge to codesign sample size. Trade-offs are often needed, based on what’s feasible. For example, logistical requirements may limit the possible sample size. If the available sample isn’t sufficient for testing purposes, the team must adjust and loosen its criteria, for example by reducing the number of test variations, targeting a larger uplift, or accepting lower confidence in the significance of the results.
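One way to frame that trade-off is to fix the sample the business can afford and ask what uplift it could reliably detect. The sketch below does this for a continuous KPI using a normal approximation. The function name, the 80% power default, and the store figures are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_uplift(n_per_group, baseline_mean, sd, confidence=0.95, power=0.80):
    """Smallest relative uplift a two-group test of this size can reliably detect."""
    alpha = 1 - confidence
    z_total = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    delta = z_total * sd * sqrt(2 / n_per_group)  # absolute detectable difference
    return delta / baseline_mean


# Hypothetical constraint: only 40 stores per group are available.
for confidence in (0.95, 0.90):
    mde = minimum_detectable_uplift(40, baseline_mean=100, sd=20, confidence=confidence)
    print(f"confidence {confidence:.0%}: detectable uplift of about {mde:.1%}")
```

Lowering the confidence level or adding units shrinks the detectable uplift, which gives technical and business owners a shared number to negotiate around.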

Simply put, the team must be flexible and design practical tests that can deliver meaningful insights for the particular marketing needs.

Scope out the test duration

Tests in retail are usually longitudinal, which requires consideration of how long to run the tests. All else being equal, there will be trade-offs between test duration and the number of stores or observations per day or week. Once testers know they need a certain number of days or weeks to claim success, they must resist the urge to peek early and p-hack premature results, which could easily lead to false conclusions. They should draw insights only after reaching the minimum duration required and seeing results stabilize. The exception to this rule is a “multiarmed bandit” test, which includes a specific methodology in which early test results directly affect later test execution.
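A small simulation shows why peeking is risky. The sketch below runs many A/A tests, in which both groups are drawn from the same distribution so any “winner” is a false positive. It then compares stopping at the first significant look with waiting for the planned end. The number of looks, the observations per look, and the known unit variance are simplifying assumptions chosen for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist


def false_positive_rates(n_sims=500, n_looks=10, per_look=20, confidence=0.95, seed=1):
    """Simulate A/A tests and compare peeking at every look with one final look."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    peeking_hits = final_hits = 0
    for _sim in range(n_sims):
        sum_a = sum_b = 0.0
        n = 0
        significant_at_any_look = False
        significant_now = False
        for _look in range(n_looks):
            sum_a += sum(rng.gauss(0, 1) for _ in range(per_look))
            sum_b += sum(rng.gauss(0, 1) for _ in range(per_look))
            n += per_look
            # Unit variance is known here, so the standard error is sqrt(2 / n).
            z = (sum_a / n - sum_b / n) / sqrt(2 / n)
            significant_now = abs(z) > z_crit
            significant_at_any_look = significant_at_any_look or significant_now
        peeking_hits += significant_at_any_look  # declared a winner at some look
        final_hits += significant_now            # declared a winner at the planned end
    return peeking_hits / n_sims, final_hits / n_sims


peeking_rate, final_rate = false_positive_rates()
print(f"false positives with peeking: {peeking_rate:.1%}, at planned end: {final_rate:.1%}")
```

The single final look keeps false positives near the nominal 5%, while checking at every look and stopping on the first significant result typically pushes the rate well above that, even though nothing differs between the groups.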

Designing tests is an art that involves lots of science. Done right, testing can lead to insights that open new possibilities with confidence. By contrast, tests that are poorly designed due to various constraints will yield lower confidence in the insights or even false conclusions.

The author thanks Bain colleagues Paul Markowitz and Richard Lichtenstein for their review and contributions to this commentary.
