A/B ⚖ Testing with Multiple Metrics

Angelina Yang
4 min read · Dec 1, 2022

Much of the literature and guidance on A/B testing focuses on tests with a single comparison or a single metric. Some of my friends working in tech have also shared that they typically focus on one primary metric when designing experiments (e.g., for study design and sample size calculation).

I’m curious if this is a common practice in the tech industry (especially the non-biotech industries).

Do you factor in multiple metrics when performing experiment design at your company?

Yes, factor in more than 1 metric.

No, only focus on 1 primary metric.

What I learned from industry practitioners is that multiple metrics (usually 3 to 5) are monitored even though the sample size might be based on one primary metric.

The issue of multiplicity

There are two types of multiplicity issues in experiments. One is comparing more than two groups (e.g., one control and two different treatments); the other is observing more than one metric.

Multiple groups of comparison:

There is a good interview question about this from DataInterviewPro:

We are running a test with 10 variants, trying different versions of our landing page. One treatment wins and the p-value is less than .05. Would you make the change?

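This is a multiple testing problem, since there are essentially 10 tests involved, i.e., 10 pairs of groups to compare. With that many comparisons, the chance that at least one of them looks significant at the .05 level purely by chance is much higher than 5%.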
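To put a rough number on this, here is a minimal sketch of the family-wise error rate (the probability of at least one false positive) for 10 tests. It assumes the tests are independent and that every null hypothesis is true, which is a simplification; the function name `familywise_error_rate` is just illustrative.

```python
def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across `num_tests` tests.

    Assumes independent tests, each run at significance level `alpha`,
    with all null hypotheses true (i.e., no variant has a real effect).
    """
    return 1.0 - (1.0 - alpha) ** num_tests


# 10 variants, each tested at the usual .05 level
fwer_10 = familywise_error_rate(10)
print(f"Chance of at least one false win: {fwer_10:.3f}")  # prints 0.401
```

Under these assumptions, one "winning" variant out of 10 at p < .05 is roughly a 40% event even when nothing works, so a single significant result is weak evidence on its own.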

Multiple metrics to track:

This happens when a researcher is interested in more than one outcome in a study. For instance, for a test of the effect of different referral bonuses on user acquisition, we might want to know its effect on more than just conversion rate, but perhaps also on frequency of subsequent purchases or amount of purchases.

While it is often good methodological practice to have more than one measure of the response variable of interest, additional response variables mean more statistical tests need to be conducted on the data set. This raises the question of experiment-wise alpha control, i.e., how to keep the overall false-positive rate across all of those tests at the intended level.
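As an illustration of what alpha control can look like with three metrics, here is a minimal sketch of two common corrections, Bonferroni and Holm. These are standard techniques, not necessarily the ones any particular team uses. The metric names and p-values below are hypothetical, made up for the referral bonus example.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject a hypothesis only if its p-value is <= alpha / number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm(p_values, alpha=0.05):
    """Holm step-down procedure: test p-values from smallest to largest,
    relaxing the threshold at each step, and stop at the first failure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # all larger p-values are also not rejected
    return reject


# Hypothetical p-values for three metrics (illustrative only, not real data)
metrics = ["conversion_rate", "purchase_frequency", "purchase_amount"]
p_values = [0.012, 0.020, 0.30]

bonferroni_result = bonferroni(p_values)
holm_result = holm(p_values)

for name, b, h in zip(metrics, bonferroni_result, holm_result):
    print(f"{name}: bonferroni={b}, holm={h}")
```

With these made-up numbers, Bonferroni keeps only the first metric as significant, while Holm keeps the first two. Both control the family-wise error rate at 0.05, but Holm is never less powerful than Bonferroni.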

When we have three measurements of the response variable, then three tests are needed to test…
