SMS A/B Testing Sample Size and Statistical Significance Explained

Trackly SMS

Tags: sms a/b testing, sample size, statistical significance, thompson sampling, campaign optimization, sms marketing

Running an SMS A/B test is straightforward. Knowing when to call a winner is not. Too many marketers glance at early results, see one variant outperforming another by a fraction of a percent, and declare victory after a few hundred sends. The problem is that small samples produce noisy data, and noise masquerading as signal leads to bad decisions. Understanding SMS A/B testing sample size requirements and statistical significance is what separates rigorous optimization from expensive guesswork.

This guide covers the math, the practical shortcuts, and the mental models needed to design conclusive SMS tests — from calculating the sample size you actually need, to interpreting results correctly, to understanding how modern algorithmic approaches can handle much of this complexity automatically.

Why Sample Size Matters in SMS A/B Testing

Every A/B test is a statistical experiment. The goal is to determine whether the difference in performance between two (or more) message variants reflects a real, repeatable difference or is simply random variation. Sample size is the single most important factor in making that determination reliably.

With too few observations, your test lacks statistical power — the ability to detect a real difference when one exists. With too many, you waste sends and delay decisions. The objective is to find the minimum sample size that gives you confidence in the result.

The Two Errors You Can Make

Statistical testing frames the problem around two types of mistakes:

A Type I error (false positive) is concluding that the variants differ when they do not. Its probability is the significance level α, typically set at 0.05. A Type II error (false negative) is failing to detect a difference that really exists. Its probability is β, and statistical power is 1 − β, typically set at 0.80.

Both errors carry costs. A false positive means you adopt a variant that is not actually better, potentially degrading future campaign performance. A false negative means you miss a genuine improvement and leave revenue on the table.

The Core Formula: How to Calculate SMS A/B Test Sample Size

For a two-proportion z-test — the standard framework for comparing click-through rates or conversion rates between two groups — the minimum sample size per variant is:

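n = (Z_{α/2} + Z_β)² × [p₁(1−p₁) + p₂(1−p₂)] / (p₁ − p₂)²

Where:

  - n is the number of messages required per variant
  - Z_{α/2} is the standard normal critical value for a two-sided test at significance level α (1.96 for α = 0.05)
  - Z_β is the standard normal critical value for the desired power 1 − β (0.84 for 80% power)
  - p₁ is the baseline rate
  - p₂ is the rate you expect from the improved variant (the baseline plus the minimum detectable effect)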

Memorizing this formula is unnecessary. What matters is understanding the four inputs that drive the result: your baseline rate, the minimum effect size you care about, your significance level, and your desired power.

Worked Example: Click-Through Rate Test

Suppose your current SMS click-through rate (CTR) is 4.0%, and you want to detect a lift to 5.0%, a 1 percentage point absolute improvement (25% relative lift). Using α = 0.05 (critical value 1.96) and power = 0.80 (critical value 0.84):

n = (1.96 + 0.84)² × (0.04 × 0.96 + 0.05 × 0.95) / (0.05 − 0.04)²
n = 7.84 × 0.0859 / 0.0001 ≈ 6,734

That means you need roughly 6,734 messages per variant, or about 13,468 total sends for a two-variant test. If you want to detect a smaller lift, say half a percentage point, the required sample size roughly quadruples.
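
The calculation above can be sketched in a few lines of Python using only the standard library. The function name `sample_size_per_variant` is hypothetical and chosen for illustration. Because it uses unrounded critical values rather than 1.96 and 0.84, it returns a figure slightly higher than 6,734.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(p1, p2, alpha=0.05, power=0.80):
    """Minimum sends per variant for a two-sided two-proportion z-test.

    p1 is the baseline rate, p2 the rate you want to be able to detect.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)


# Worked example from the text: 4.0% baseline, detect a lift to 5.0%.
print(sample_size_per_variant(0.04, 0.05))
```

Rounding up with `math.ceil` is a deliberate choice here: a fractional message cannot be sent, and rounding down would leave the test slightly underpowered.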

Sample Size Reference Table for Common SMS Scenarios

The following table shows approximate sample sizes per variant for common baseline rates and minimum detectable effects, using α = 0.05 and 80% power.

| Baseline Rate | Minimum Detectable Lift (Absolute) | Minimum Detectable Lift (Relative) | Sample Size Per Variant | Total Sends (2 Variants) |
|---|---|---|---|---|
| 2% | +1 pp (to 3%) | +50% | ~3,800 | ~7,600 |
| 2% | +0.5 pp (to 2.5%) | +25% | ~13,800 | ~27,600 |
| 4% | +1 pp (to 5%) | +25% | ~6,700 | ~13,400 |
| 4% | +0.5 pp (to 4.5%) | +12.5% | ~25,500 | ~51,000 |
| 6% | +1.5 pp (to 7.5%) | +25% | ~4,400 | ~8,800 |
| 6% | +1 pp (to 7%) | +16.7% | ~9,500 | ~19,000 |
| 10% | +2 pp (to 12%) | +20% | ~3,800 | ~7,600 |
| 10% | +1 pp (to 11%) | +10% | ~14,700 | ~29,400 |

Two patterns stand out. For the same relative lift, lower baseline rates require larger samples, because clicks are rarer and each send carries less information. And for any baseline, shrinking the effect you want to detect increases the required sample sharply: halving the lift roughly quadruples it. This is the fundamental tradeoff in test design.

The Three Levers You Can Actually Control

If the required sample size exceeds your available audience, you have three options. Each involves a deliberate tradeoff.

1. Accept a Larger Minimum Detectable Effect

Instead of trying to detect a 0.5 percentage point lift, design your test to detect a 1.5 or 2 point lift. This means you will only identify large improvements, but you can do it with a much smaller audience. For many SMS campaigns, a variant that does not produce at least a 1-point CTR improvement may not be worth adopting anyway, making this a reasonable concession.

2. Lower Your Confidence Threshold

Moving from 95% confidence (α = 0.05) to 90% confidence (α = 0.10) reduces the required sample size by roughly 20% at 80% power. This is a legitimate choice for low-stakes tests, such as comparing two similar CTAs where either outcome is acceptable. It is less appropriate for tests that will inform a long-running campaign strategy.

3. Test Bigger Differences

Rather than testing incremental copy tweaks, test radically different approaches: a completely different offer, a different message structure, or a different value proposition. Larger real-world differences produce larger observed effects, which require smaller samples to detect. This aligns with the advice in our guide on SMS A/B testing beyond message copy, where testing CTAs, offers, timing, and segments often produces bigger lifts than minor wording changes.
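
To see how these levers play out for a specific list, it can help to turn the sample size formula around and ask what lift a given audience can detect. The sketch below does that by bisection. The helper names `required_n` and `min_detectable_lift` are hypothetical, and the 2,500-per-variant audience is an assumed figure for illustration only.

```python
from statistics import NormalDist


def required_n(p1, p2, alpha=0.05, power=0.80):
    """Sends per variant needed to detect p1 -> p2 (two-sided z-test)."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return z ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2


def min_detectable_lift(p1, n_per_variant, alpha=0.05, power=0.80):
    """Smallest absolute lift detectable with n_per_variant sends."""
    lo, hi = p1 + 1e-6, 0.999
    for _ in range(60):
        mid = (lo + hi) / 2
        if required_n(p1, mid, alpha, power) > n_per_variant:
            lo = mid  # this lift is too small for the available audience
        else:
            hi = mid  # this lift is detectable; try a smaller one
    return hi - p1


# Lever 1: what lift can a 4% baseline detect with 2,500 sends per variant?
print(f"{min_detectable_lift(0.04, 2500) * 100:.2f} pp")

# Lever 2: relative sample size at 90% confidence versus 95% confidence.
print(f"{required_n(0.04, 0.05, alpha=0.10) / required_n(0.04, 0.05):.2f}")
```

Bisection works here because, for a fixed baseline, the required sample size falls steadily as the target rate moves further above the baseline.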

When to Check Results: The Peeking Problem

One of the most common mistakes in A/B testing is checking results repeatedly during the test and stopping as soon as one variant looks like a winner. This practice, known as "peeking," inflates your false positive rate well beyond the 5% you intended.

If you check a test at 10 different points during its run, each check is an opportunity to observe a random fluctuation and mistake it for a real effect. Research has shown that continuous monitoring of a standard fixed-horizon test can inflate the actual false positive rate to 20–30%, even when using a 5% significance threshold.
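
A small simulation makes the inflation visible. The sketch below runs A/A tests, in which both variants share the same true click rate, so every "significant" result is a false positive. It compares stopping at any of 10 interim looks against checking only once at the end. All parameters (300 simulated tests, 400 sends per variant between looks, a 5% click rate, the seed) are assumptions chosen to keep the run short.

```python
import random
from statistics import NormalDist


def p_value(clicks_a, clicks_b, n):
    """Two-sided p-value for a two-proportion z-test with n sends per variant."""
    pooled = (clicks_a + clicks_b) / (2 * n)
    se = (pooled * (1 - pooled) * 2 / n) ** 0.5
    if se == 0:
        return 1.0  # no clicks (or all clicks) in both arms: no evidence
    z = abs(clicks_a - clicks_b) / n / se
    return 2 * (1 - NormalDist().cdf(z))


def simulate(n_tests=300, looks=10, batch=400, rate=0.05, alpha=0.05, seed=1):
    """Return (false positive rate with peeking, rate with one final check)."""
    rng = random.Random(seed)
    peeking_fp = fixed_fp = 0
    for _ in range(n_tests):
        a = b = 0
        significant_at_any_look = False
        for look in range(1, looks + 1):
            a += sum(rng.random() < rate for _ in range(batch))
            b += sum(rng.random() < rate for _ in range(batch))
            significant = p_value(a, b, look * batch) < alpha
            significant_at_any_look = significant_at_any_look or significant
        peeking_fp += significant_at_any_look
        fixed_fp += significant  # value from the final look only
    return peeking_fp / n_tests, fixed_fp / n_tests


peeking, fixed = simulate()
print(f"stop at first significant look: {peeking:.1%}")
print(f"single check at the end:        {fixed:.1%}")
```

With settings like these, the single-check rate should land near the nominal 5%, while the peeking rate should come out noticeably higher. Exact figures vary with the seed and the number of simulated tests.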

Solutions to the Peeking Problem

The Bayesian approach is particularly well-suited to SMS marketing, where campaigns often have natural time constraints and marketers need to make decisions quickly. This is the foundation of algorithmic creative selection, discussed in a later section.

Statistical Significance: What It Actually Means (and What It Does Not)

A result is "statistically significant at the 95% level" when a difference at least as large as the one observed would occur by chance less than 5% of the time if there were no real difference between the variants. That is the full extent of the claim. It does not mean there is a 95% probability that the winning variant is actually better. It does not mean the effect size is large or meaningful. It does not mean the result will replicate in every future campaign.

Statistical Significance vs. Practical Significance

With a large enough sample, even a trivially small difference can be statistically significant. If you send 500,000 messages per variant, you might detect a 0.05 percentage point CTR difference with high confidence. But a 0.05-point lift is operationally meaningless — it will not move the needle on revenue or ROI.

Always pair statistical significance with a practical significance threshold: the minimum improvement that justifies the effort of implementing the change. For most SMS campaigns, this falls somewhere between 0.5 and 2 percentage points of CTR, depending on volume and economics.

Confidence Intervals Are More Useful Than P-Values

Rather than fixating on whether a result is "significant" or not, examine the confidence interval around the observed difference. A 95% confidence interval of [+0.3 pp, +1.7 pp] tells you much more than a bare p-value. It communicates the plausible range of the true effect, allowing you to judge whether even the lower bound is large enough to matter.

If the confidence interval includes zero, the test is inconclusive. If the entire interval is above your practical significance threshold, you have a clear winner. If the interval is positive but includes values below your threshold, the result is statistically significant but may not be practically meaningful.
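
The three reading rules above can be written down as a small decision helper. This is an illustrative sketch: the function name `classify_interval` and the label strings are hypothetical, and the final branch (an interval entirely below zero) is an added case the text does not discuss.

```python
def classify_interval(lower_pp, upper_pp, practical_threshold_pp):
    """Interpret a confidence interval for the lift (B minus A), in percentage points."""
    if lower_pp <= 0 <= upper_pp:
        return "inconclusive"
    if lower_pp >= practical_threshold_pp:
        return "clear winner"
    if lower_pp > 0:
        return "statistically significant, practical value uncertain"
    # Assumption beyond the text: the whole interval is negative.
    return "variant is worse than control"


# The interval quoted earlier, judged against an assumed 0.5 pp threshold.
print(classify_interval(0.3, 1.7, practical_threshold_pp=0.5))
```

Setting the practical threshold before the test runs keeps the decision from drifting toward whatever the data happens to show.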

Real-World Scenario: Planning a Test With a 50,000-Contact List

The following walkthrough illustrates a realistic planning exercise. Assume you have a list of 50,000 opted-in subscribers and want to test two message variants for an upcoming promotional campaign.

Step 1: Establish Your Baseline

Review your last several campaigns to determine your typical CTR. Suppose it is 3.5%. If you do not have historical data, start with a conservative estimate. Our SMS A/B testing guide covers how to establish reliable baselines.

Step 2: Define Your Minimum Detectable Effect

Ask what the smallest improvement worth knowing about would be. Given a 3.5% baseline, a lift to 5% (1.5 percentage points, or ~43% relative) represents a meaningful improvement. A lift to 4% (0.5 points, ~14% relative) would be useful to know but harder to detect.

Step 3: Calculate Required Sample Size

Using the formula with p₁ = 0.035, p₂ = 0.05, α = 0.05, and 80% power, the required sample per variant is approximately 2,800. For two variants, that is about 5,700 total sends, well within your 50,000-contact list.

Step 4: Decide on Test Allocation

Allocate 5,000 randomly selected contacts to each variant (10,000 total, or 20% of the list) and hold back the remaining 40,000 for the winning message. This is comfortably above the calculated minimum per variant.

Step 5: Run the Test and Wait

Send the messages and wait for a sufficient click window. For SMS, most clicks happen within the first 2–4 hours, but allowing 24 hours captures late responders. Do not check results at the 30-minute mark and make a call.

Step 6: Analyze Results

Suppose Variant A gets a 3.4% CTR (170 clicks out of 5,000) and Variant B gets a 4.8% CTR (240 clicks out of 5,000). The observed difference is 1.4 percentage points. Running a two-proportion z-test yields a z-statistic of about 3.5 and a two-sided p-value of approximately 0.0004, well below 0.05. The 95% confidence interval for the difference is roughly [+0.6 pp, +2.2 pp].

The entire interval is above zero and above a reasonable practical significance threshold. Variant B is the winner. Send it to the remaining 40,000 contacts with confidence.
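
The Step 6 analysis can be reproduced with the standard library alone. The sketch below is one common way to set it up, assuming a pooled standard error for the test statistic and an unpooled one for the confidence interval. The function name `two_proportion_test` is hypothetical.

```python
from statistics import NormalDist


def two_proportion_test(clicks_a, n_a, clicks_b, n_b, confidence=0.95):
    """Return (difference, two-sided p-value, confidence interval) for B minus A."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    diff = p_b - p_a

    # Test statistic: pooled rate, because the null hypothesis says A and B are equal.
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se_pooled = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Interval: unpooled standard error, because it estimates the actual difference.
    se_unpooled = (p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b) ** 0.5
    margin = NormalDist().inv_cdf(0.5 + confidence / 2) * se_unpooled
    return diff, p_value, (diff - margin, diff + margin)


diff, p_value, (low, high) = two_proportion_test(170, 5000, 240, 5000)
print(f"difference: {diff * 100:+.1f} pp")
print(f"p-value:    {p_value:.4f}")
print(f"95% CI:     [{low * 100:+.1f} pp, {high * 100:+.1f} pp]")
```

Run on the figures from Step 6, this gives an interval that matches the roughly [+0.6 pp, +2.2 pp] range quoted above.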

Multi-Variant Tests: More Variants Means More Data

Testing more than two variants is common in SMS, especially when experimenting with different offers, CTAs, or message structures. Each additional variant increases the total sample size required for two reasons.

First, you need enough data for each variant individually. If you need 5,000 per variant and you are testing four variants, that is 20,000 sends just for the test phase. Second, making multiple comparisons increases the probability of a false positive. With four variants, there are six pairwise comparisons. If each comparison has a 5% false positive rate and the comparisons are treated as independent, the probability of at least one false positive across all of them is 1 − 0.95⁶, or roughly 26%. This is the multiple comparisons problem.

Corrections for Multiple Comparisons

For most SMS marketers testing 2–4 variants, the Bonferroni correction, which divides α by the number of comparisons, is sufficient and straightforward to apply. With six comparisons, each one is tested at 0.05 / 6 ≈ 0.008 instead of 0.05. Be aware that this stricter threshold increases the sample size each variant needs.
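
The arithmetic behind the multiple comparisons problem and the Bonferroni adjustment is short enough to check directly. In this sketch the function names are hypothetical, and the family-wise figure treats the pairwise comparisons as independent. That is a simplifying assumption, since comparisons that share a variant are correlated.

```python
from math import comb


def familywise_error(n_variants, alpha=0.05):
    """Pairwise comparison count and chance of at least one false positive."""
    comparisons = comb(n_variants, 2)
    return comparisons, 1 - (1 - alpha) ** comparisons


def bonferroni_alpha(n_variants, alpha=0.05):
    """Per-comparison significance level after a Bonferroni correction."""
    return alpha / comb(n_variants, 2)


comparisons, error = familywise_error(4)
print(f"{comparisons} comparisons, family-wise error {error:.1%}")
print(f"Bonferroni-adjusted alpha: {bonferroni_alpha(4):.4f}")
```

The adjusted α can then be used in place of 0.05 when sizing each comparison.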

Algorithmic Creative Selection: Letting the Math Run Itself

Traditional A/B testing follows a rigid structure: define sample size, split traffic evenly, wait, analyze, pick a winner. This works, but it has a notable drawback — during the test phase, half your traffic goes to the losing variant. This is the exploration cost, and for high-volume SMS senders, it can be substantial.

Multi-armed bandit algorithms offer an alternative. Instead of splitting traffic evenly, they dynamically allocate more traffic to variants that are performing well, while still sending some traffic to underperforming variants to confirm they are truly worse. A widely used approach is Thompson Sampling, a Bayesian method that balances exploration and exploitation by sampling from the posterior distribution of each variant's performance.

How Thompson Sampling Works in SMS

  1. Start with a prior belief about each variant's click rate (typically a uniform or weakly informative Beta distribution).
  2. For each message to be sent, draw a random sample from each variant's current posterior distribution.
  3. Send the variant with the highest sampled value.
  4. Observe the outcome (click or no click) and update the posterior distribution for the variant that was sent.
  5. Repeat.

Early in the test, when there is little data, traffic is distributed roughly evenly because the posterior distributions overlap heavily. As data accumulates and one variant pulls ahead, the algorithm naturally shifts more traffic to it. A variant that is clearly worse might receive only 10–15% of traffic by the end of the campaign, rather than the 50% it would get in a traditional A/B test.
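
The five steps above can be sketched as a short simulation. This is a minimal illustration of Beta-Bernoulli Thompson Sampling under assumed conditions, not a description of how any particular platform implements it. The true click rates, send count, and seed are made up, and clicks are simulated instantly rather than arriving over a delayed click window.

```python
import random


def thompson_sampling(true_rates, n_sends, seed=0):
    """Simulate Thompson Sampling and return the number of sends per variant."""
    rng = random.Random(seed)
    # Step 1: Beta(1, 1) prior (uniform) for every variant.
    alpha = [1] * len(true_rates)  # 1 + observed clicks
    beta = [1] * len(true_rates)   # 1 + observed non-clicks
    sends = [0] * len(true_rates)
    for _ in range(n_sends):
        # Step 2: draw one sample from each variant's posterior.
        draws = [rng.betavariate(a, b) for a, b in zip(alpha, beta)]
        # Step 3: send the variant with the highest sampled value.
        chosen = draws.index(max(draws))
        sends[chosen] += 1
        # Step 4: observe the (simulated) outcome and update that posterior.
        if rng.random() < true_rates[chosen]:
            alpha[chosen] += 1
        else:
            beta[chosen] += 1
    return sends


# Two variants with assumed true click rates of 3% and 5%.
print(thompson_sampling([0.03, 0.05], n_sends=10_000))
```

In a real campaign the click for a message is not known before the next one is sent, so posterior updates would happen in batches as clicks arrive.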

Trackly's ML-powered algorithmic creative selection uses this approach. Rather than requiring marketers to pre-calculate sample sizes and manually analyze results, the system continuously evaluates creative performance and shifts allocation toward top-performing messages. This does not eliminate the need to understand the statistics — knowing how sample size, effect size, and confidence interact helps you design better creative variants and set realistic expectations — but it automates the mechanical parts of test execution.

When to Use Traditional A/B Tests vs. Algorithmic Allocation

| Factor | Traditional A/B Test | Algorithmic (Bandit) Allocation |
|---|---|---|
| Primary use case | Learning and documentation | Maximizing campaign performance |
| Traffic split | Fixed (e.g., 50/50) | Dynamic, shifts to winner |
| Exploration cost | Higher | Lower |
| Statistical framework | Well-understood frequentist model | Bayesian, requires different interpretation |
| Ease of interpretation | Clear winner/loser with p-value | Probability of being best |
| Ideal list size | Any (with proper sizing) | Larger lists benefit more |
| Peeking problem | Yes, if not pre-committed | No (continuous monitoring is built in) |

In practice, many teams use traditional A/B tests for foundational learning ("which message structure works for this audience?") and algorithmic allocation for ongoing optimization within campaigns.

Common Mistakes That Invalidate SMS Test Results

Even with correct sample sizes, several practical errors can undermine your results.

Non-Random Assignment

If your test groups are not randomly assigned, any observed difference might be due to the groups themselves rather than the message variants. For example, if Variant A goes to contacts imported last month and Variant B goes to contacts imported six months ago, differences in engagement could reflect list freshness rather than creative quality. Proper randomization at the contact level is essential.

Testing During Anomalous Periods

Running a test during a holiday, a major news event, or a platform outage can produce results that do not generalize to normal conditions. Where possible, run tests during typical sending periods and avoid days with known anomalies.

Changing Variables Mid-Test

If you modify a variant's landing page, offer, or tracking link during the test, you are no longer running a clean experiment. Lock all variables before the test begins.

Ignoring Segment-Level Effects

An overall winner might not be the winner for every segment. Variant A might outperform among highly engaged subscribers while Variant B wins among less active contacts. If your audience is heterogeneous, consider running segment-level analyses after the main test concludes. Platforms like Trackly make this feasible through audience segmentation with custom labels and engagement scoring, allowing you to break down results by meaningful subscriber groups.

Conflating Clicks With Conversions

A message that generates more clicks does not necessarily generate more conversions or revenue. If your goal is downstream conversion, measure that — not just CTR. This often requires longer observation windows and integration with conversion tracking systems.

Practical Tips for SMS Marketers With Smaller Lists

Not every SMS program has hundreds of thousands of subscribers. If you are working with a list of 5,000–10,000 contacts, the following approaches can help you run meaningful tests.

Quick-Reference Checklist for Your Next SMS A/B Test

  1. Define the metric you are optimizing (CTR, conversion rate, opt-out rate).
  2. Establish your baseline rate from recent campaign data.
  3. Set your minimum detectable effect — the smallest improvement worth detecting.
  4. Choose your significance level (typically 0.05) and power (typically 0.80).
  5. Calculate the required sample size per variant.
  6. Confirm your list is large enough. If not, adjust your minimum detectable effect or test design.
  7. Randomly assign contacts to variants.
  8. Lock all variables (copy, links, landing pages, send time).
  9. Send the test and wait for the full observation window.
  10. Analyze results using confidence intervals, not just p-values.
  11. Check for practical significance, not just statistical significance.
  12. Document the result and apply the learning to future campaigns.

Moving From Manual Testing to Continuous Optimization

Understanding sample size and statistical significance is foundational knowledge for any SMS marketer running tests. It prevents premature decisions, wasted budget, and false confidence in results that are really just noise.

The trajectory of the industry, however, is moving toward systems that handle this math continuously and automatically. Trackly's algorithmic creative selection is one example: by using Thompson Sampling to dynamically allocate traffic across message variants, it reduces exploration cost while converging on top-performing creative. Marketers still need to supply strong, differentiated creative variants — the algorithm cannot optimize if all the inputs are mediocre — but the statistical machinery runs in the background.

Whether you run tests manually or use algorithmic tools, the principles remain the same. Know your baseline. Define what improvement matters. Ensure you have enough data to detect it. And resist the urge to call a winner before the math supports it.