Email A/B Testing: What to Test and How to Read Results (2026)
Most email A/B tests are run incorrectly. Not because the testers do not care — but because the way most email platforms present A/B testing makes it deceptively easy to draw the wrong conclusions from real data.
You split your list 50/50. Variant A gets a 24% open rate. Variant B gets a 27% open rate. Variant B wins, so you apply it to future campaigns. But your list was 800 subscribers: 400 per variant, meaning the gap was roughly 12 opens, on a test that needed around 2,400 subscribers per variant to be statistically meaningful. You have just made a confident decision based on noise.
This guide covers email A/B testing the right way: what to test and in what order, how to structure tests so results are actually trustworthy, how to interpret statistical significance without a statistics degree, and how to build a compounding testing programme that consistently improves every performance metric over time.
What Is Email A/B Testing?
Email A/B testing — also called split testing — is the practice of sending two or more versions of an email to different portions of your list, measuring the response to each version, and using the results to identify which performs better.
The logic is straightforward: instead of guessing which subject line, send time, or CTA will perform better, you test both simultaneously with real subscribers and let the data decide.
The key distinction: A/B testing is not the same as sending two different campaigns and comparing them. A valid A/B test changes only one variable between the two versions and sends them simultaneously to randomly divided segments of the same list. Any other design introduces confounding variables — differences in send time, audience composition, or seasonal context — that make the results uninterpretable.
Why Most Email Tests Fail to Produce Useful Results
Before covering what to test, it is worth understanding why so many A/B tests produce misleading data — because the mistakes are systematic and avoidable.
Mistake 1: Testing With Too Small a Sample
The most common and most consequential mistake. Every A/B test produces a result — but not every result is meaningful. The question is not "which variant got a higher open rate" but "is the difference large enough to be confident it reflects a real preference rather than random chance?"
Statistical significance requires a minimum sample size that depends on your baseline conversion rate and the size of improvement you are trying to detect. For email open rates:
| Baseline Open Rate | Detectable Lift | Minimum List Size Per Variant |
|---|---|---|
| 20% | 5% relative (20% → 21%) | ~16,000 subscribers |
| 20% | 10% relative (20% → 22%) | ~4,000 subscribers |
| 20% | 20% relative (20% → 24%) | ~1,000 subscribers |
| 30% | 10% relative (30% → 33%) | ~4,500 subscribers |
Practical implication: Most businesses cannot run statistically valid tests for small improvements. Instead, test larger changes — dramatic differences in subject line approach, not minor word tweaks — so the signal-to-noise ratio is high enough to be detectable with realistic list sizes.
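To make the noise problem concrete, here is a minimal simulation sketch in Python of the scenario from the introduction: two identical variants, 400 subscribers each, both with a true 24% open rate and no real difference between them. The trial count and structure are illustrative choices, not taken from any particular platform.

```python
# A/A simulation: no real difference between variants, 400 subscribers each.
# How often does pure chance produce a gap of 3 percentage points or more?
import random

random.seed(1)
trials, n_per_variant, true_rate = 10_000, 400, 0.24
big_gaps = 0
for _ in range(trials):
    opens_a = sum(random.random() < true_rate for _ in range(n_per_variant))
    opens_b = sum(random.random() < true_rate for _ in range(n_per_variant))
    if abs(opens_a - opens_b) / n_per_variant >= 0.03:
        big_gaps += 1

print(f"Gap of 3+ points from chance alone: {big_gaps / trials:.0%} of simulated tests")
```

Run as written, this typically reports a gap that large in roughly a third of the simulated tests, which is why a 3-point "win" on a 400-per-variant split is not evidence of anything.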
Mistake 2: Testing Too Many Variables Simultaneously
Testing subject line AND from name AND send time in the same test makes it impossible to know which change drove the result. This is multivariate testing — which requires exponentially larger sample sizes to produce valid results. A/B testing means one variable at a time.
Mistake 3: Stopping the Test Early
When one variant pulls ahead quickly, the temptation is to stop the test and declare a winner. This leads to a well-documented statistical phenomenon called "peeking" — the early leader is often just the beneficiary of early variance, and longer-running tests frequently reverse or narrow the gap. Let the test run to its predetermined sample size before reading results.
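A small simulation also illustrates why peeking inflates false positives. The sketch below assumes two identical variants and a significance check every 200 subscribers; all figures are illustrative rather than drawn from real campaign data.

```python
# Peeking simulation: two identical variants, checked every 200 subscribers.
# Declaring a winner at the first check that crosses 95% confidence produces
# far more false positives than a single check at the predetermined sample size.
import random
from math import sqrt
from statistics import NormalDist

random.seed(7)
trials, n_final, true_rate, check_every = 2_000, 2_000, 0.24, 200
z_crit = NormalDist().inv_cdf(0.975)  # two-sided 95% threshold

peeking_false_wins = final_false_wins = 0
for _ in range(trials):
    opens_a = opens_b = 0
    called_early = False
    for i in range(1, n_final + 1):
        opens_a += random.random() < true_rate
        opens_b += random.random() < true_rate
        if i % check_every == 0:
            pooled = (opens_a + opens_b) / (2 * i)
            se = sqrt(pooled * (1 - pooled) * 2 / i)
            significant = abs(opens_a - opens_b) / i / se > z_crit
            called_early = called_early or significant
            if significant and i == n_final:
                final_false_wins += 1
    peeking_false_wins += called_early

print(f"False winners when stopping at the first significant peek: {peeking_false_wins / trials:.0%}")
print(f"False winners when testing once at the full sample size:   {final_false_wins / trials:.0%}")
```

The repeated checks typically flag a false winner several times more often than the single end-of-test check, even though neither variant is actually better.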
Mistake 4: Not Testing the Same Audience Over Time
Running a test on Monday to engaged subscribers and comparing it to Tuesday's test on your full list is not a valid A/B test — the audiences are different. Always test within the same send, to randomly divided segments of the same list, at the same time.
Mistake 5: Treating Open Rate as a Reliable Metric Since Apple MPP
Since Apple Mail Privacy Protection launched in 2021, open rates have been inflated for senders with significant Apple Mail audiences. MPP pre-fetches email content and registers an "open" regardless of whether the email was actually viewed. For subject line tests — where open rate is the primary metric — Apple MPP introduces noise. Supplement open rate testing with click rate testing on campaigns where the audience mix is uncertain.
What to Test and in What Order
Not all email variables are equally testable or equally impactful. This priority order reflects both the size of the potential improvement and the sample size required to detect it reliably.
Priority 1: Subject Line (Highest Impact, Easily Detectable)
Subject line is the most impactful variable in email marketing because it determines whether the email gets opened at all — and a 20–30% relative improvement in open rate from a better subject line has a compounding effect on every downstream metric.
Subject line tests are also the easiest to run because the metric (open rate) is available within hours and the sample size requirement is lower than for click or conversion tests: open rates are much higher than click rates, so the same relative lift is a larger absolute difference and fewer subscribers are needed to detect it.
High-value subject line variables to test:
| Variable | Option A | Option B |
|---|---|---|
| Format | Question: "Are you making this email mistake?" | Statement: "The email mistake costing you 30% of opens" |
| Length | Short (under 40 chars): "Your account needs attention" | Long (50+ chars): "Three things to check before your next email campaign" |
| Personalisation | With first name: "Sarah, your open rate dropped" | Without: "Your open rate dropped — here's why" |
| Urgency | Time-limited: "Offer ends tonight at midnight" | Value-first: "The deliverability fix most marketers miss" |
| Specificity | Specific number: "7 email tests worth running in 2026" | General: "Email tests worth running this year" |
| Tone | Formal: "Quarterly performance review available" | Conversational: "Quick question about your Q1 sends" |
What makes a valid subject line test: The email body, send time, and from name must be identical. Only the subject line changes.
Priority 2: From Name (High Impact, Underused)
The from name is the second thing subscribers read after checking which tab the email landed in. Testing from name can reveal significant open rate differences — particularly the difference between a brand name ("Migomail") and a person name ("Hemant from Migomail").
Common from name test patterns:
| Variant A | Variant B | Expected Finding |
|---|---|---|
| Brand name only: "Migomail" | Person + brand: "Hemant at Migomail" | Person name typically wins for newsletters and smaller brands |
| Full name: "Hemant Verma" | First name only: "Hemant" | First name only often feels more personal |
| Role-based: "Migomail Deliverability Team" | Personal: "Aisha from Migomail" | Personal typically outperforms role-based |
Note: From name is tied to your From email address in most platforms. Changing the display name without changing the email address is possible — test the display name independently first.
Priority 3: Send Time and Day
Send time tests are operationally simple — same email, different dispatch times — but require careful execution to ensure the audience segments are randomly divided and not systematically biased by time zone.
Common patterns in US sender data from our email deliverability benchmarks:
- B2B: Tuesday–Thursday, 10am–12pm local time tends to outperform Monday and Friday
- B2C ecommerce: Saturday morning (8–10am) often outperforms weekday sends for promotional campaigns
- Newsletters: Sunday evening (7–9pm) performs well for content-focused newsletters
These are averages. Your specific audience may behave differently. Run a 4-week send time test — split your list 50/50, send the same campaign at two different times across four consecutive sends, then aggregate the results. Four sends per variant reduces single-send noise significantly.
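A minimal sketch of the aggregation step, assuming you have the open and send counts from each of the four campaigns; every count below is invented for illustration.

```python
# Pool opens and sends per variant across a four-send send-time test.
sends_10am = [(312, 1400), (298, 1400), (325, 1400), (303, 1400)]  # (opens, sent)
sends_7pm = [(271, 1400), (289, 1400), (266, 1400), (280, 1400)]

def pooled(results):
    opens = sum(o for o, _ in results)
    sent = sum(n for _, n in results)
    return opens, sent, opens / sent

opens_a, sent_a, rate_a = pooled(sends_10am)
opens_b, sent_b, rate_b = pooled(sends_7pm)
print(f"10am: {rate_a:.1%} of {sent_a}   7pm: {rate_b:.1%} of {sent_b}")
# Judge significance on these pooled counts, not on any single send.
```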
Priority 4: Email Length and Format
Long-form email vs short email, HTML-heavy vs plain text, image-led vs text-led — these tests measure engagement quality (click rate, replies) rather than just opens.
Format tests to run:
| Variable | Option A | Option B |
|---|---|---|
| Length | Short (under 150 words) | Long (400+ words) |
| HTML vs plain text | Designed HTML email | Plain text with minimal formatting |
| Image usage | Image in the hero section | No images, text only |
| Single vs multi-column | Single column | Two-column layout |
Plain text emails frequently outperform HTML emails for newsletters, re-engagement campaigns, and any sequence where personal connection is the goal. HTML emails outperform for product catalogues, promotional emails, and content where visual hierarchy matters. Test this for your specific use case rather than assuming one format always wins.
Priority 5: Call to Action (CTA)
CTA testing measures click rate — which is a more reliable metric than open rate in the Apple MPP era because clicks cannot be pre-fetched. However, click rates are lower than open rates (2–5% vs 20–30%), which means you need a larger sample size to detect meaningful differences.
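To put rough numbers on that trade-off, here is a sketch using the statsmodels library, one of several ways to run this calculation. The 20% open rate, 3% click rate, 20% relative lift, 95% confidence, and 80% power are all illustrative assumptions rather than figures from this guide.

```python
# Per-variant sample needed for an open rate test vs a click rate test,
# both targeting the same 20% relative lift at 95% confidence and 80% power.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

def per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
    effect = proportion_effectsize(baseline * (1 + relative_lift), baseline)
    return NormalIndPower().solve_power(effect_size=effect, alpha=alpha, power=power)

print(round(per_variant(0.20, 0.20)))  # open rate test: on the order of 1,700 per variant
print(round(per_variant(0.03, 0.20)))  # click rate test: on the order of 14,000 per variant
# Exact figures shift with the power assumption; the point is the ~8x gap between the two tests.
```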
CTA variables worth testing:
| Variable | Option A | Option B |
|---|---|---|
| Button text | "Start your free trial" | "Try Migomail free for 14 days" |
| Button vs link | Designed button | Plain text hyperlink |
| CTA position | Above the fold | Below main content |
| Number of CTAs | Single CTA | Two CTAs (primary + secondary) |
| Colour / design | Blue button | Orange button |
| First person | "Start my free trial" | "Start your free trial" |
First-person CTA text ("Start my free trial" vs "Start your free trial") is one of the most consistently replicated findings in email CTA testing — first person often outperforms second person by 5–15%, likely because it forces the reader to mentally inhabit the action rather than receiving an instruction.
Priority 6: Email Content and Body Copy
Content tests — testing different value propositions, different lead paragraphs, different proof points — are the most complex and most valuable tests in a mature email programme. They are also the hardest to isolate and require the largest sample sizes.
Content tests to run once the higher-priority variables are settled:
- Opening line: question vs statement vs fact
- Value proposition angle: benefit-led vs feature-led vs story-led
- Social proof type: customer quote vs data/statistics vs customer story
- Offer framing: percentage discount vs dollar amount discount vs free shipping
- Urgency mechanism: countdown timer vs limited stock vs expiry date
How to Structure a Valid A/B Test
Step 1: Define Your Hypothesis
Every test should begin with a specific hypothesis: "I believe [Variable A] will outperform [Variable B] because [reason]." This disciplines you to test for a reason rather than testing randomly, and it provides a framework for interpreting the results even when they surprise you.
Example hypothesis: "I believe a subject line with a specific number ('7 email tests worth running') will outperform a general subject line ('Email tests worth running this year') because our audience has shown stronger click rates on specific, numbered content in the past."
Step 2: Choose Your Metric Before Sending
Decide the success metric before the test runs — not after you see the results. The most common choices:
- Open rate: For subject line and from name tests
- Click rate (CTR): For CTA, content, and format tests
- Click-to-open rate (CTOR): For content quality tests — measures clicks per opener, removing open rate variance
- Revenue per email: For offer and pricing tests — requires ecommerce integration
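For clarity on how these metrics relate, here is a minimal sketch computing each of them from one campaign's counts; every number is illustrative.

```python
# Candidate success metrics computed from a single campaign's counts.
sent, opens, clicks, revenue = 10_000, 2_200, 310, 4_700.00

open_rate = opens / sent            # 22.0%, for subject line and from name tests
click_rate = clicks / sent          # 3.1%, for CTA, content, and format tests
ctor = clicks / opens               # 14.1%, clicks per opener, strips out open rate variance
revenue_per_email = revenue / sent  # $0.47, for offer and pricing tests

print(f"{open_rate:.1%}  {click_rate:.1%}  {ctor:.1%}  ${revenue_per_email:.2f}")
```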
Choosing your metric after seeing results (also called HARKing — Hypothesising After Results are Known) systematically produces false positives.
Step 3: Calculate Required Sample Size
Use a sample size calculator (several free online tools exist, including tools from Evan Miller and AB Testguide) before running the test. Input:
- Your current baseline metric (e.g., 22% open rate)
- The minimum improvement you want to detect (e.g., a 15% relative lift, taking 22% to 25.3%)
- Your desired confidence level (95% is the standard for business decisions)
The calculator returns the minimum number of subscribers needed per variant. If your list is smaller than 2× this number, the test will not produce reliable results — consider testing larger changes to make the signal detectable with your available sample.
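If you prefer to run the calculation yourself, the standard two-proportion approximation is a few lines of Python. The sketch below uses the Step 3 example inputs (22% baseline, 15% relative lift, 95% confidence) and assumes 80% statistical power, which the steps above do not specify.

```python
# Approximate minimum subscribers per variant for a two-proportion A/B test.
from statistics import NormalDist

def subscribers_per_variant(baseline, lifted, confidence=0.95, power=0.80):
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # 1.96 for 95% confidence
    z_beta = NormalDist().inv_cdf(power)                      # 0.84 for 80% power
    variance = baseline * (1 - baseline) + lifted * (1 - lifted)
    return (z_alpha + z_beta) ** 2 * variance / (lifted - baseline) ** 2

# 22% baseline, 15% relative lift (22% -> 25.3%), 95% confidence
print(round(subscribers_per_variant(0.22, 0.253)))  # roughly 2,600 per variant
```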
Step 4: Split Randomly
Most email platforms have built-in A/B testing that randomly assigns subscribers to variants. Use this rather than manually dividing your list — manual division introduces systematic biases (e.g., alphabetical order by name correlates with geographic and demographic patterns).
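If you ever need to split a list outside your platform, a shuffle-based split is a reasonable sketch; the addresses and seed below are placeholders.

```python
# Random 50/50 split of a subscriber list into two variants.
import random

subscribers = ["a@example.com", "b@example.com", "c@example.com", "d@example.com"]
random.seed(42)  # seeded only so the example is reproducible
shuffled = subscribers[:]
random.shuffle(shuffled)
midpoint = len(shuffled) // 2
variant_a, variant_b = shuffled[:midpoint], shuffled[midpoint:]
print(variant_a, variant_b)
```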
Step 5: Run the Test to Completion
Set a predetermined end point — typically when the required sample size is reached — and do not check results until then. If your platform sends the winning variant automatically after a test period, ensure the test period is long enough to accumulate the required sample before the winner is declared.
Step 6: Interpret Results With Appropriate Confidence
After the test completes, a result is valid if:
- The sample size per variant met or exceeded your pre-calculated requirement
- The difference between variants is larger than the margin of error at your chosen confidence level
- The test ran simultaneously (not sequentially) and to a randomly divided audience
A result that meets all three criteria is a directional finding you can act on. A result that does not meet these criteria is interesting data — but not a basis for confident action.
Reading Results: Statistical Significance in Plain Language
Statistical significance sounds intimidating but the concept is simple: how confident are you that the observed difference is real rather than random chance?
A test result at 95% confidence means: if there were genuinely no difference between the variants, a gap as large as the one you observed would appear by chance in fewer than 5 out of 100 runs of the test. In practical terms, you can be about 95% confident the difference is real; the remaining 5% of cases are false positives.
Most A/B testing tools display a confidence level or a p-value:
- p < 0.05 = 95% confidence = statistically significant at the standard business threshold
- p < 0.01 = 99% confidence = higher confidence, requires more data
- p > 0.05 = not statistically significant = do not act on this result
The practical translation: If your email platform says "Variant B is the winner" but the confidence is 78%, the result is not reliable. A "winner" at 78% confidence means you have a 22% chance of being wrong — which is too high to make permanent changes to your programme. Wait for more data or test a bigger difference.
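For readers who want to check a platform's numbers, the two-proportion z-test behind most A/B calculators is short. The open counts below are illustrative, and reporting confidence as one minus the p-value is a simplification that mirrors how many tools display it.

```python
# Two-proportion z-test: p-value and the "confidence" figure most tools display.
from math import sqrt
from statistics import NormalDist

def ab_confidence(opens_a, n_a, opens_b, n_b):
    p_a, p_b = opens_a / n_a, opens_b / n_b
    pooled = (opens_a + opens_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = abs(p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(z))  # two-sided
    return p_value, 1 - p_value

p_value, confidence = ab_confidence(opens_a=528, n_a=2400, opens_b=576, n_b=2400)
print(f"p = {p_value:.3f}, confidence = {confidence:.0%}")
# Roughly p = 0.10 and 90% confidence here: not enough to declare a winner at 95%.
```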
What to Do When Tests Are Inconclusive
An inconclusive test — where neither variant clearly wins — is not a failed test. It is meaningful information: the difference between the two variants is not large enough to matter to your audience. Options:
- Test a more extreme version: If "Get started" vs "Start your free trial" was inconclusive, test "Get started free today" vs "See results in 14 days" — a larger, more conceptually different change
- Accept the null hypothesis: Some variables genuinely do not move the needle for your specific audience. Document this and redirect testing effort to higher-impact variables
- Aggregate across multiple tests: Run the same test across 3–4 campaigns and aggregate the results. The combined sample may produce statistical significance that a single test could not
Building a Compounding Testing Programme
Individual A/B tests are useful. A systematic testing programme — where every test builds on the last and creates documented institutional knowledge about your audience — is transformational.
The Testing Calendar
Commit to one meaningful test per campaign send. After 12 months of consistent testing, you will have:
- A documented subject line playbook specific to your audience
- A clear winner on from name format
- Optimal send time and day confirmed by multiple rounds of data
- A CTA format that consistently outperforms alternatives
- Content direction validated by engagement data
This is the difference between an email programme that incrementally improves and one that plateaus.
Document Every Test
Maintain a simple testing log:
| Date | Variable Tested | Variant A | Variant B | Sample Size Each | Winner | Confidence | Learning |
|---|---|---|---|---|---|---|---|
| 2026-01-15 | Subject line format | Question | Specific number | 2,400 | Specific number | 97% | Numbers outperform questions for this list |
| 2026-01-29 | Send time | Tuesday 10am | Thursday 10am | 2,400 | Tuesday | 91% | Directional — re-test |
| 2026-02-12 | From name | "Migomail" | "Hemant at Migomail" | 2,400 | "Hemant at Migomail" | 98% | Person name wins — apply permanently |
After six months, patterns emerge that are specific to your list and your audience. These patterns are more valuable than any generic best practice guide — including this one.
Segment-Level Testing
Once your overall list testing matures, test within segments rather than across your full list. Champions (your most engaged subscribers) may respond differently to subject line styles than Cooling subscribers. Your email list segmentation guide covers how to set up the engagement tiers that make segment-level testing possible.
Testing within the Champions segment produces results faster (higher engagement rates mean smaller required sample sizes) and more reliably (less noise from disengaged subscribers who open inconsistently).
A/B Testing for Automated Drip Sequences
A/B testing is not limited to broadcast campaigns — it is equally valuable for drip campaign sequences and automation workflows.
Testing within automation sequences:
- Welcome series: test Email 1 subject line variations with new subscribers (every new subscriber is a new test participant)
- Abandoned cart: test the 30-minute vs 60-minute send delay for Email 1
- Re-engagement: test "Still want to hear from us?" vs "We saved something for you" as the Email 1 subject
Automation A/B tests accumulate sample over time as subscribers continuously trigger the sequence: a welcome series test running for 60 days with 50 new subscribers per day accumulates 3,000 subscribers in total, roughly 1,500 per variant, which is enough for subject line tests aimed at detecting larger lifts.
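A quick way to estimate how long such a test must run, assuming a steady flow of new subscribers, is sketched below; both input numbers are placeholders.

```python
# Estimate the run length of an automation A/B test from the daily subscriber flow.
from math import ceil

required_per_variant = 1_500    # from your sample size calculation
new_subscribers_per_day = 50    # daily entries into the welcome series
variants = 2

days_needed = ceil(required_per_variant * variants / new_subscribers_per_day)
print(f"Run the welcome-series test for roughly {days_needed} days")  # 60 days
```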
The advantage of automation testing:
Unlike broadcast campaign tests that run once, automation tests run continuously and accumulate statistical significance over weeks or months. They also test a consistent audience type: every participant in a welcome series test is a brand-new subscriber, which eliminates the audience composition variance that affects broadcast tests.
A/B Testing Checklist
Before the test
- Hypothesis documented — specific prediction and reasoning
- Success metric chosen before test runs — not after
- Sample size calculated using a significance calculator
- List size is at least 2× the required sample size per variant
- Only one variable changes between variants
- Both variants will send simultaneously to randomly divided segments
During the test
- Test is not being checked until the required sample is reached
- No other campaign changes happening to the same audience segment
- Platform is splitting randomly (not manually)
After the test
- Sample size requirement was met
- Confidence level is 95% or above before declaring a winner
- Result is documented in the testing log with learning noted
- Winner applied to future sends (if conclusive result)
- If inconclusive: test a more extreme variant or accept the null hypothesis
Ongoing programme
- One test per campaign send scheduled consistently
- Testing log reviewed quarterly for patterns
- Segment-level tests running in parallel with full-list tests
- Automation sequences have active A/B tests running
Frequently Asked Questions
What should I A/B test first in email marketing?
Start with subject lines. Subject line tests produce the largest and most easily detectable improvements, require the smallest sample sizes (because open rates are higher than click rates), and deliver results within hours of sending. A 15–20% relative improvement in open rate from a better subject line approach — for example, numbered lists vs general statements — compounds across every campaign you send going forward. After subject lines, test from name format (brand name vs person name), then send time, then CTA text and placement. Test one variable at a time, in this order, before moving to more complex content tests.
How many subscribers do I need to run a valid email A/B test?
It depends on your current open or click rate and the size of improvement you want to detect. For a subject line test with a 20% baseline open rate and a minimum detectable lift of 20% relative (20% → 24%), you need approximately 1,000 subscribers per variant — 2,000 total. For a smaller lift of 10% relative (20% → 22%), you need approximately 4,000 per variant — 8,000 total. Most businesses with lists under 2,000 subscribers cannot run statistically valid tests for small improvements. The solution: test larger, more dramatically different variants so the effect size is large enough to be detectable with your available list size.
How long should I run an email A/B test?
Run the test until both variants have accumulated the required sample size — not until a specific time has elapsed. For a broadcast campaign send, both variants receive their traffic simultaneously, so the test ends when the send completes. For automation sequence tests, the test runs continuously until the accumulated sample meets the threshold. The common mistake is stopping early when one variant pulls ahead — early leaders often have their lead narrow or reverse as more data accumulates. Set a predetermined sample size requirement and do not read results until it is reached.
What is the difference between A/B testing and multivariate testing in email?
A/B testing changes one variable between two versions of an email — subject line A vs subject line B, with everything else identical. Multivariate testing changes multiple variables simultaneously — subject line, from name, and CTA all tested at once across multiple combinations. Multivariate testing requires exponentially larger sample sizes (typically 50,000+ subscribers per variant combination) and is impractical for most email senders. A/B testing, with one variable and two variants per test and different variables worked through across successive campaigns, produces clear, actionable results and is appropriate for any list size above approximately 2,000 subscribers.
Does A/B testing improve email deliverability?
Not directly — A/B testing improves the engagement signals that influence inbox placement over time. When A/B testing leads to consistently higher open rates, click rates, and lower complaint rates (through better relevance), inbox providers see sustained positive engagement from your domain. Over months, this stronger engagement signal translates to better inbox placement. The most direct deliverability lever is authentication and list hygiene — covered in our email deliverability best practices guide. A/B testing is an engagement optimisation tool that, compounded over time, contributes positively to deliverability as a secondary effect.
Summary
Email A/B testing works — but only when done correctly. The failures that make most tests useless are systematic and avoidable: too small a sample, too many variables changed at once, stopping tests early, and reading results at inadequate confidence levels.
The right approach:
- Test one variable at a time — subject line first, then from name, send time, CTA, format, content
- Calculate required sample size before testing — not after observing results
- Choose your metric before sending — not based on which metric happened to show a difference
- Run to completion — do not peek at results until the required sample is reached
- Require 95% confidence before declaring a winner
- Document everything — the compounding value of a testing programme is in the patterns it reveals over months, not individual test results
A consistent one-test-per-campaign programme applied for 12 months produces a subject line playbook, a from name format, an optimal send time, and a CTA approach — all validated by your specific audience rather than generic industry averages. That documented knowledge is genuinely difficult to replicate and compounds with every campaign you send.
Start your free trial to access Migomail's built-in A/B testing — subject line, from name, send time, and content variant testing with random audience splitting, real-time significance tracking, and automatic winner application all built into the campaign builder.