AB Testing
Jun 23, 2026
Which Parts of Your Testing Workflow Should AI Handle?
AI is taking over the operational parts of testing: research, brief writing, reporting, and increasingly variant building. What still needs your judgment, and how to think about where the line sits today.

Carlos Trujillo

Testing workflows have historically carried a heavy operational load: research, briefs, building, reporting. Each required a person because there was no faster way. AI has changed that, with tools like Intelligems now handling everything from variant building to experiment analysis. What still needs you is the judgment layer. Here's where the line sits today.
The ceiling the old workflow created
Before AI (more specifically, the mass adoption of large language models, or LLMs), the time cost of running a proper experimentation program was real. Every step (research, briefs, building, reporting) required hours that came directly out of strategy time. You could run fewer tests, or you could run tests with less rigor. The tradeoff was built in.
That ceiling mattered because when the operational side expands to fill the calendar, the high-value work gets compressed: deciding what to test, understanding what a result actually means, pushing the program forward organizationally. The goal was always to spend more time thinking and less time documenting. The tools are finally catching up to that.
What you can reasonably hand off today
Research and synthesis
Reading customer reviews, scanning competitor sites, pulling research from multiple sources: this is where you tend to see gains fastest. AI doesn't bring special insight here, but it handles volume without losing focus. It'll work through hundreds of data points consistently in a way that's hard to sustain manually. Tools like Intelligems can also surface experiment ideas directly from your store's data and rank them by projected profit impact, so you're not starting from a blank page. You still decide what matters. It just gets you there faster.
Brief and hypothesis writing
Give AI solid context (what you're testing, what the data is showing, what you're trying to learn) and it can produce a cleaner brief than many practitioners write manually. Structured documents tend to be a genuine strength of current language models. That doesn't mean you skip the thinking. It means the thinking gets written up better.
Test reporting
The mechanical part of writing "test X ran for 14 days, showed a 2.8% lift in revenue per visitor, 95% likelihood of beating control" requires relatively little human judgment compared to what comes next. Intelligems AI connects to your workflow a few different ways for this: a Slack bot for plain-language questions right where your team already works, MCP integrations with Claude, ChatGPT, or Gemini for deeper analysis and cross-referencing other data sources, and an API for teams that want to build custom reporting pipelines. The common thread is cutting the time between "test ended" and "understood." What still genuinely requires judgment is the interpretation: what does this result mean for the business, and what do you do next? That's where your attention belongs.
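The mechanical part of a report line like the one quoted above is mostly arithmetic. As a rough sketch in Python, assuming you already have summary statistics for each arm (visitors, plus the mean and standard deviation of revenue per visitor), the lift and an approximate chance of beating control could be computed like this. The `Arm` class, the function names, the normal approximation, and the example numbers are all illustrative assumptions. This is not how Intelligems or any other tool calculates its figures.

```python
import math
from dataclasses import dataclass


@dataclass
class Arm:
    """Summary statistics for one arm of a test (hypothetical structure)."""
    visitors: int
    mean_rpv: float  # mean revenue per visitor
    std_rpv: float   # sample standard deviation of per-visitor revenue


def relative_lift(control: Arm, variant: Arm) -> float:
    """Relative change in revenue per visitor, e.g. 0.028 for a 2.8% lift."""
    return (variant.mean_rpv - control.mean_rpv) / control.mean_rpv


def prob_beats_control(control: Arm, variant: Arm) -> float:
    """Approximate chance that the variant's true mean exceeds control's.

    Assumption: the difference in means is roughly normal, which is a
    simplification for skewed revenue data.
    """
    standard_error = math.sqrt(
        control.std_rpv ** 2 / control.visitors
        + variant.std_rpv ** 2 / variant.visitors
    )
    z = (variant.mean_rpv - control.mean_rpv) / standard_error
    # Standard normal CDF evaluated at z.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def report_line(name: str, days: int, control: Arm, variant: Arm) -> str:
    """Build the plain-language summary sentence for a finished test."""
    lift = relative_lift(control, variant)
    prob = prob_beats_control(control, variant)
    return (
        f"{name} ran for {days} days, showed a {lift:.1%} lift in revenue "
        f"per visitor, {prob:.0%} likelihood of beating control"
    )


# Invented numbers, for illustration only.
control = Arm(visitors=20000, mean_rpv=2.50, std_rpv=6.0)
variant = Arm(visitors=20000, mean_rpv=2.57, std_rpv=6.0)
print(report_line("Test X", 14, control, variant))
```

The sentence is the easy part to automate. Whether the number it reports should change what you do next is the part that stays with you.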
Prototyping
Getting an idea from a strategist's head into a designer's hands used to involve significant back-and-forth. A rough AI-generated mockup can close that gap considerably. You're not building production UI. You're showing clearly enough what you mean that misinterpretation becomes rare.
Variant building
This is where things are moving fastest right now. Intelligems has AI-assisted variant building: describe the change you want, and the system builds it. It still works best with a clear design system and good context. But when that's in place, the execution lift can drop significantly, and a solo operator running multiple experiments simultaneously starts to become realistic.
What this means for who can actually run a program
There used to be something close to a rule in experimentation circles: you couldn't run a real testing program solo. It wasn't wrong. It reflected the actual operational weight of the work. Research. Design. Development. Someone technical for QA. The workflow was sequential enough that one delayed step meant one delayed test, and one person can't be five people at once.
That assumption is getting harder to hold. AI compresses the operational steps enough that a solo operator or a small growth team can now run a legitimate testing program. Not a stripped-down version of one, but something with real rigor.
For lean brands, this means you don't have to hire for a testing practice before you can build one.
The bigger shift here goes beyond headcount, though. When the operational overhead drops low enough, experimentation doesn't have to be its own discipline. A product manager can test a feature before rolling it out. An ecommerce manager can run a checkout flow test as part of their normal workflow, without routing it through a dedicated team. You don't need a separate optimization function to ask "should we test this before we ship it?" That question, asked consistently by the people already building the product, is most of what a real testing culture actually looks like.
What still needs you, as of June 2026
This list is going to get shorter. In a matter of months, some of these will sound like obvious things to hand off, and the honest answer is that a few probably will be. But as of right now, here's where AI still falls short.
Reading results in context
A test result doesn't mean the same thing in every situation. Did this win because the variant was better, or because you ran it during a promotional period with unusual traffic? Is this generalizable, or was the sample skewed toward a specific audience? Those questions require someone who knows the account. AI can increasingly pull context from multiple sources (calendar data, campaign logs, traffic anomalies), but right now it still needs a person to connect those dots and make the call. This is probably the next thing to shift, once tooling catches up to give AI a fuller picture of what was happening in the business during a test.
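One small piece of that context can be checked mechanically: whether the test window overlapped a promotion at all. Here is a minimal Python sketch, assuming you keep start and end dates for both. The function name and the dates are hypothetical, and the check only flags an overlap. It doesn't tell you whether the promotion distorted the result.

```python
from datetime import date


def overlap_days(test_start: date, test_end: date,
                 promo_start: date, promo_end: date) -> int:
    """Number of calendar days shared by two inclusive date ranges."""
    start = max(test_start, promo_start)
    end = min(test_end, promo_end)
    # A negative span means the ranges never touch, so report zero days.
    return max(0, (end - start).days + 1)


# Invented dates, for illustration only.
shared = overlap_days(
    date(2026, 6, 1), date(2026, 6, 14),   # test window
    date(2026, 6, 10), date(2026, 6, 20),  # promotion window
)
print(f"Test overlapped the promotion for {shared} days")
```

In this made-up case, 5 of the test's 14 days fall inside the promotion, which is the kind of flag that should prompt a closer human look before calling a winner.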
Stakeholder conversations
This one is likely last. Presenting a failed test to a skeptical CMO, explaining statistical significance to someone who wants to call a winner early, defending a testing roadmap when there's pressure to just ship: this isn't a data problem. It's a trust problem. Building and maintaining organizational belief in experimentation requires relationships, and relationships are still human. AI can help you prep for those conversations. It won't have them for you.

Why faster shipping makes measurement more important, not less
The bottleneck in ecommerce optimization used to be execution: getting tests built, shipped, and reported on. AI is removing that constraint. But when execution becomes easier for everyone, what creates a real advantage is prediction quality: how well you read your data, how accurately you interpret what a result means for your specific store, and how well-calibrated your judgment is about what to test next.
That's why the measurement layer becomes more valuable, not less, as these tools improve. When shipping variants gets faster and cheaper, more things get shipped. The risk of moving without knowing whether what you shipped actually helped goes up. That's what experimentation is for, especially when you're measuring something closer to profit per visitor than just conversion rate.
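To make the gap between conversion rate and profit per visitor concrete, here is a small Python sketch. The field names and the simple profit definition (gross revenue minus cost of goods and discounts) are assumptions for illustration, and the numbers are invented. Your own definition may also account for shipping, fees, or returns.

```python
from dataclasses import dataclass


@dataclass
class ArmTotals:
    """Totals for one arm over the test period (hypothetical fields)."""
    visitors: int
    orders: int
    revenue: float        # gross revenue, before discounts
    cost_of_goods: float
    discounts: float


def conversion_rate(arm: ArmTotals) -> float:
    """Share of visitors who placed an order."""
    return arm.orders / arm.visitors


def profit_per_visitor(arm: ArmTotals) -> float:
    """Profit divided by all visitors, not just buyers."""
    profit = arm.revenue - arm.cost_of_goods - arm.discounts
    return profit / arm.visitors


# Invented numbers: the variant adds a discount offer.
control = ArmTotals(visitors=10000, orders=300, revenue=24000.0,
                    cost_of_goods=12000.0, discounts=0.0)
variant = ArmTotals(visitors=10000, orders=340, revenue=27200.0,
                    cost_of_goods=13600.0, discounts=4080.0)

for label, arm in (("control", control), ("variant", variant)):
    print(f"{label}: conversion {conversion_rate(arm):.1%}, "
          f"profit per visitor {profit_per_visitor(arm):.3f}")
```

With these made-up figures, the variant converts better (3.4% against 3.0%) but earns less per visitor (0.952 against 1.200), so a report that stopped at conversion rate would crown the wrong winner.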
The practitioners getting the most from AI right now tend to be the ones who've used the time savings to go deeper on the parts that still require them.

Your data still points the way
AI keeps taking on more of the operational work. The judgment layer stays with you.
Start by looking at where your time actually goes in a given test cycle. Whatever step is consuming the most time without requiring your judgment is the first candidate to hand off. Try it with one task, see how far the time savings go, then point that freed-up energy at the parts that still need you.
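If it helps to make that audit explicit, the rule above fits in a few lines of Python. This is only a sketch: the `Step` structure, the hours, and the judgment flags are hypothetical placeholders to replace with your own time log.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    name: str
    hours: float           # time spent on this step in one test cycle
    needs_judgment: bool   # your own call, not something to automate


def first_handoff_candidate(steps: list[Step]) -> Optional[Step]:
    """Return the most time-consuming step that doesn't need your judgment."""
    candidates = [step for step in steps if not step.needs_judgment]
    return max(candidates, key=lambda step: step.hours, default=None)


# Invented hours, for illustration only.
cycle = [
    Step("Research and synthesis", 6.0, needs_judgment=False),
    Step("Brief and hypothesis writing", 3.0, needs_judgment=False),
    Step("Test reporting", 4.0, needs_judgment=False),
    Step("Reading results in context", 5.0, needs_judgment=True),
    Step("Stakeholder conversations", 8.0, needs_judgment=True),
]

candidate = first_handoff_candidate(cycle)
if candidate is not None:
    print(f"Hand off first: {candidate.name} ({candidate.hours} hours)")
```

Note that the largest block of time in this example, stakeholder conversations, is skipped on purpose: it is flagged as needing judgment, so research is the first thing to hand off.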


