Imagine you’ve been working on optimizing a site for a while now, say 3, 6 or even 12 months.

You’ve had solid winners each month, and you’re confident in the test results. These are not imaginary lifts. But now your conversion rate looks the same as when you started. How do you explain this to the boss/client?

Another scenario: you’ve been optimizing for 12 months and your revenue per customer has increased by 2%. Same question: how can you justify your contribution? How can you tell what caused that – optimization, SEM, seasonality, word-of-mouth, or something else?

How do you measure the ROI of your optimization efforts? The question is actually more complicated than it sounds.

ROI is Difficult to Measure

In fact, it’s easy to project the predicted ROI of optimization (click here to download conversion optimization ROI calculator). It’s just really hard to measure it, post-hoc.
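
To show what such a projection involves, here is a minimal sketch. Every input (traffic, conversion rate, order value, expected lift, program cost) is a made-up assumption, the function name is hypothetical, and this is not necessarily the formula the calculator mentioned above uses.

```python
def projected_roi(monthly_visitors, conversion_rate, avg_order_value,
                  expected_relative_lift, months, program_cost):
    """Project the ROI of an optimization program from an assumed lift."""
    baseline_revenue = (monthly_visitors * conversion_rate
                        * avg_order_value * months)
    # Extra revenue if the assumed lift actually holds for the whole period.
    extra_revenue = baseline_revenue * expected_relative_lift
    return (extra_revenue - program_cost) / program_cost


# Hypothetical inputs: 100,000 visitors/month at a 2% conversion rate,
# a $50 average order, an assumed 10% relative lift held for 12 months,
# and $60,000 spent on the program.
roi = projected_roi(100_000, 0.02, 50.0, 0.10, 12, 60_000)
print(f"Projected ROI: {roi:.0%}")
```

The weak point is the `expected_relative_lift` input: it is a guess going in, which is exactly why measuring it afterwards is the hard part.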

In 2012, MarketingSherpa posed the question, “Did optimization or testing demonstrate ROI in 2011?” Here are the results:

[Image: ROI of optimization chart]

Not surprising, really. Measuring the ROI of optimization is hard. If anything, I’m skeptical of the 38% that demonstrated positive ROI. How, indeed, did they demonstrate it?

There’s a quote in the article from Amelia Showalter, former Director of Digital Analytics for Obama for America, that explains how hard it is to track and measure everything, at least in the long term:

“When we’re working on the campaign, we’re actually working so hard to run all those tests that we didn’t always keep perfect track of exactly what results were long term. It’s hard to calculate this stuff out when we want to put all our resources into running more tests. So, we don’t actually ever have a perfect estimate of actually how much extra revenue was due to our testing, but I think that $200 million is a fairly reasonable estimate.”

The article also sums things up by saying, “You can also take heart that if you’re running valid tests, you are likely improving the bottom line.”

While that’s heartwarming, it’s not going to satisfy a neurotic boss or client. We’ve got to find a way to measure our impact. How can we possibly do that?

Time Period Comparison in Analytics (and Why It’s Wrong)

If asked to measure improvement in conversion rate due to optimization efforts, most people would point to Google Analytics. They would perform a time period comparison, looking back 6-12 months to when the campaign started and comparing the conversion rate then with the conversion rate now (linear analysis).

This won’t tell the full story for a few reasons, the big one being the variability of your traffic quality.

As Chris Stucchio from VWO said:

Chris Stucchio:
“Time period comparison doesn’t work simply because, in mathematical terms, time period is not statistically independent of visitor behavior. Visitors arriving, e.g., before Valentine’s day, are simply more likely to buy flowers. So if time period A is in Jan and time period B is in Feb, you’ll sell more flowers in period B regardless of any changes you’ve made to the site. There are a few ways of drawing weak inferences from time period comparisons, e.g. Google’s causal impact and similar tools, but these are pretty advanced, hard to use, and strictly less accurate than a holdback set.”

Several things can affect your traffic quantity and quality, including but not limited to:

  • Season
  • Holidays
  • Press (positive or negative)
  • SEO
  • Word-of-Mouth

Let’s say you’re at a 2% conversion rate with 100,000 monthly visitors to start. Over the course of a year, a lot can change the quality of your traffic. If you’re selling novelty gifts, the holidays might improve your conversion rate with negligible impact from your optimization efforts. Similarly, if you hit the front page of Hacker News, you’ll get a lot of traffic – but the quality might be really shitty, lowering your present average conversion rate.
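
The traffic-mix effect is easy to see with a little arithmetic. The sketch below uses made-up numbers: regular traffic improves from a 2.0% to a 2.4% conversion rate, but a hypothetical spike of 50,000 low-intent visitors converting at 0.2% lands in the same period.

```python
def blended_rate(segments):
    """Overall conversion rate for a list of (visitors, conversion_rate) pairs."""
    visitors = sum(v for v, _ in segments)
    conversions = sum(v * r for v, r in segments)
    return conversions / visitors


# Before: only the regular 100,000 monthly visitors at 2.0%.
before = blended_rate([(100_000, 0.020)])

# After: regular traffic now converts at 2.4% (a real improvement),
# plus 50,000 assumed low-intent visitors converting at 0.2%.
after = blended_rate([(100_000, 0.024), (50_000, 0.002)])

print(f"before: {before:.2%}, after: {after:.2%}")
```

In this scenario the site got better for the visitors it already had, yet the overall rate in analytics went down.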

Conversion Rates Are Non-Stationary Data

“I have seen the future and it is very much like the present, only longer.”
Kehlog Albran, The Profit

A stationary time series is one whose statistical properties (mean, variance, autocorrelation, etc) are constant over time. According to an article on Duke University’s website, “A stationarized series is relatively easy to predict: you simply predict that its statistical properties will be the same in the future as they have been in the past!”

But as Investopedia says, data points are often non-stationary:

“Non-stationary data, as a rule, are unpredictable and cannot be modeled or forecasted. The results obtained by using non-stationary time series may be spurious in that they may indicate a relationship between two variables where one does not exist.”

[Image: stationary data]

As Andrew Anderson from Malwarebytes told me, “All data is sinusoidal, it goes up and it goes down, despite test results.”

That’s essentially the nature of data. Whether because of seasonality, day of week, external factors, press, advertising, etc, data just fluctuates. Even if you didn’t change anything on your site for a month, you’re not going to get the same result every day. It will fluctuate – sometimes a little, sometimes a lot.
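
Even pure sampling noise produces this kind of movement. The following sketch simulates 30 days on a site whose true conversion rate never changes; the 2% rate, the 3,333 daily visitors, and the fixed seed are all assumptions chosen for illustration, and real data would add seasonality and external factors on top.

```python
import random


def simulate_daily_rates(true_rate, visitors_per_day, days, seed=0):
    """Observed daily conversion rates when nothing on the site changes."""
    rng = random.Random(seed)
    rates = []
    for _ in range(days):
        # Each visitor converts independently with the same probability.
        conversions = sum(rng.random() < true_rate
                          for _ in range(visitors_per_day))
        rates.append(conversions / visitors_per_day)
    return rates


rates = simulate_daily_rates(0.02, 3_333, 30)
print(f"lowest day: {min(rates):.2%}, highest day: {max(rates):.2%}")
```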

Andrew cites this as the reason time period comparison in analytics won’t work for accurately measuring your ROI, and he gives a great example below:

Andrew Anderson, Head of Optimization at Malwarebytes
“Because linear analysis can only show you where you are, not where you should have been. You can have a much better outcome and still end up lower than when you started. Just as you can have an awful outcome and end up higher than when you started.

In both of the examples the blue line is the best option, but in one, the yellow line (the worst outcome) is much better than when the test started. Just as in the second part the blue line is very superior but also lower overall than the start of the period.”

[Image: sinusoidal data]

He continues:

“You can be costing your company millions and think that everything is better by relying on pre/post. Because of this it is less useful than just flipping a coin. Both have nothing to do with measuring the outcome of a change, but at least with the coin you won’t convince yourself that the data means something.”

[Image: fluctuating data sets. You can see the fluctuations in the raw data as well as the trends.]

A Possible Exception

After talking to Craig Sullivan, I found out it is possible to do time period comparison. However, you have to have a predictable traffic stream (i.e., PPC) and even then it is rough. Craig explains it well:

Craig Sullivan, Optimal Visit

“The problem with time period comparison is that you can’t ensure that the visitor traffic is reliable over the two periods. Seasonality, advertising, marketing, competitor activity, market changes, weather and many factors can skew your two samples, so you’re not comparing apples with apples. However, if you can maintain a predictable traffic stream of people with similar intent and makeup across the two time periods, you may have a chance to do a rough comparison. For example, if my PPC advertising is consistent over an 8 week period, I can make a flawed but useful comparison of the before and after effects – because I’m comparing a segment rather than ‘all traffic behaviour’.

Of course, this won’t tell me with precision but it can indicate if it’s much worse, much better or we’re not really sure. If I’m comparing deep segments (for example, the people that arrive and start filling out a lead gen form for a tractor model) then why not compare them over the two time periods? Sure there is variability but some of these segments are consistent in desire and intent across the time periods. If I’m comparing traffic at the outer layers of a site, time period comparison becomes much less reliable.

Lastly, there are some tools now like GA Effect that help you work out if the change you made was responsible for the ‘effect’ you saw. Did those 30 new SEO pages cause the rise in conversion or was it just noise or something else? I feel that time period analysis is flawed in the way it’s normally approached – there are some exceptions though!”

If we were to assume that PPC traffic is “reliable”, we’d also have to assume that you haven’t changed your daily budget, your keywords, or your ad copy. Change any of those and there are too many variables; it’s not the same traffic anymore. In addition, the variables are constantly changing in AdWords, sometimes daily.

Also, you can’t draw broad conclusions from PPC data because you can’t assume that all traffic sources will behave similarly. What works for PPC traffic might not work for returning direct traffic, SEO traffic and so on.

Tests to Gauge Impact

“It can be extremely difficult to explain results when it looks like things are flat or overall down. The fundamental problem is that people are using a linear correlative data set instead of the comparative data that a test provides, or in other words you are saying that you are X percent better, not necessarily X percent better of a specific number. All data is sinusoidal, it goes up and it goes down, despite test results.”

-Andrew Anderson

If time period comparison won’t work, what will? There are a few ways to measure impact. None of them are perfect – and there are pros and cons of each – but nonetheless, they’re better than nothing.

1. Retest old versions of the site later on

One of the easiest ways to measure ROI is to retest old versions of the site as part of larger tests later on. Basically, all the changes made during the testing period (combined into one metric) are tested against the old version.

As Andrew Anderson said:

Andrew Anderson, Head of Optimization at Malwarebytes
“After 5 months of continual optimization and an estimated impact of about 120% increase we decided to include the original design of the site as part of a larger test. What we found was that the actual impact was actually 257% better and that what was really driving down performance was an SEM approach of pushing for quality leads over all leads, dropping overall performance. This led to eyes being diverted from blaming the testing program and instead a re-evaluation of the SEM program, while at the same time really granting the optimization program unlimited access to changing the user experience.”
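
A retest like this boils down to comparing the original design against the current one in the same time window. The sketch below is one way to summarize such a comparison, assuming made-up visitor and conversion counts; the function name is hypothetical, and the interval uses the standard log-ratio approximation, which is only one of several reasonable choices.

```python
import math


def relative_lift_ci(conv_old, n_old, conv_new, n_new, z=1.96):
    """Relative lift of new over old, with an approximate 95% interval.

    Uses the normal approximation on the log of the rate ratio.
    """
    rate_old, rate_new = conv_old / n_old, conv_new / n_new
    ratio = rate_new / rate_old
    se_log = math.sqrt(1 / conv_new - 1 / n_new + 1 / conv_old - 1 / n_old)
    low = ratio * math.exp(-z * se_log)
    high = ratio * math.exp(z * se_log)
    return ratio - 1, low - 1, high - 1


# Hypothetical retest: original design vs. current design, run side by side.
lift, low, high = relative_lift_ci(conv_old=180, n_old=10_000,
                                   conv_new=260, n_new=10_000)
print(f"cumulative lift: {lift:.0%} (roughly {low:.0%} to {high:.0%})")
```

Because both versions see the same traffic at the same time, seasonality and traffic-mix shifts affect them equally, which is what a pre/post comparison cannot offer.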

Like most methods here, though, this one has pros and cons. According to Craig Sullivan, if you’ve been continuously improving and learning, it might not be worth the time to test an old version. Craig:

Craig Sullivan, Optimal Visit
“It’s like McDonald’s saying ‘Let’s go back to the 2007 restaurant format, to see how it works’ or Facebook saying ‘Let’s use an old design for the app’ – it makes sense to validate this stuff but I think it makes more sense to move the product on so far that this validation is worthless.”

2. Weak Causal Analysis

Another method is weak causal analysis.

As Andrew Anderson said, “use weak causal analysis to get a read on estimated impact. In both cases (cause analysis and retesting old versions) you will often find that you are actually having a bigger impact than you imagine. It is important that you are doing this analysis without prompting and proactively giving others a full evaluation of the overall program.”

What is weak causal analysis? Basically this: Do a long term trend line with an estimated error rate. Take that based on prior data before the change and look at the outcome as compared to the expected outcome of the trend line. Make sure you are using independent variables as a basis (like users) so that you can get some read on where you would have been versus where you are.
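
The procedure just described can be sketched in a few lines. This is a crude illustration and not a substitute for a proper causal model: the weekly revenue-per-user figures are invented, the function names are hypothetical, and the error band is simply 1.96 times the residual standard deviation of the pre-change fit, which understates the uncertainty of extrapolating.

```python
import statistics


def fit_trend(values):
    """Least-squares line through (0, v0), (1, v1), ...

    Returns (slope, intercept, residual standard deviation).
    """
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = statistics.fmean(values)
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    residuals = [y - (intercept + slope * x) for x, y in enumerate(values)]
    resid_sd = (sum(r * r for r in residuals) / (n - 2)) ** 0.5
    return slope, intercept, resid_sd


def compare_to_trend(pre, post, z=1.96):
    """For each post-change point: (expected, actual, gap, outside_band)."""
    slope, intercept, sd = fit_trend(pre)
    report = []
    for i, actual in enumerate(post):
        expected = intercept + slope * (len(pre) + i)
        gap = actual - expected
        report.append((expected, actual, gap, abs(gap) > z * sd))
    return report


# Hypothetical weekly revenue per user: drifting down before the change...
pre = [2.00, 1.96, 1.93, 1.88, 1.86, 1.81, 1.78, 1.75]
# ...and still below the starting point afterwards.
post = [1.80, 1.79, 1.77, 1.76]

for expected, actual, gap, outside in compare_to_trend(pre, post):
    print(f"expected {expected:.2f}, actual {actual:.2f}, "
          f"gap {gap:+.2f}, outside band: {outside}")
```

In this made-up series the post-change numbers are lower than where the period started, yet they sit well above where the prior trend was heading.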

[Image: correlation vs causation]

“Anything that can approximate causal information is better than nothing but has a much higher chance of Type I or Type II errors (a ‘false positive’ and a ‘false negative,’ respectively),” according to Andrew.

Not perfect but better than nothing.

3. Measure Impact Through Various Stages of a Funnel

According to Chris Stucchio, another method is to attempt to “measure the effect of your optimizations on the various stages of a funnel.”

Say these are your stages:

  • Step 1: click from email to site
  • Step 2: add product to cart
  • Step 3: go to checkout
  • Step 4: buy

It’s possible you might not have enough data to actually measure a difference at step 4. But as Chris said:

“You can often infer data about step 4 from steps 1-3 (i.e. if you made a significant impact on the percentage of people reaching step 3, it is *likely* (though not guaranteed) that you increased conversions). There are rigorous ways to estimate this statistically, but they are again somewhat difficult to do.”
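
To make the idea concrete, here is a rough sketch with invented funnel counts for a control and a variant. It runs a plain two-proportion z-test on the share of entrants reaching each step; this is a simple check, not the rigorous estimation Chris alludes to, and all names and numbers are assumptions.

```python
import math


def z_score(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-score for B versus A (pooled standard error)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se


STEPS = ["click from email", "add to cart", "go to checkout", "buy"]
# Hypothetical visitor counts reaching each step.
control = [10_000, 1_500, 600, 200]
variant = [10_000, 1_700, 720, 228]

z_by_step = {}
for step, c, v in zip(STEPS[1:], control[1:], variant[1:]):
    # Compare the share of all entrants (step 1) that reached this step.
    z_by_step[step] = z_score(c, control[0], v, variant[0])
    print(f"{step}: z = {z_by_step[step]:.2f}")
```

With these numbers the difference at the checkout step clears the usual 1.96 threshold while the difference at the purchase step does not, simply because fewer people reach the last step.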

4. Send a small part of your traffic to a consistent base

Here’s what Lukas Vermeer said in a previous quote:

Lukas Vermeer, Data Scientist at

“If you really want to know, you can do that by sending a small fraction of all traffic to a consistently stable base, which never changes. That will tell you how much better your site is performing now than it was before.

Usually this is technically difficult, but might be needed to shed light on the cost of inaction/not testing.”

Sending a small part, 5-10% of your total traffic, to a consistent control seems to be the most accurate way to track impact of optimization. This is the method that I heard most consistently from expert optimizers, anyway.
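
One common way to keep the same visitors in the control over time is to bucket them by a hash of a stable identifier. The sketch below assumes such an identifier exists (a hypothetical `visitor_id`, for example from a first-party cookie); the 10% share and the salt string are illustrative choices, not a recommendation from the experts quoted here.

```python
import hashlib

HOLDBACK_SHARE = 0.10  # assumed fraction of traffic that never sees changes


def in_holdback(visitor_id: str, salt: str = "holdback-v1") -> bool:
    """Deterministically assign a visitor to the holdback group."""
    digest = hashlib.sha256(f"{salt}:{visitor_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < HOLDBACK_SHARE


# The same visitor always lands in the same group, and roughly 10% are held back.
share = sum(in_holdback(f"visitor-{i}") for i in range(100_000)) / 100_000
print(f"share of visitors in holdback: {share:.1%}")
```

Hashing avoids storing an assignment table, and keeping the salt fixed keeps group membership stable for as long as the holdback runs.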

Chris Stucchio explains how this works:

Chris Stucchio:
“The only real reliable way to measure the ROI of a sequence of optimizations is to use a holdback set. One of our (CRO only) agency partners plans, with all their customers, to refuse to optimize more than 90% of traffic. Then they will compare the traffic of their 90% to the remaining 10%; the difference (provided it is statistically significant) can be reasonably attributed to them.”
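
As a sketch of that comparison, the snippet below takes invented conversion counts for the optimized 90% and the untouched 10%, computes the difference in conversion rate with an approximate 95% interval, and scales it to incremental conversions. The function name and all figures are assumptions for illustration.

```python
import math


def holdback_difference(conv_opt, n_opt, conv_hold, n_hold, z=1.96):
    """Difference in conversion rate (optimized minus holdback) with a
    normal-approximation confidence interval."""
    p_opt, p_hold = conv_opt / n_opt, conv_hold / n_hold
    diff = p_opt - p_hold
    se = math.sqrt(p_opt * (1 - p_opt) / n_opt
                   + p_hold * (1 - p_hold) / n_hold)
    return diff, diff - z * se, diff + z * se


# Hypothetical year: 900,000 optimized visitors, 100,000 held back.
n_opt = 900_000
diff, low, high = holdback_difference(24_300, n_opt, 2_200, 100_000)

# If the interval excludes zero, scale it to extra conversions on optimized traffic.
print(f"rate difference: {diff:.2%} ({low:.2%} to {high:.2%})")
print(f"incremental conversions: {diff * n_opt:,.0f} "
      f"({low * n_opt:,.0f} to {high * n_opt:,.0f})")
```

Multiplying the incremental conversions by average order value, and subtracting program cost, turns this into a measured rather than projected ROI.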

Of course, the question then is one of opportunity cost. If you’re not optimizing 10% of your traffic, you’re (maybe) missing out on increased revenue. You’re also left with less traffic to test on, so tests will take longer to reach significance.
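
The slowdown can be estimated with a standard sample-size approximation for comparing two proportions. The sketch below assumes a 2% baseline rate, a 10% relative lift to detect, 3,333 daily visitors, and the conventional 5% significance and 80% power; the function names are hypothetical.

```python
import math
from statistics import NormalDist


def visitors_per_arm(base_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed in each arm of a two-arm test."""
    p1 = base_rate
    p2 = base_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def days_to_finish(daily_visitors, testable_share, n_per_arm, arms=2):
    """Days needed when only part of the traffic can enter tests."""
    return math.ceil(arms * n_per_arm / (daily_visitors * testable_share))


n = visitors_per_arm(base_rate=0.02, relative_lift=0.10)
days_full = days_to_finish(3_333, 1.0, n)
days_with_holdback = days_to_finish(3_333, 0.9, n)
print(f"{n:,} visitors per arm: {days_full} days on all traffic, "
      f"{days_with_holdback} days with a 10% holdback")
```

Holding back 10% stretches each test by a factor of about 1/0.9, or roughly 11%, which is the price paid for a trustworthy long-term baseline.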

Are There Opportunity Costs?

As Peep said in a previous article, “Testing something is an opportunity cost – means you can’t test something else. While I’m re-validating something here, I could be testing something else that gives me a lift (but of course, it’s not possible to know whether it would). It’s also questionable whether you should be re-testing it.”

Or as Joshua Kennon put it, “everything in life has an opportunity cost.”

This is a question of your specific goals and risk tolerance. Andrew Anderson explains that it’s always worth it to improve your performance, which might mean taking the time to test impact over the long term:

Andrew Anderson, Head of Optimization at Malwarebytes
“It is always worth trying to improve your performance, but if you are too small to do more rigorous testing, then the key is to go big or don’t do it at all. The issue is that it is really hard to measure small changes (like 2%, or even 20% in some cases). Those changes add up and can be more valuable than shooting for a home run, but if you can’t rely on the data, you have to keep trying to change the largest things and hope you get a meaningful and business shaping impact.

Remember that your opinion is always the most limiting factor, and the smaller you are the more you need to go past your comfort zone.”

Here’s Craig’s take on opportunity costs:

Craig Sullivan, Optimal Visit
“There are many arguments here about measuring all the small changes and not being able to separate these influences from background noise. From my work, site optimisation done with velocity and prioritised changes, as well as fixing bugs and broken stuff, will deliver a lift, whether you care to measure the atomic impacts or not.

My advice – if you’re not continually improving the key metrics you live by or are optimising for then does it matter whether you can measure each tiny impact?

This is why I batch changes or bugs onto a site or optimise 30 page templates rather than one – because I can clearly see impact from the compounding nature of the changes I’ve made.

Fix 235 browser bugs and you’ll lift conversion – do you need to measure or test each one? Hell no.

Take the lift, move on and keep testing faster and better, with quality hypotheses. Aiming for velocity and prioritised volume testing will get you places that endless over analysis will not.”

Then again, optimization is more than just A/B testing and lifts. Matt Gershoff, CEO of Conductrics, put it well, saying part of it is about “gathering information to inform decisions.” In other words, optimization is about reducing uncertainty, and therefore risk aversion, in decision making. So you have to factor in everything else you gain from conversion optimization.

Craig also mentioned that conversion optimization isn’t just about the testing. It’s about the big picture:

Craig Sullivan, Optimal Visit
“You’re just running a bunch of tests whose expected output is valuable knowledge for the business – it isn’t about the testing, it’s about setting it up for learning. The quality of the work comes from forming test hypotheses and test programs that drive learning – not short term impacts. It’s like creating Intellectual Property for your business out of customer data.”


Measuring ROI is hard. But there are a few ways to do it.

There are a few ways to estimate impact more rigorously than a simple before-and-after comparison (GA Effect, weak causal analysis). Time period comparison is generally wrong because the data is non-stationary, though as Craig mentioned, there are a few exceptions where you can get a rough estimate: if you have stable and controllable traffic, as with PPC (though you might not be able to draw overall conclusions this way). Finally, one of the most common answers I found was to send a small, consistent share of traffic to a holdback set.

Keep in mind, too, that when done correctly, optimization and the insight you gain can be used in all of your marketing. It’s a process that leads to information that informs better decisions, so the return on investment compounds with the customer insight you gain.