Revisiting A/B tests in SaaS 💯
The data blindspots that colour A/B testing for SaaS products, a week of divergent yet focused building, and cracking win-win goals with partners.
Welcome to the 71st edition of The SaaS Baton. A fortnightly newsletter that brings you hand-curated pieces of advice drawn from the thoughtful founder-to-founder exchanges and interviews taking place on Relay (curated with 💛 at Chargebee for Startups) and the interwebz. So, stay tuned!
In this edition, you’ll find the following instructive and inspiring pickings:
#1: Patch’s co-founder and CEO, Whelan Boyd, tallies up the inescapable hurdles — aside from not having enough users — of running A/B tests on SaaS products.
#2: Buffer’s co-founder and CEO, Joel Gascoigne, looks back on a more org-wide, inclusive iteration of a hack week that brought about a timely probing of old ways of operating.
#3: Gusto’s co-founder and CEO, Joshua Reeves, talks about identifying, enabling, and scaling (direct and indirect) network effects in their partner program.
🗞 Recently on Relay:
Heuristics and Hunches (April 14th) — “The AI Hype Can Accelerate Your Growth, Not Create It” and Other Notes on Making a Defensible, AI-First, B2B Bet with Question Base’s Co-Founder, Yana Vlatchkova
— Sequencing AI just right in the MVP process
— How the AI hype has aided adoption (and the customer-first next steps)
— Weaving in defensibility when the big players have the most structural gains
— Doubling down on a singular advantage all founders share
#1: Revisiting A/B tests in SaaS
(From: Patch’s Whelan Boyd) (Source: Twitter)
I spent 6 years building A/B testing tools at Optimizely and yet…
I love this part of the UX Design req from Linear’s founder
“hate a/b testing and want to craft”
Running A/B tests on SaaS products poses challenges that are meaningfully different from those on unauthenticated websites.
And it’s not just having enough users to get statistically significant results.
Here are a few from UX, Data Quality & Experiment Design, and Process.
1/ Unchecked A/B testing can degrade the User Experience, outweighing the theoretical gains of the experiments.
E.g. If the “Create Issue” button keeps moving around or the list view sorting defaults keep changing, users get annoyed.
2/ If you’re comfortable with the UX risks or they don’t apply, Data Quality is your next thing to tackle.
N.B. I use the term broadly to include statistical rigor and decision-quality
To compute metrics like “tickets created” or “time to do X thing”, you first collect events.
Lots of mechanical landmines: delayed events on mobile because of connection issues, right vs. left clicks, keyboard-shortcut navigation vs. clicks.
My suggestion: pick your tools for event collection and analytics and use them exclusively.
It’s a hassle matching raw event counts across systems, let alone sessionization behavior, attribution models, and even basic metric definitions.
One of the biggest headaches at Optimizely was Data Trust.
When customers already send events to an analytics tool/data warehouse, it’s ~impossible to reconcile even basic counts.
We spent vast resources building products & processes on this.
3/ Data Quality is also a result of Experiment Design.
One challenge in SaaS products is that there’s often a mismatch between the unit of randomization and the unit of conversion.
For example, in a product like Linear, you’d want all users within an account or project to see the same experience.
Otherwise, you’d confuse users or even run into bugs. Imagine screen-sharing with a teammate whose UI was completely different.
But often you want to measure things at the user level, eg # of tasks created.
Unfortunately this is a violation of randomized controlled trial design.
You’d introduce weirdness like accounts with tons of users over-influencing metrics.
Another challenge is simply applying the treatment experience effectively.
Client-side solutions run JS to modify the HTML as the page loads. SPAs pose a challenge, though we solved this well at Optimizely.
Code-based solutions are the way to go.
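A code-based assignment can also solve the randomization-unit problem above: hash the account ID (not the user ID), and everyone in an account sees the same variant. This is a minimal sketch, not Optimizely's actual implementation; the function name and variant split are illustrative.

```python
import hashlib

def assign_variant(account_id: str, experiment: str,
                   variants=("control", "treatment")) -> str:
    """Deterministically bucket an *account* (not a user) into a variant.

    Hashing the account ID means every user in the same account sees the
    same experience (no mismatched screen-shares), while salting with the
    experiment name keeps assignments independent across experiments.
    """
    key = f"{experiment}:{account_id}".encode()
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % len(variants)
    return variants[bucket]

# Same account + experiment always yields the same variant,
# with no assignment table to store or keep in sync.
assert assign_variant("acct-42", "left-nav") == assign_variant("acct-42", "left-nav")
```

Because the assignment is a pure function of the inputs, the server, the client, and any offline analysis job all agree on who saw what without sharing state.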
4/ You must select Metrics intentionally.
It’s important to limit the number of metrics to preserve statistical power.
On a landing page or e-comm funnel, this is straightforward - drive signups, purchases, or avg cart value.
SaaS products may care about a more holistic user experience and therefore a longer list of metrics.
It’s tempting to measure all the things.
While there are statistical methods, such as Bonferroni Correction, to account for this, it’s not foolproof and it’s more work.
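The Bonferroni correction mentioned above is simple to state: with m metrics, test each one at alpha/m to keep the family-wise error rate at or below alpha. A minimal sketch (the numbers are illustrative) shows why it costs power as the metric list grows:

```python
def bonferroni_alpha(alpha: float, num_metrics: int) -> float:
    """Per-metric significance threshold under the Bonferroni correction.

    Testing each of m metrics at alpha/m bounds the probability of any
    false positive across the whole family at alpha -- but every metric
    you add makes it harder for any single one to reach significance.
    """
    return alpha / num_metrics

# A p-value of 0.01 clears the bar with 3 metrics (threshold 0.0167)
# but fails it with 10 metrics (threshold 0.005).
print(bonferroni_alpha(0.05, 3))
print(bonferroni_alpha(0.05, 10))
```

This is the "more work" trade-off in practice: the correction is trivial to apply, but it forces you to either run longer experiments or trim the metric list.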
SaaS tools also care more about long-term value creation, as that leads to renewal & expansion.
Even if this can be quantified by a composite set of metrics, many of them will only be computable with some delay, like effect on revenue expansion.
5/ One way to quantify the UX annoyance above is the Novelty Effect.
For frequent users, any UX change will likely cause a change in behavior.
Some orgs like Pinterest segment their population by “X Days since exposure” to control for this.
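The days-since-exposure segmentation can be sketched like this (toy data and cutoffs are hypothetical, not Pinterest's actual methodology): bucket each observation by how long the user has been exposed to the change, then compare metric averages across buckets.

```python
from collections import defaultdict
from statistics import mean

# (days since first exposure, metric value) -- hypothetical observations
observations = [
    (0, 9), (0, 10), (1, 8), (2, 6), (7, 5), (8, 5), (14, 5), (15, 4),
]

def segment_by_exposure(obs, cutoffs=(1, 7)):
    """Bucket observations into day 0 / days 1-6 / day 7+ segments
    and return the mean metric per segment."""
    buckets = defaultdict(list)
    for days, metric in obs:
        if days < cutoffs[0]:
            buckets["day 0"].append(metric)
        elif days < cutoffs[1]:
            buckets["days 1-6"].append(metric)
        else:
            buckets["day 7+"].append(metric)
    return {seg: mean(vals) for seg, vals in buckets.items()}

# A spike in the early segments that fades in the 7+ segment suggests
# a novelty effect rather than a durable lift.
print(segment_by_exposure(observations))
```

If the treatment effect only shows up in the freshly exposed segment, you are likely measuring novelty, not lasting behavior change.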
6/ Another challenge is dealing with Interaction Effects.
That is, when a user is included in multiple experiments, you must use analysis methods to detect the degree to which the combination impacted behavior non-additively.
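One simple way to quantify a two-way interaction, sketched with hypothetical numbers: compare the metric in the "both treatments" cell against the purely additive prediction built from each experiment's individual lift.

```python
from statistics import mean

# Per-user metric values bucketed by (experiment A variant, experiment B variant).
# In this made-up data, each experiment lifts the metric on its own, but the
# combined cell falls short of the sum of the two individual lifts.
cells = {
    (0, 0): [1.0, 1.2, 0.8],   # control / control
    (1, 0): [1.5, 1.7, 1.6],   # A treatment only
    (0, 1): [1.4, 1.6, 1.5],   # B treatment only
    (1, 1): [1.6, 1.8, 1.7],   # both treatments
}

def interaction_effect(cells):
    """How far the combined cell deviates from the additive prediction.

    Zero means the experiments compose additively; a nonzero value means
    the combination changed behavior beyond the sum of its parts.
    """
    m = {k: mean(v) for k, v in cells.items()}
    lift_a = m[(1, 0)] - m[(0, 0)]
    lift_b = m[(0, 1)] - m[(0, 0)]
    additive = m[(0, 0)] + lift_a + lift_b
    return m[(1, 1)] - additive

print(round(interaction_effect(cells), 2))  # negative: sub-additive interaction
```

In production you would fit this as an interaction term in a regression with proper error bars, but the arithmetic above is the intuition behind it.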
7/ Oft-overlooked, A/B tests can strain Internal Processes.
Imagine a CS rep answering an email with no idea whether the user is in the treatment group of the “left nav redesign” test.
What if a customer wants to be contractually excluded from all A/B tests?
With all this said, it’s pretty cool to see teams like Dropbox dealing with specific problems like “delayed metrics.”
Some of the best product teams in the world run thousands of experiments every year at massive scale.
#2: Build Week
(From: Buffer’s Joel Gascoigne) (Source: Joel.Is)
With the past week dedicated to Build Week at Buffer, it’s something that’s very fresh in my mind, and I have a number of reflections that have been forming over the weekend:
…
* “Build Week has been truly phenomenal and probably my favourite week in the 347 weeks I’ve spent at Buffer!”
* We designed Build Week to be different from a typical hack week (we’ve done many of these in the past, and found them super valuable):
* The entire goal of Build Week was framed as: creating (and shipping in some form) value, in the space of a week.
* With a traditional hack week, it was more of a time for engineers to work on what they desire, outside of the regular roadmap.
* The work in traditional hack weeks was usually focused on refactoring, bugs or small features that people didn’t feel like they had time to fit in, or just a very raw prototype of some future functionality.
* In contrast, Build Week projects were super varied, and many of them were ambitious.
* We didn’t put very much process in place. The high level guidance of ‘creating and shipping value,’ combined with the constraint of ‘within a week’ led to a ton of creativity and drive.
* In retrospect, I think we created great, simple ‘rules of the game,’ but left a ton of freedom in the how.
* This is one of my biggest takeaways of what we achieved with Build Week: the power of minimal clear constraints
* One of the significant successes of Build Week was also that we focused on bringing together people who don’t generally work together. Each team had good representation of different functions.
* We only had two required deliverables: a 2 minute video at the end of day 2, and a 4 minute video to wrap up the project on day 4 (we work a 4-day workweek). We also had a new Slack channel called #build-week for everyone to share these videos, and for general chatter and advice requests. This was the right amount of deliverables to drive some whole company connection and celebration, while allowing teams to get deep into their project.
* One of my goals was that Build Week instills a new sense of creativity, innovation and comfort with uncertainty in the team, as well as reveals how productive we can be in focused small groups. The energy during Build Week was incredible, I have no doubt Build Week will be talked about regularly for months into the future, and that we will be discussing what learnings we can take from the week into our regular work going forward.
* Personally, I had a blast. I worked in a team of 4 and we built a new page on our marketing site to showcase how distributed we are. As part of the project, I jumped back into some coding, learned much more about React and our marketing site architecture and infrastructure (e.g. Lambda functions), as well as how to create a PR and deploy our marketing site. I feel a new level of confidence to make quick fixes and changes to our marketing site and set them up for review and deployment by someone in the team. Check out the new Team Map page we built.
* Stay tuned in the coming weeks and months as we share all the projects that were worked on in Build Week and ship some awesome new functionality in Buffer that was built during the week.
#3: Gusto’s indirect growth channel
(From: Gusto’s Joshua Reeves) (Source: B2B a CEO)
If I think back to the earlier days, we really had two routes to market. It was a direct business. Really fueled by organic. We did the best we could in amplifying word of mouth. We started a content program which continues to pay dividends today.
Requires investment. It’s a slow build. But it becomes really valuable over time. We actually did some paid programmes as well but more to augment that core organic engine.
The other thing I’ll note here and this, hopefully, is a viable path for other companies too. Indirect…So we’ve continued to have a very robust indirect channel for Gusto.
Essentially it’s us working with accounting firms. When you’re thinking about your go-to-market [strategy], think about if there’s a stakeholder or someone in the mix going through their own type of transformation themselves.
In our case, there’s a lot of small accounting firms across the country. Many of them had themselves been doing payroll by hand. Not really liking it. Definitely not a high margin business for them.
Kind of more like a cost of keeping and retaining a client.
So when we came to them and said, ‘hey, we’ll do that thing that you don’t want to do anyway, you can get the credit for it, and you can refer and route them [clients] to Gusto; we’re going to help you focus on the more value-add parts of your business.’
That’s now expanded really robustly.
We have a whole program called People Advisory where we can certify and train accounting firms and professionals on how to become a people advisor and help their customers with more value-add activities.
That could be a win-win.
That’s a network effect for us. Because on the direct side, we have these small businesses joining. And we have them add their accountants to Gusto and then those accounting firms now have one client on Gusto, we try to get more and more of that book of business on to Gusto.
And it’s thousands of accounting firms that work with us today, pretty closely.
Related Relay read: Crossbeam’s Co-founder, Bob Moore, on making a vision-level decision on how partnerships will inform a business model
Until next time,
Team Relay (Chargebee for Startups)