We have previously written about how we adopted the React Native New Architecture as one way to boost our performance. Before we dive into how we detect regressions, let’s first explain how we define performance.
In browsers there is already an industry-standard set of metrics for measuring performance, the Core Web Vitals, and while they are by no means perfect, they focus on the actual impact on the user experience. We wanted something similar for apps, so we adopted App Render Complete (ARC) and Navigation Total Blocking Time (NTBT) as our two most important metrics.
We still collect a slew of other metrics – such as render times, bundle sizes, network requests, frozen frames, memory usage etc. – but they are indicators to tell us why something went wrong rather than how our users perceive our apps.
Their advantage over the more holistic ARC/NTBT metrics is that they are more granular and deterministic. For example, it’s much easier to reliably influence bundle size or total bandwidth usage, and to detect when either has changed, but such a change doesn’t automatically translate to a noticeable difference for our users.
In the end, what we care about is how our apps run on our users’ actual physical devices, but we also want to know how an app performs before we ship it. For this we use the Performance API (via react-native-performance), whose measurements we send to Sentry for Real User Monitoring; in development this is supported out of the box by Rozenite.
But we also wanted a reliable way to benchmark and compare two different builds to know whether our optimizations move the needle or new features regress performance. Since Maestro was already used for our End to End test suite, we simply extended that to also collect performance benchmarks in certain key flows.
To adjust for flukes we ran the same flow many times on different devices in our CI and calculated statistical significance for each metric. We were now able to compare each Pull Request to our main branch and see how it fared performance-wise. Surely, performance regressions were a thing of the past.
In practice, this didn’t have the outcomes we had hoped for, for a few reasons. First, we saw that the automated benchmarks were mainly used when developers wanted validation that their optimizations had an effect – which in itself is important and highly valuable – but this was typically after we had seen a regression in Real User Monitoring, not before.
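To make the comparison step concrete, here is a minimal sketch of one way to test whether two builds differ on a single metric. It uses a permutation test, which is our choice for illustration only; the function name and the timings are hypothetical, and the statistical test used in the actual pipeline is not specified here.

```python
import random


def permutation_p_value(baseline, candidate, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference in mean timings."""
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    n_base = len(baseline)

    def mean_gap(values):
        # Absolute gap between the means of the two groups in `values`.
        first, second = values[:n_base], values[n_base:]
        return abs(sum(second) / len(second) - sum(first) / len(first))

    pooled = list(baseline) + list(candidate)
    observed = mean_gap(pooled)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # pretend the build labels are interchangeable
        if mean_gap(pooled) >= observed:
            extreme += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical timings in milliseconds for one flow.
main_runs = [812, 790, 805, 821, 798, 809, 815, 801]
pr_runs = [868, 851, 879, 860, 872, 855, 866, 874]
print(f"p-value: {permutation_p_value(main_runs, pr_runs):.4f}")
```

A permutation test makes no assumption about how the timings are distributed, which is convenient for benchmark data that is often skewed. The result is only as trustworthy as the samples are independent of outside influences.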
To address this we started running benchmarks between release branches to see how they fared. While this did catch regressions, they were typically hard to address as there was a full week of changes to go through – something our release managers simply weren’t able to do in every instance. Even if they found the cause, simply reverting often wasn’t a possibility.
On top of that, the App Render Complete metric was network-dependent and non-deterministic, so if the servers had extra load that hour or if a feature flag was turned on, it would affect the benchmarks even if the code didn’t change, invalidating the statistical significance calculation.
We had to go back to the drawing board and reconsider our strategy. We had three major challenges: precision (pinpointing which change caused a regression), specificity (separating the effect of our own code from outside factors such as the backend), and variance (telling real shifts apart from run-to-run noise).
The solution to the precision problem was simple: we just needed to run the benchmarks for every merge, so that we could see on a time series graph when things changed. This was mainly an infrastructure problem, but thanks to optimized pipelines, build process and caching we were able to cut down the total time to about 8 minutes from merge to benchmarks ready.
When it comes to specificity, we needed to cut out as many confounding factors as possible, with the backend being the main one. To achieve this we first record the network traffic and then replay it during the benchmarks, including API requests, feature flags and websocket data. Additionally, we spread the runs out across even more devices.
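The record-and-replay idea can be sketched as a small store that is filled once from real traffic and then answers requests during a benchmark without touching the network. This is a simplified illustration: the class, its methods, and the example URL are all hypothetical, it only covers request/response pairs (not websocket data), and it says nothing about how the real tooling is built.

```python
import json


class ReplayStore:
    """Holds recorded responses and serves them back during a benchmark."""

    def __init__(self):
        self._responses = {}

    @staticmethod
    def _key(method, url):
        # Normalise the method so "get" and "GET" hit the same recording.
        return f"{method.upper()} {url}"

    def record(self, method, url, status, body):
        self._responses[self._key(method, url)] = {"status": status, "body": body}

    def replay(self, method, url):
        key = self._key(method, url)
        if key not in self._responses:
            # Fail loudly instead of silently falling back to the live backend.
            raise LookupError(f"no recording for {key}")
        return self._responses[key]

    def dumps(self):
        # Serialise so the recording can be stored next to the benchmark flow.
        return json.dumps(self._responses, sort_keys=True)

    @classmethod
    def loads(cls, text):
        store = cls()
        store._responses = json.loads(text)
        return store


# Record once (hypothetical endpoint), then replay from the saved fixture.
recorder = ReplayStore()
recorder.record("get", "https://api.example.test/flags", 200, {"new_home": True})
fixture = recorder.dumps()
replayed = ReplayStore.loads(fixture).replay("GET", "https://api.example.test/flags")
print(replayed["status"], replayed["body"])
```

Raising on an unrecorded request is a deliberate choice in this sketch: a request that slips through to a real server would reintroduce exactly the variability the replay is meant to remove.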
Together, these changes also contributed to solving the variance problem, in part by reducing it, but also by increasing the sample size by orders of magnitude. Just like in production, a single sample never tells the whole story, but by looking at all of them over time it was easy to see trend shifts that we could attribute to a range of 1-5 commits.
As mentioned above, simply having the metrics isn’t enough, as any regression needs to be actioned quickly, so we needed an automated way to alert us. At the same time, if we alerted too often or incorrectly due to inherent variance, the alerts would end up being ignored.
After trialing more esoteric models like Bayesian online changepoint detection, we settled on a much simpler moving average. When a metric regresses by more than 10% for at least two consecutive runs, we fire an alert.
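The rule above can be sketched in a few lines. The 10% threshold and the two consecutive runs come from the text; the window size, the choice to compare each run against the average of earlier healthy runs, and the function name are assumptions made for this example.

```python
from collections import deque


def detect_regressions(samples, window=10, threshold=0.10, consecutive=2):
    """Return the indices of the runs at which an alert would fire.

    Assumes higher values are worse (for example timings in milliseconds).
    A run counts as regressed when it exceeds the moving average of the
    last `window` non-regressed runs by more than `threshold`.
    """
    baseline = deque(maxlen=window)
    streak = 0
    alerts = []
    for index, value in enumerate(samples):
        if len(baseline) < window:
            baseline.append(value)  # still warming up the moving average
            continue
        average = sum(baseline) / len(baseline)
        if value > average * (1 + threshold):
            streak += 1
            if streak == consecutive:  # alert once per sustained regression
                alerts.append(index)
        else:
            streak = 0
            baseline.append(value)  # only healthy runs move the baseline
    return alerts


# Hypothetical series: a one-off spike, then a sustained regression.
history = [100] * 10 + [125] + [100] * 3 + [115, 116, 117]
print(detect_regressions(history))
```

Requiring two consecutive runs means an isolated spike is ignored, at the cost of alerting one run later. Keeping regressed runs out of the baseline stops the average from drifting upwards while a regression is still unresolved.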
While detecting and fixing regressions before a release branch is cut is fantastic, the holy grail is to prevent them from getting merged in the first place.
What’s stopping us from doing this at the moment is twofold: on one hand, running this for every commit in every branch requires even more capacity in our pipelines, and on the other hand, we need enough statistical power to tell whether there was an effect or not.
The two are antagonistic: given a fixed budget, benchmarking more commits leaves fewer runs and fewer devices for each one, which reduces statistical power.
The trick we intend to apply is to spend our resources smarter – since effect sizes vary, so can our sample size. Essentially, for changes with a big impact we can do fewer runs, and for changes with a smaller impact we do more runs.
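The relationship between effect size and the number of runs can be illustrated with the standard sample-size formula for comparing two means, n = 2((z_{1-α/2} + z_{power})·σ/δ)² runs per build. This is a textbook normal approximation, not a description of what will be implemented; the function name, the 40 ms standard deviation and the effect sizes are hypothetical.

```python
import math
from statistics import NormalDist


def runs_per_build(effect_ms, stdev_ms, alpha=0.05, power=0.8):
    """Approximate runs needed per build to detect a shift of `effect_ms`.

    Normal approximation for a two-sided, two-sample comparison of means,
    assuming both builds share the same run-to-run standard deviation.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    runs = 2 * ((z_alpha + z_power) * stdev_ms / effect_ms) ** 2
    return math.ceil(runs)


# Hypothetical metric with a run-to-run standard deviation of 40 ms.
for effect in (80, 40, 20, 10):
    print(f"{effect} ms shift: {runs_per_build(effect, stdev_ms=40)} runs per build")
```

Because the required number of runs grows with the inverse square of the effect, halving the effect to be detected roughly quadruples the cost. The approximation is optimistic for very small sample sizes, so the lowest figures should be treated as a floor rather than a target.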
By combining Maestro-based benchmarks, tighter control over variance, and pragmatic alerting, we have moved performance regression detection from a reactive exercise to a systematic, near-real-time signal.
While there is still work to do to stop regressions before they are merged, this approach has already made performance a first-class, continuously monitored concern – helping us ship faster without getting slower.
The post Preventing mobile performance regressions with Maestro appeared first on Kraken Blog.