Findmypast (FMP) released the 1921 Census of England & Wales at midnight on 6th January 2022 to an eager community of genealogists. The preparation of the census - preservation, digitisation and transcription - took three years of hard work. Aside from the data preparation, we also had technical challenges to address. Specifically, could our services deal with the projected increase in users during the first few days of the 1921 launch period?

This post is the second in a series detailing how we approached scaling our service to deal with a projected 12x increase in users on day one. If you haven’t already, I’d suggest you read the first post, which details the initial scaling steps taken by our engineering teams.

Stress testing “game” days

We ended the first post by noting that the load tests created by the teams were effective at testing each service in isolation. The next step was to get all the teams together and run their load tests at the same time against our production service. While still not a realistic example of a user journey, this would test how our systems as a whole responded to the increased load.

Game day #1

Initially, we set a simple schedule of a load test “game day” each month. Scheduled for Thursday 8th April and spread over two hours, we planned to run three load tests. Each was 20 minutes in duration, and we aimed for the first test to apply 2x our normal load, the second 3x and the final test 4x. Note that “normal load” means the load we would expect in a typical January (the 1921 Census was launched in January). We have seasonal visitor patterns, with more visitors during the (northern hemisphere) winter months than the summer. When we set our targets for the load tests, we aimed for 4x the normal January load, not the current April load.
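As a rough illustration, each game day followed a stepped profile along the lines of the sketch below. It uses Locust, and the baseline user count and the endpoint are hypothetical stand-ins rather than our real figures or routes.

    # A stepped load profile: 20-minute steps at 2x, 3x and 4x a notional January baseline.
    # BASELINE_USERS and the /search path are made-up stand-ins, not our real numbers.
    from locust import HttpUser, LoadTestShape, task, between

    BASELINE_USERS = 500          # hypothetical stand-in for "normal January load"
    STEPS = [2, 3, 4]             # the three load multipliers for the game day
    STEP_DURATION = 20 * 60       # 20 minutes per step, in seconds


    class CensusUser(HttpUser):
        wait_time = between(1, 3)

        @task
        def search(self):
            # Hypothetical endpoint; each team's real test exercised its own routes.
            self.client.get("/search/1921-census", name="search")


    class SteppedLoad(LoadTestShape):
        """Ramp the simulated user count in fixed 20-minute steps."""

        def tick(self):
            step = int(self.get_run_time() // STEP_DURATION)
            if step >= len(STEPS):
                return None  # stop the test after the final step
            users = STEPS[step] * BASELINE_USERS
            return users, users  # (target user count, spawn rate)

The only point the sketch makes is that each multiplier is held for a fixed 20-minute window before stepping up.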

Overall, the first stress test game day was a success - the site stayed alive and generally responsive. But we did learn a few things from the first day:

  • One of our key services, which deals with GraphQL queries, failed its Service Level Objective (SLO) targets and crashed under load. By the 4x test we had tweaked the Kubernetes (K8s) deployment configuration to help stabilise the service, but more work was required to investigate the performance issues. One action that came out of this was a daily automated stress test to help the team diagnose both performance issues and K8s configuration issues.
  • Teams also identified performance issues in various other services that needed to be addressed before the next test.
  • Searching is driven by SOLR - during the 4x test we didn’t reach the throughput we were expecting.
  • As mentioned in the previous post, the Antracks service struggled with its SLO targets. A re-design was underway, but this is a central service, so its slow response times drove up latency in all the other microservices that relied on it.
  • Services that relied on Horizontal Pod Autoscaling (HPA) performed badly. During the tests, we manually increased the number of pods to help deal with the increased load. In retrospect, we realised that the shortness of the test (20 mins) meant that the HPA couldn’t react fast enough to the rapid increase in requests. Services were overloaded for the few minutes it took the HPA to kick in, resulting in slow response times. In some cases, the health probes on these overloaded pods didn’t respond in time, resulting in K8s helpfully killing the unresponsive pod, thereby adding more load to the remaining (overloaded) pods! For the launch of the 1921 census, we didn’t rely on HPA at all and simply created the number of pods we felt each service required to deal with the load (see the sketch after this list).
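For illustration, pinning replica counts up front rather than waiting for the HPA can be as simple as the sketch below. It assumes the official kubernetes Python client; the service names, per-pod capacities and headroom factor are hypothetical.

    import math
    from kubernetes import client, config

    # Hypothetical per-service estimates: expected peak requests/sec at launch and the
    # requests/sec one pod can comfortably serve (taken from earlier isolated load tests).
    SERVICES = {
        "search-api":   {"peak_rps": 4000, "rps_per_pod": 150},
        "image-viewer": {"peak_rps": 1200, "rps_per_pod": 80},
    }
    HEADROOM = 1.5  # over-provision up front rather than wait for the HPA to react


    def main():
        config.load_kube_config()
        apps = client.AppsV1Api()
        for name, estimate in SERVICES.items():
            replicas = math.ceil(estimate["peak_rps"] * HEADROOM / estimate["rps_per_pod"])
            # Scale the deployment to a fixed replica count for the launch window.
            apps.patch_namespaced_deployment_scale(
                name=name,
                namespace="production",
                body={"spec": {"replicas": replicas}},
            )
            print(f"{name}: pinned to {replicas} replicas")


    if __name__ == "__main__":
        main()

The point is simply that the replica count is derived from the expected peak ahead of time, rather than left to an autoscaler that reacts over several minutes.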

Game day #2

The next game day was held on 3rd June, with the aim of running tests at 4x, 5x and finally 6x the normal January load. Again, the load test was a success, and although we saw some slowness in a few services, search 99th percentile response times weren’t too bad - though not great either. Teams could still see that services needed performance improvements. Throughput on the search load test didn’t quite hit the 6x we were aiming for; this struggle to hit target throughput became a common problem for all services throughout the remaining tests.

Game day #3

Here we followed the same approach as the previous game days - start at the level reached in the previous test and run three tests of increasing load. In this case, we started at 6x, then 7x and finally 8x. The load test was scheduled for the start of July.

The first load test - the 6x - didn’t perform well. We had major issues on the site caused by K8s worker nodes becoming unavailable and pods crashing. This behaviour was unexpected - we had already run a successful 6x load test - so a sudden failure like this was concerning. Analysis of the failed worker nodes and crashing pods appeared to point the finger at our virtualisation hardware. The Findmypast website does not run on cloud services but on dedicated hardware in our data centres; we use Hyper-V to manage both Windows and Linux virtual machines on a cluster of Hyper-V servers. Prior to the test, new hardware had been brought into the data centre to increase our capacity. During this test, some of the K8s worker nodes were running on the new hardware, which of course ran a different version of the Hyper-V server. We migrated the worker nodes back to the old cluster and attempted the test again.

This time the test performed better, and we managed an 8x increase in load. But we did start to see a lot of cracks appearing in the services:

  • Clearly, something wasn’t quite right with the new Hyper-V cluster. That needed digging into.
  • A few services were hitting rate limits on some of the external 3rd party APIs we rely on. (We test on production, so real rate limits were hit.) Some code changes helped - better caching, for example (see the caching sketch after this list) - and we also discussed temporary rate limit increases during the load tests with our 3rd parties, which also helped.
  • Our monitoring solution started to struggle. We lost some of the instrumentation data during the 8x test.
  • The re-design of the Antracks service was taking shape but still wasn’t ready. The increase in load demonstrated just how bad Antracks was at scaling. Worse, due to its design, it contributed more load to other services. This central service was having a detrimental effect on the rest of the system.
  • Legacy code needed attention; some of it hadn’t been touched in years.
  • Throughput for services still didn’t meet the targets. In hindsight, we should have realised that the throughput problems were also showing us that latency was increasing throughout the system, reducing the number of requests a service could handle and slowing down the load test tooling, which was waiting on responses. To hit the throughput targets, we would need to massively increase the concurrency of the tests (see the worked example after this list).
  • On the plus side, extra hardware for the SOLR cluster meant that the search load tests were solid.
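On the caching point above: here is a minimal sketch, assuming the requests library and a hypothetical third-party endpoint, of the kind of short-lived response cache that cuts down how often a rate-limited external API is called.

    import time
    import requests

    _CACHE: dict[str, tuple[float, dict]] = {}
    TTL_SECONDS = 300  # re-use third-party responses for five minutes


    def lookup(resource_id: str) -> dict:
        """Return the third-party response for resource_id, re-using a recent cached copy."""
        now = time.monotonic()
        cached = _CACHE.get(resource_id)
        if cached and now - cached[0] < TTL_SECONDS:
            return cached[1]  # cache hit: no call to the rate-limited API
        # Hypothetical endpoint; the real calls went through each team's own API clients.
        response = requests.get(
            f"https://thirdparty.example.com/v1/items/{resource_id}", timeout=5
        )
        response.raise_for_status()
        data = response.json()
        _CACHE[resource_id] = (now, data)
        return data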
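And on the throughput point: in a closed-loop load test, each virtual user waits for a response before sending its next request, so rising latency directly caps the request rate the tooling can generate. A rough worked example with made-up numbers:

    def max_throughput(virtual_users: int, avg_latency_s: float) -> float:
        """Approximate requests/sec a closed-loop load test can drive (Little's law)."""
        return virtual_users / avg_latency_s

    # 500 virtual users against a healthy 250 ms service: ~2000 requests/sec.
    print(max_throughput(500, 0.25))
    # The same 500 users once latency degrades to 1.5 s: ~333 requests/sec.
    print(max_throughput(500, 1.5))
    # To keep driving 2000 requests/sec at 1.5 s latency we would need
    # roughly 2000 * 1.5 = 3000 concurrent users.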

Game day #4…

For this game day the objective was to hit 10x the January load. It was during the August tests that we started to see real problems with the performance of the site under load. We struggled to get past 9x. We continued to see network timeouts and slow response times. The load test was affecting the live site, with users experiencing errors when viewing search results, record transcripts and record images.

To make matters worse, Kubernetes worker nodes failed, taking the services running on them down too and contributing significantly to the overall site instability. Again, the suspicion was that the new Hyper-V cluster was affecting the VMs in some unknown way.

From a SOLR search point of view, the SOLR servers were hardly being stressed. By this point, we had also extended the SOLR clusters into AWS - giving us extra capacity when required. But the overall latency problems and lack of throughput meant that we couldn’t load test search capacity as much as we would have liked.

The network timeouts were frustrating. Some services failed to talk to other services, even though the receiving service reported itself as up and responding quickly. Our instrumentation highlighted major latency differences between services. Service A - calling Service B - would report a lot of timeouts and long response times, while Service B reported that, from its point of view, it responded rather quickly. Some services had no issues talking to service X, while other services reported timeouts against the same endpoint.

These timeouts happened regardless of whether a service was located inside or outside of K8s; calls to services inside the K8s cluster or from outside the cluster into K8s were equally affected.

Frustratingly, teams did not experience these timeouts when running their load tests in isolation; only when the load tests were run together did we start to see these network issues.
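The comparison our instrumentation was surfacing looks something like the sketch below: time the call from the caller's side and compare it with what the callee says about its own processing time. The requests library, the URL and the timing header here are hypothetical stand-ins for our real tracing data.

    import time
    import requests


    def probe(url: str, timeout: float = 5.0) -> None:
        start = time.monotonic()
        try:
            response = requests.get(url, timeout=timeout)
            client_ms = (time.monotonic() - start) * 1000
            # Hypothetical header: assumes the callee reports its own processing time.
            server_ms = float(response.headers.get("X-Server-Duration-Ms", "nan"))
            print(f"{url}: caller saw {client_ms:.0f} ms, callee reports {server_ms:.0f} ms")
        except requests.exceptions.Timeout:
            print(f"{url}: timed out after {timeout}s (the callee may still think it answered quickly)")


    probe("http://service-b.production.svc.cluster.local/records")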

We continued running the load tests during September and October. In October we moved some of the load tests from our internal tools onto BlazeMeter. The latencies reported by BlazeMeter were worse than those reported by the internal tooling. Our Antracks service was still slowing things down, and its redesign was taking longer than expected. We could also see that K8s worker node performance suffered when running on the new Hyper-V hardware. Things were looking grim.

Conclusion

By September, the plan was to have demonstrated a stable site capable of dealing with 12x the January load. By the end of October, we still faced the same network timeouts and throughput problems. By this point, it was looking increasingly like an infrastructure issue rather than a coding or performance problem. Attention moved to the Hyper-V servers. The next blog post will continue the story…

Get in touch

We are always looking for engineers to join our team, so if you’re interested, contact us or check out our current vacancies.