API, Payment Pages and Dashboard Outage

Incident Report for GoCardless

Postmortem

Summary

On the 26th May 2022, we experienced elevated error rates on some of our core APIs over two time periods: 09:05 - 09:30 and 09:44 - 09:48 UTC.

The affected endpoints were:

This downtime was caused by a change we made to add extra validation to our event publishing. We generate events whenever a resource has been updated, for example a payment which has been collected. More details on events can be found here.

The first period ended when we fixed the root cause. The second period was caused by our continuous integration pipeline deploying an older version of the code, temporarily re-introducing the issue; this was possible because the initial deployment of the fix was done manually. This rectified itself once the latest version was auto-deployed.

We appreciate that these APIs are essential to our customers’ operations and apologise for any disruption caused by this incident.

Root Causes

This year we started a company-wide objective to future-proof our systems as we continue along our growth trajectory. One project under this objective was improving the rate at which we process payments in our payment pipeline; one aspect of that project was increasing the throughput of event publishing.

On the morning of 26th May, we released a change to add extra validation to our event publishing. The validation was stricter than necessary, so it failed for all payment events.

We emit an event every time we change the state of a payment. As a consequence of the failed validation, any request made to payment-related APIs resulted in an HTTP 500 error.

Remedies

We started receiving alerts for failures on some core APIs at 09:02 UTC. We reverted the change and manually deployed the fix at 09:34.

Our CI pipeline then released an older revision that had already been in our deployment queue before we manually deployed the fix. This caused a further outage from 09:44 - 09:48 UTC.
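The ordering problem described above can be illustrated with a small guard that refuses to release a queued revision older than the one already live. This is a minimal sketch under stated assumptions, not a description of the actual pipeline: the `Revision` type, the `should_deploy` function, and the idea that each revision carries an increasing sequence number are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Revision:
    sha: str       # commit identifier (made-up values below)
    sequence: int  # hypothetical: position of the commit on the main branch


def should_deploy(candidate: Revision, live: Revision) -> bool:
    """Return True only if the candidate is strictly newer than what is live."""
    return candidate.sequence > live.sequence


# A manual deploy has put the fix (sequence 12) live while an older build
# containing the bug (sequence 11) is still waiting in the queue.
live = Revision(sha="fix", sequence=12)
queued = [Revision(sha="bug", sequence=11), Revision(sha="next", sequence=13)]

deployed = []
for candidate in queued:
    if should_deploy(candidate, live):
        live = candidate
        deployed.append(candidate.sha)

print(deployed)
```

With a check like this, the queued older build would be skipped instead of replacing the manually deployed fix, and only the newer revision would go out.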

As a result of this incident we are making the following changes:

In future, for any disruption to our core APIs, we will reach out to all known affected merchants via email.
For any new features or changes on the payments critical path, we will carry out more rigorous testing in staging environments using feature flags or conditional environment flags, and ensure a rollback strategy is in place.
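
As an illustration of the feature-flag approach, new validation can run in a non-enforcing mode first, where failures are logged but events are still published. The sketch below is a hedged example only: the schema fields, the `validate` and `publish` functions, the `enforce_validation` flag, and the event ID are hypothetical and do not reflect the real event publishing code.

```python
import logging

logger = logging.getLogger("events")

# Hypothetical schema: the field names are assumptions for illustration.
REQUIRED_FIELDS = {"id", "resource_type", "action"}


def validate(event: dict) -> list[str]:
    """Return a list of validation errors (empty if the event is valid)."""
    missing = REQUIRED_FIELDS - event.keys()
    return [f"missing field: {name}" for name in sorted(missing)]


def publish(event: dict, enforce_validation: bool, sink: list) -> None:
    """Append the event to sink; reject invalid events only when the flag is on."""
    errors = validate(event)
    if errors:
        if enforce_validation:
            raise ValueError("; ".join(errors))
        # Flag off: record the failure but do not block the request.
        logger.warning("event failed validation (not enforced): %s", errors)
    sink.append(event)


published = []
# This made-up event lacks "resource_type", so it fails the hypothetical schema.
publish({"id": "EV123", "action": "confirmed"}, enforce_validation=False, sink=published)
```

Running with the flag off in staging, and initially in production, would surface an over-strict schema as log warnings before enforcement is switched on. Turning the flag off again also gives a rollback path that does not require a deploy.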

Timeline

All times in UTC

2022-05-26

09:01 Deployed a change to add additional schema validation to events.
09:02 We received an alert that the schema validation had failed for events.
09:34 Reverted the change and manually deployed.
09:44 CI pipeline released an old change containing the bug.
09:48 CI pipeline released a change with the fix.

Posted Aug 30, 2022 - 10:37 BST

Resolved

Our monitoring shows that our services have stayed stable following the fix we applied. Payment creation and cancellation are now functional in all environments.

We do not expect any additional disruption.

Posted May 26, 2022 - 10:36 BST

Monitoring

Our engineers have applied a fix to the Live environment, and we are once again serving all API, dashboard, and payment page traffic.

The same fix is being rolled out to the Sandbox environment.

We are continuing to monitor the situation.

Posted May 26, 2022 - 10:31 BST

Investigating

We are currently experiencing elevated error rates when creating or modifying payments. This affects the API, Dashboard, and Payment Pages.

Our engineers are investigating and we'll provide updates here.

Posted May 26, 2022 - 10:27 BST

This incident affected: Payment Pages (Live, Sandbox), API (Live, Sandbox), and Dashboard (Live, Sandbox).