Partial outage of integration webhooks
Incident Report for GoCardless
Postmortem

Webhooks not delivered

28th July 2023

Summary

On Friday 21st July at 12:25 UTC a code fault led to all webhook endpoints being disabled for partners and merchants, preventing the delivery of all webhook notifications.

The affected webhook endpoints were re-enabled on 23rd July at 10:01 UTC, at which point webhook notifications resumed. The webhook notifications for events that had been missed during the outage were sent to integrators between 17:00 and 19:00 UTC on 23rd July.

We understand that webhook notifications are a critical part of many partners’ and customers’ workflows and we apologise for the disruption caused as a result of this incident.

Our production environment was affected; our sandbox environment was not.

Payments, uptime and email notifications were unaffected by this incident.

Root Causes

A code fault was introduced during a routine change to an internal task that closes unused accounts. The task disables all webhook endpoints on the account that is being closed. As part of the change we introduced a fault that disabled all webhook endpoints in the database rather than just the webhook endpoints associated with the account. Our automated tests did not test this scenario and so did not catch the fault. The faulty code change was deployed at 11:57 UTC on 21st July and the task was next run at 12:25 UTC on the same day, which triggered the outage.
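As an illustration only, the difference between the intended and the faulty behaviour can be sketched with a small in-memory database. The table name, column names, and function names below are hypothetical and do not reflect our actual schema or code. The sketch shows an update scoped to a single account, with a guard that refuses to run without an account identifier.

```python
import sqlite3


def make_db():
    """Build a throwaway database with a hypothetical webhook_endpoints table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE webhook_endpoints ("
        "id INTEGER PRIMARY KEY, "
        "account_id TEXT NOT NULL, "
        "enabled INTEGER NOT NULL DEFAULT 1)"
    )
    conn.executemany(
        "INSERT INTO webhook_endpoints (account_id, enabled) VALUES (?, 1)",
        [("acct_closed",), ("acct_closed",), ("acct_active",)],
    )
    conn.commit()
    return conn


def disable_endpoints_for_account(conn, account_id):
    """Disable only the endpoints that belong to the account being closed.

    The fault described above corresponds to running this UPDATE without
    the WHERE clause, which would touch every row in the table.
    """
    if not account_id:
        # Refuse to run an update that is not scoped to one account.
        raise ValueError("account_id is required")
    cur = conn.execute(
        "UPDATE webhook_endpoints SET enabled = 0 WHERE account_id = ?",
        (account_id,),
    )
    conn.commit()
    return cur.rowcount


def enabled_count(conn, account_id):
    """Count the endpoints that are still enabled for one account."""
    row = conn.execute(
        "SELECT COUNT(*) FROM webhook_endpoints "
        "WHERE account_id = ? AND enabled = 1",
        (account_id,),
    ).fetchone()
    return row[0]
```

A test for this scenario would close one account and then assert that the endpoints of every other account remain enabled, which is the case our automated tests did not cover.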

A gap in our monitoring meant that we were not alerted that we had stopped sending webhooks. Our events infrastructure is used to trigger webhooks, emails and various internal processes. The events infrastructure continued to work as expected throughout the incident and only webhooks were affected. The vast majority of our monitoring is on the events pipeline and we did not have sufficient monitoring on the specific webhook sending component.

Remedies

We became aware of the problem late on Saturday 22nd July. Our on-call teams promptly escalated the incident and the necessary domain experts joined the incident response early on the morning of Sunday 23rd July. We were able to quickly debug the issue, revert the fault, and restore the disabled webhook endpoints. This resumed webhook notifications for all integrators at 10:01 UTC on 23rd July.

Once we had resumed notifications we identified the events that had not been sent as a result of the outage. We sent webhook notifications containing these events to integrators between 17:00 and 19:00 UTC on 23rd July. This resolved the incident.
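The selection of events to resend can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the tooling we used: the `Event` type, the `delivered_ids` set, and the function name are hypothetical. Only the outage window is taken from this report.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Outage window from this report (UTC): endpoints disabled at 12:25 on
# 21st July, re-enabled at 10:01 on 23rd July.
OUTAGE_START = datetime(2023, 7, 21, 12, 25, tzinfo=timezone.utc)
OUTAGE_END = datetime(2023, 7, 23, 10, 1, tzinfo=timezone.utc)


@dataclass(frozen=True)
class Event:
    """Hypothetical minimal event record."""
    id: str
    created_at: datetime


def events_to_replay(events, delivered_ids,
                     outage_start=OUTAGE_START, outage_end=OUTAGE_END):
    """Return events created during the outage that were never delivered.

    Events are returned oldest first so that integrators receive them in
    the order in which they were created.
    """
    missed = [
        e for e in events
        if outage_start <= e.created_at < outage_end
        and e.id not in delivered_ids
    ]
    return sorted(missed, key=lambda e: e.created_at)
```

Excluding events that were already delivered avoids sending a duplicate notification for any webhook that did go out during the window.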

As a result of this incident we are making a number of changes:

  • We have added automated tests to protect against this fault recurring in the task to close accounts.
  • We have added new alerts on webhook throughput metrics that would have identified this issue much sooner.
  • We are reviewing our escalation procedures to ensure our on-call engineers are able to escalate to the necessary domain experts at all hours.
  • We are investigating stronger controls on database queries that are not scoped to individual customer accounts.
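
As a rough sketch of the kind of check a webhook throughput alert might perform, the function below compares the number of webhooks sent against the number of events emitted over the same time window. The function name, parameters, and thresholds are assumptions chosen for illustration and are not our actual alerting configuration.

```python
def webhook_throughput_alert(events_emitted, webhooks_sent,
                             min_events=100, min_ratio=0.5):
    """Return True when webhook sending looks stalled relative to events.

    events_emitted: events produced by the events pipeline in the window.
    webhooks_sent:  webhook notifications sent in the same window.
    min_events:     below this volume the window is too quiet to judge
                    (hypothetical threshold).
    min_ratio:      lowest acceptable webhooks-per-event ratio
                    (hypothetical threshold).
    """
    if events_emitted < min_events:
        # Too little traffic to tell an outage apart from a quiet period.
        return False
    return webhooks_sent / events_emitted < min_ratio
```

Measuring the webhook sending component against the events pipeline, rather than monitoring the pipeline alone, is what lets a check like this fire when events keep flowing but webhooks stop.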

Timeline

All times in UTC

2023-07-21

  • 11:57 We merged and deployed the change containing the code fault
  • 12:25 We ran the task containing the fault that triggered the outage

2023-07-22

  • 23:25 We were alerted to the issue and our on-call team updated the status page

2023-07-23

  • 07:18 We discovered the root cause of the incident
  • 08:06 We reverted the fault that caused the outage
  • 10:01 We reactivated the disabled webhook endpoints and webhook delivery resumed
  • 17:00 We started sending webhook notifications for events that were missed during the outage
  • 19:00 We completed sending webhook notifications for events that were missed during the outage
Posted Jul 31, 2023 - 10:20 BST

Resolved
The issue has been resolved. Our monitoring shows that webhooks have been stable since yesterday. We have successfully resubmitted all missed events.
Posted Jul 24, 2023 - 16:54 BST
Update
The replay of missed events is now complete. We will conduct some checks to verify that it ran correctly.
Posted Jul 23, 2023 - 19:48 BST
Monitoring
We kicked off a replay of missed events. This started around 17:00 UTC (18:00 BST). Integrator webhooks are now getting invoked with events starting from Friday when the incident began.
Posted Jul 23, 2023 - 19:04 BST
Update
The issue has been fixed. Webhooks are now operational. New event notifications are getting delivered via webhooks. We are working on replaying/resubmitting all missed events since the incident began on Friday.
Posted Jul 23, 2023 - 11:43 BST
Identified
We are experiencing an issue where roughly 3/4 of webhook notifications have not been delivered since Friday afternoon. The cause of the issue has been identified. A fix is in progress.
Posted Jul 23, 2023 - 09:36 BST
Update
Our engineers are still investigating the issue. A large number of webhook notifications are still not getting sent.

All other components are operational. Payments and mandates are unaffected by this incident. Events can still be queried via the API.
Posted Jul 23, 2023 - 02:30 BST
Update
We are continuing to investigate. The issue started on Friday 21 July around 13:30 UTC.
Posted Jul 23, 2023 - 01:04 BST
Investigating
We are experiencing issues with our integration webhooks. Our engineers are investigating the problem.
Posted Jul 23, 2023 - 00:25 BST
This incident affected: Payment Pages (Live, Sandbox), API (Live, Sandbox), and Dashboard (Live, Sandbox).