Delayed webhook notifications
Incident Report for GoCardless
Postmortem

Summary

On 1st March 2022 a number of production webhooks were delayed between 12:35 and 17:57 UTC. The webhooks were all retried successfully by 19:11. The maximum delay was 6 hours 36 minutes. This resulted in a delay in customers receiving notifications from GoCardless, including payment updates.
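
The maximum delay figure can be checked with a short calculation. This is only a sketch: it assumes the worst case is a webhook created when the incident began (12:35 UTC) and retried when the last retry completed (19:11 UTC), and the variable names are illustrative.

```python
from datetime import datetime

# Assumed worst case: a webhook created at the start of the incident
# and not delivered until the final retries completed.
first_affected = datetime(2022, 3, 1, 12, 35)
last_retried = datetime(2022, 3, 1, 19, 11)

max_delay = last_retried - first_affected
hours, remainder = divmod(int(max_delay.total_seconds()), 3600)
minutes = remainder // 60

print(f"{hours} hours {minutes} minutes")  # prints "6 hours 36 minutes"
```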

A smaller number of webhooks were affected in our sandbox environment. This resulted in some customers being unable to test their integrations.

Root Causes

As part of efforts to support increasing transaction volumes, we are making some changes to our webhook delivery infrastructure. 

Previously, webhook records were created at the same time as we attempted to send them to integrator webhook endpoints.

We recently split this into two separate processes:

  1. Persist new webhook records
  2. Deliver webhook requests to integrators’ endpoints
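
As a rough illustration of this two-step split, the sketch below persists records first and delivers them in a separate pass. It is not GoCardless's actual implementation: `WebhookRecord`, `WebhookStore`, `deliver_pending` and the `send` callback are all hypothetical names, and an in-memory list stands in for a real database.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class WebhookRecord:
    id: int
    endpoint: str
    payload: dict
    status: str = "pending"  # "pending" until a delivery attempt succeeds
    attempts: int = 0


class WebhookStore:
    """Hypothetical in-memory stand-in for the webhook table."""

    def __init__(self) -> None:
        self._records: List[WebhookRecord] = []

    def persist(self, endpoint: str, payload: dict) -> WebhookRecord:
        # Step 1: persist the record only; no delivery is attempted here.
        record = WebhookRecord(len(self._records) + 1, endpoint, payload)
        self._records.append(record)
        return record

    def pending(self) -> List[WebhookRecord]:
        return [r for r in self._records if r.status == "pending"]


def deliver_pending(store: WebhookStore, send: Callable[[str, dict], bool]) -> int:
    """Step 2: attempt delivery of every pending record; return the number delivered."""
    delivered = 0
    for record in store.pending():
        record.attempts += 1
        if send(record.endpoint, record.payload):
            record.status = "delivered"
            delivered += 1
    return delivered
```

With this kind of split, any record that the delivery step never picks up stays in `pending` with zero attempts, so the hand-off between the two steps needs monitoring of its own.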

We rolled this change out to a portion of webhook traffic at 12:35 UTC on 1st March. A bug in the new delivery process caused some webhooks to get stuck between the persist and deliver processes.

This meant that no delivery attempts were made for some webhooks. Those webhooks wrongly appeared as timed out on the merchant dashboard.

All webhooks that went through the existing infrastructure were delivered correctly.

Although we have monitoring on webhook creation and delivery, coupled with alerts on undelivered webhooks, a gap in our monitoring approach meant that the affected webhooks were not identified as undelivered.
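
One way to close this kind of gap is to alert on records that were persisted but never had a delivery attempt, instead of only on failed attempts. The following is a minimal sketch under assumed names: the record fields (`id`, `created_at`, `attempts`) and the five-minute threshold are illustrative and not taken from our systems.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def find_stuck_webhooks(
    records: List[Dict],
    now: datetime,
    max_age: timedelta = timedelta(minutes=5),  # assumed threshold
) -> List[int]:
    """Return ids of records older than max_age that have had no delivery attempt."""
    return [
        r["id"]
        for r in records
        if r["attempts"] == 0 and now - r["created_at"] > max_age
    ]


now = datetime(2022, 3, 1, 13, 0)
records = [
    {"id": 1, "created_at": datetime(2022, 3, 1, 12, 36), "attempts": 0},  # stuck
    {"id": 2, "created_at": datetime(2022, 3, 1, 12, 40), "attempts": 1},  # attempted
    {"id": 3, "created_at": datetime(2022, 3, 1, 12, 58), "attempts": 0},  # too recent
]
stuck = find_stuck_webhooks(records, now)
```

A check like this would flag a webhook that appears to have timed out without a single attempt ever being made against it.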

Remedies

We stopped the new infrastructure from processing webhooks at 17:57 UTC. This prevented any more webhooks from failing. We identified the webhooks that had failed to send and retried them successfully by 19:11 UTC.

Following this incident, we have introduced additional testing to prevent this kind of failure. We have also improved our monitoring so that we are alerted earlier if webhooks fail for any reason. We are confident that this issue should not recur.

We understand that webhooks are a critical part of the service we provide and we are reviewing how we communicate with customers if and when our webhooks are degraded.

Timeline

All times in UTC

2022-03-01

  • 12:34 We started sending some traffic to the new webhook infrastructure
  • 15:49 We were first alerted that webhooks were not being processed
  • 17:57 New webhook infrastructure turned off in production
  • 18:10 All live delayed webhooks identified
  • 19:11 All live delayed webhooks successfully retried

2022-03-02

  • 15:37 All sandbox delayed webhooks identified and successfully retried

Posted Mar 07, 2022 - 16:12 GMT

Resolved
A number of webhooks were delayed between 12:35 and 17:57. The webhooks were all retried successfully by 19:07. The maximum delay was 6 hours 32 minutes. A full post mortem will follow.
Posted Mar 01, 2022 - 12:30 GMT