On 1st March 2022, a number of production webhooks created between 12:35 and 17:57 UTC were delayed. All of them were retried successfully by 19:11 UTC, so the maximum delay was 6 hours 36 minutes. As a result, customers received notifications from GoCardless, including payment updates, later than expected.
A smaller number of webhooks were affected in our sandbox environment. This resulted in some customers being unable to test their integrations.
As part of efforts to support increasing transaction volumes, we are making some changes to our webhook delivery infrastructure.
Previously, webhook records were created at the same time as we attempted to send them to integrator webhook endpoints.
We recently split this into two separate processes: one that persists webhook records and one that delivers them.
We rolled this change out to a portion of webhook traffic at 12:35 UTC on 1st March. Due to a bug in the new delivery process, some webhooks got stuck between the persist and deliver processes.
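To illustrate the split between persisting and delivering, here is a minimal sketch in Python. It is not GoCardless's implementation: the `WebhookRecord` type, the in-memory dictionary used as a store, and the `persist` and `deliver_pending` functions are all hypothetical names chosen for this example. It shows how a record can exist with zero delivery attempts if the deliver step never picks it up.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class WebhookRecord:
    """Hypothetical webhook record; field names are assumptions."""
    id: str
    payload: str
    attempts: int = 0
    delivered: bool = False


def persist(store: Dict[str, WebhookRecord], webhook_id: str, payload: str) -> WebhookRecord:
    """Step 1: create the record only. No delivery is attempted here."""
    record = WebhookRecord(id=webhook_id, payload=payload)
    store[webhook_id] = record
    return record


def deliver_pending(store: Dict[str, WebhookRecord], send: Callable[[str], bool]) -> List[str]:
    """Step 2: attempt delivery of every record not yet delivered.

    `send` stands in for the HTTP call to the integrator endpoint and
    returns True on success.
    """
    delivered_ids = []
    for record in store.values():
        if record.delivered:
            continue
        record.attempts += 1
        if send(record.payload):
            record.delivered = True
            delivered_ids.append(record.id)
    return delivered_ids
```

In a design like this, a webhook that has been persisted but is never seen by the deliver step stays at zero attempts, which is the "stuck" state described above.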
This meant that no delivery attempts were made for some webhooks. Those webhooks wrongly appeared as timed out on the merchant dashboard.
All webhooks that went through the existing infrastructure were delivered correctly.
Although we have monitoring on webhook creation and delivery, coupled with alerts on undelivered webhooks, a gap in our monitoring approach meant that the affected webhooks were not identified as undelivered.
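As an illustration of the kind of check that would cover this gap, the sketch below flags webhooks that were persisted but have had no delivery attempt after a grace period. It is an assumption-laden example, not our actual monitoring: the `WebhookStatus` type, its field names, the `find_stuck_webhooks` function, and the five-minute default threshold are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, List


@dataclass
class WebhookStatus:
    """Hypothetical view of a webhook for monitoring purposes."""
    id: str
    persisted_at: datetime
    delivery_attempts: int


def find_stuck_webhooks(
    webhooks: Iterable[WebhookStatus],
    now: datetime,
    max_wait: timedelta = timedelta(minutes=5),  # assumed threshold
) -> List[str]:
    """Return IDs of webhooks persisted more than `max_wait` ago
    that have never had a delivery attempt."""
    return [
        w.id
        for w in webhooks
        if w.delivery_attempts == 0 and now - w.persisted_at > max_wait
    ]


# Example: one webhook stuck since rollout, one recently persisted, one attempted.
now = datetime(2022, 3, 1, 13, 0, tzinfo=timezone.utc)
sample = [
    WebhookStatus("WB1", datetime(2022, 3, 1, 12, 35, tzinfo=timezone.utc), 0),
    WebhookStatus("WB2", datetime(2022, 3, 1, 12, 58, tzinfo=timezone.utc), 0),
    WebhookStatus("WB3", datetime(2022, 3, 1, 12, 40, tzinfo=timezone.utc), 1),
]
stuck = find_stuck_webhooks(sample, now)
```

A check of this shape alerts on the absence of any delivery attempt, rather than only on failed attempts, so it would catch webhooks that never reach the deliver process.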
We stopped the new infrastructure from processing webhooks at 17:57 UTC. This prevented any more webhooks from failing. We identified the webhooks that had failed to send and retried them successfully by 19:11 UTC.
Off the back of this incident, we have introduced additional testing to prevent this kind of failure, and we have improved our monitoring so that we are alerted earlier if webhooks fail for any reason. We are confident that this issue should not recur.
We understand that webhooks are a critical part of the service we provide and we are reviewing how we communicate with customers if and when our webhooks are degraded.
All times in UTC
15:37 All delayed sandbox webhooks identified and successfully retried