Write-up published
Resolved
28th July 2023
On Friday 21st July at 12:25 UTC a code fault led to all webhook endpoints being disabled for partners and merchants, preventing the delivery of all webhook notifications.
The affected webhook endpoints were re-enabled on 23rd July at 10:01 UTC, at which point webhook notifications resumed. The webhook notifications for events that had been missed during the outage were sent to integrators between 17:00 and 19:00 UTC on 23rd July.
We understand that webhook notifications are a critical part of many partners’ and customers’ workflows and we apologise for the disruption caused as a result of this incident.
Our production environment was affected; our sandbox environment was not.
Payments, uptime and email notifications were unaffected by this incident.
A code fault was introduced during a routine change to an internal task that closes unused accounts. The task disables all webhook endpoints on the account that is being closed. As part of the change we introduced a fault that disabled all webhook endpoints in the database rather than just the webhook endpoints associated with the account. Our automated tests did not test this scenario and so did not catch the fault. The faulty code change was deployed at 11:57 UTC on 21st July and the task was next run at 12:25 UTC on the same day, which triggered the outage.
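The difference between the intended behaviour and the fault can be shown with a minimal sketch. This is an illustration only, not the actual task code: the `WebhookEndpoint` class, the function names and the account identifiers are all hypothetical, and an in-memory list stands in for the database.

```python
from dataclasses import dataclass


@dataclass
class WebhookEndpoint:
    # Hypothetical, simplified model of a webhook endpoint record.
    id: str
    account_id: str
    enabled: bool = True


def disable_endpoints_unscoped(endpoints):
    """Faulty shape: no account filter, so every endpoint is disabled."""
    for endpoint in endpoints:
        endpoint.enabled = False


def disable_endpoints_for_account(endpoints, account_id):
    """Intended shape: only endpoints owned by the closing account are disabled."""
    for endpoint in endpoints:
        if endpoint.account_id == account_id:
            endpoint.enabled = False


def enabled_ids(endpoints):
    """Return the ids of endpoints that are still enabled, sorted."""
    return sorted(e.id for e in endpoints if e.enabled)


def make_endpoints():
    """Build one endpoint on the account being closed and two on another account."""
    return [
        WebhookEndpoint("WE1", "AC_closing"),
        WebhookEndpoint("WE2", "AC_active"),
        WebhookEndpoint("WE3", "AC_active"),
    ]
```

A regression test for this scenario would close one account and assert that endpoints on every other account remain enabled. That is the case the tests did not cover before the incident.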
A gap in our monitoring meant that we were not alerted that we had stopped sending webhooks. Our events infrastructure is used to trigger webhooks, emails and various internal processes. The events infrastructure continued to work as expected throughout the incident and only webhooks were affected. The vast majority of our monitoring is on the events pipeline and we did not have sufficient monitoring on the specific webhook sending component.
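One way to close this kind of gap is to compare webhook throughput with event throughput, so that a healthy events pipeline cannot mask a silent webhook sender. The sketch below assumes two counters sampled over the same time window. The function name, the thresholds and the counters are hypothetical and do not describe the alerts that were actually added.

```python
def webhook_throughput_alert(events_emitted, webhooks_sent, min_events=100, min_ratio=0.5):
    """Return True when webhook sends are suspiciously low relative to events.

    events_emitted: events produced in the window (assumed metric).
    webhooks_sent:  webhook notifications sent in the same window (assumed metric).
    min_events:     below this volume the window is too quiet to judge.
    min_ratio:      minimum acceptable webhooks-per-event ratio (hypothetical value).
    """
    if events_emitted < min_events:
        # Too little traffic to tell a fault apart from a quiet period.
        return False
    return webhooks_sent / events_emitted < min_ratio
```

With these hypothetical thresholds, a window with 1,000 events and no webhooks sent would raise the alert, while a window with 1,000 events and 900 webhooks would not.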
We became aware of the problem late on Saturday 22nd July. Our on-call teams promptly escalated the incident and the necessary domain experts joined the incident response early on the morning of Sunday 23rd July. We were able to quickly debug the issue, revert the fault, and restore the disabled webhook endpoints. This resumed webhook notifications for all integrators at 10:01 UTC on 23rd July.
Once we had resumed notifications we identified the events that had not been sent as a result of the outage. We sent webhook notifications containing these events to integrators between 17:00 and 19:00 UTC on 23rd July. This resolved the incident.
As a result of this incident we are making a number of changes:
We have added automated tests to protect against this fault reoccurring in the task to close accounts.
We have added new alerts on webhook throughput metrics that would have identified this issue much sooner.
We are reviewing our escalation procedures to ensure our on-call engineers are able to escalate to the necessary domain experts at all hours.
We are investigating stronger controls on database queries that are not scoped to individual customer accounts.
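As an illustration of the last item, one possible form of such a control is a query builder that refuses to produce an unscoped bulk update unless the caller opts in explicitly. This is a sketch under stated assumptions, not a description of the controls under investigation: the table name, the column names and the `allow_unscoped` flag are all hypothetical.

```python
class UnscopedQueryError(Exception):
    """Raised when a bulk update is requested without an account scope."""


def build_disable_endpoints_query(account_id=None, allow_unscoped=False):
    """Return (sql, params) for disabling webhook endpoints.

    The table and column names are hypothetical. Without an account_id the
    function raises, unless the caller passes allow_unscoped=True on purpose.
    """
    if account_id is None:
        if not allow_unscoped:
            raise UnscopedQueryError(
                "refusing to disable webhook endpoints without an account_id"
            )
        return "UPDATE webhook_endpoints SET enabled = FALSE", ()
    return (
        "UPDATE webhook_endpoints SET enabled = FALSE WHERE account_id = %s",
        (account_id,),
    )
```

The point of the design is that an update affecting every account has to be requested deliberately, rather than happening because a filter was omitted.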
All times in UTC
2023-07-21
11:57 We merged and deployed the change containing the code fault
12:25 We ran the task containing the fault that triggered the outage
2023-07-22
23:25 We were alerted to the issue and our on-call team updated the status page
2023-07-23
07:18 We discovered the root cause of the incident
08:06 We reverted the fault that caused the outage
10:01 We reactivated the disabled webhook endpoints and webhook delivery resumed
17:00 We started sending webhook notifications for events that were missed during the outage
19:00 We completed sending webhook notifications for events that were missed during the outage
Resolved
The issue has been resolved. Our monitoring shows that webhooks have been stable since yesterday. We have successfully resubmitted all missed events.
Monitoring
The replay of missed events is now complete. We will conduct some checks to verify that it ran correctly.
Monitoring
We kicked off a replay of missed events. This started around 17:00 UTC (18:00 BST). Integrator webhooks are now getting invoked with events starting from Friday when the incident began.
Identified
The issue has been fixed. Webhooks are now operational. New event notifications are getting delivered via webhooks. We are working on replaying/resubmitting all missed events since the incident began on Friday.
Identified
We experienced an issue where roughly 3/4 of webhook notifications were not delivered since Friday afternoon. The cause of the issue has been identified. A fix is in progress.
Investigating
Our engineers are still investigating the issue. A large number of webhook notifications are still not getting sent.
All other components are operational. Payments and mandates are unaffected by this incident. Events can still be queried via the API.
Investigating
We are continuing to investigate. The issue started on Friday 21 July around 13:30 UTC.
Investigating
We are experiencing issues with our integration webhooks. Our engineers are investigating the problem.