28th July 2023
On Friday 21st July at 12:25 UTC a code fault led to all webhook endpoints being disabled for partners and merchants, preventing the delivery of all webhook notifications.
The affected webhook endpoints were re-enabled on 23rd July at 10:01 UTC, at which point webhook notifications resumed. The webhook notifications for events that had been missed during the outage were sent to integrators between 17:00 and 19:00 UTC on 23rd July.
We understand that webhook notifications are a critical part of many partners’ and customers’ workflows and we apologise for the disruption caused as a result of this incident.
Our production environment was affected; our sandbox environment was not.
Payments, uptime and email notifications were unaffected by this incident.
A code fault was introduced during a routine change to an internal task that closes unused accounts. The task disables all webhook endpoints on the account that is being closed. As part of the change we introduced a fault that disabled all webhook endpoints in the database rather than just the webhook endpoints associated with the account. Our automated tests did not test this scenario and so did not catch the fault. The faulty code change was deployed at 11:57 UTC on 21st July and the task was next run at 12:25 UTC on the same day, which triggered the outage.
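To illustrate this class of fault, the sketch below contrasts an update that is missing its account filter with one that is correctly scoped, and shows the kind of test that would catch the difference. It is a hypothetical reconstruction and not our actual code: the table name, column names, account identifiers and function names are all assumptions made for the example.

```python
import sqlite3


def make_db():
    """Build an in-memory database with endpoints for two accounts (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE webhook_endpoints ("
        "id INTEGER PRIMARY KEY, "
        "account_id TEXT NOT NULL, "
        "enabled INTEGER NOT NULL DEFAULT 1)"
    )
    db.executemany(
        "INSERT INTO webhook_endpoints (account_id) VALUES (?)",
        [("acct_closing",), ("acct_closing",), ("acct_active",)],
    )
    return db


def disable_endpoints_unscoped(db, account_id):
    """Faulty shape: the account filter is missing, so every row is updated."""
    db.execute("UPDATE webhook_endpoints SET enabled = 0")


def disable_endpoints_for_account(db, account_id):
    """Intended shape: only the closing account's endpoints are disabled."""
    db.execute(
        "UPDATE webhook_endpoints SET enabled = 0 WHERE account_id = ?",
        (account_id,),
    )


def enabled_count(db, account_id):
    """Count the endpoints that are still enabled for one account."""
    row = db.execute(
        "SELECT COUNT(*) FROM webhook_endpoints "
        "WHERE account_id = ? AND enabled = 1",
        (account_id,),
    ).fetchone()
    return row[0]


def other_accounts_untouched(disable_fn):
    """The check our tests lacked: closing one account must not affect another."""
    db = make_db()
    disable_fn(db, "acct_closing")
    return enabled_count(db, "acct_closing") == 0 and enabled_count(db, "acct_active") == 1
```

A test that only checks the closing account's endpoints passes for both versions. The fault only shows up when the test also asserts that an unrelated account's endpoints remain enabled, as `other_accounts_untouched` does.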
A gap in our monitoring meant that we were not alerted that we had stopped sending webhooks. Our events infrastructure is used to trigger webhooks, emails and various internal processes. The events infrastructure continued to work as expected throughout the incident and only webhooks were affected. The vast majority of our monitoring is on the events pipeline and we did not have sufficient monitoring on the specific webhook sending component.
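As a rough illustration of the kind of check that covers the sending component itself, the sketch below compares the number of events emitted with the number of webhooks sent over the same time window. It is a minimal example under stated assumptions and not a description of our monitoring: the names and thresholds are hypothetical, and the expected ratio of webhooks to events would need to be tuned, since not every event necessarily results in a webhook.

```python
from dataclasses import dataclass


@dataclass
class WindowCounts:
    """Counts observed over one monitoring window (hypothetical metrics)."""
    events_emitted: int
    webhooks_sent: int


def webhook_delivery_alert(counts, min_events=100, min_ratio=0.5):
    """Return True when webhook sending has fallen well behind event volume.

    min_events and min_ratio are illustrative thresholds, not real values.
    """
    # With very little traffic the ratio is too noisy to act on.
    if counts.events_emitted < min_events:
        return False
    # Healthy events pipeline but few or no webhooks sent: raise an alert.
    return counts.webhooks_sent / counts.events_emitted < min_ratio
```

In a situation like this incident, where the events pipeline is healthy but no webhooks are being sent, `webhook_delivery_alert(WindowCounts(1000, 0))` returns `True`, while a monitor that only watches event volume would stay silent.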
We became aware of the problem late on Saturday 22nd July. Our on-call teams promptly escalated the incident and the necessary domain experts joined the incident response early on the morning of Sunday 23rd July. We were able to quickly debug the issue, revert the fault, and restore the disabled webhook endpoints. This resumed webhook notifications for all integrators at 10:01 UTC on 23rd July.
Once we had resumed notifications we identified the events that had not been sent as a result of the outage. We sent webhook notifications containing these events to integrators between 17:00 and 19:00 UTC on 23rd July. This resolved the incident.
As a result of this incident we are making a number of changes:
All times in UTC