On the 26th May 2022, we experienced elevated error rates on some of our core APIs over two time periods: 09:05 - 09:30 and 09:44 - 09:48 UTC.
The affected endpoints were:
This downtime was caused by a change we made to add extra validation to our event publishing. We generate an event whenever a resource is updated, for example when a payment has been collected. More details on events can be found here.
The first period ended when we fixed the root cause. The second period was caused by our continuous integration pipeline deploying an older version of the code, which temporarily reintroduced the issue. This was possible because the initial fix had been deployed manually. The issue resolved itself once the latest version was auto-deployed.
We appreciate that these APIs are essential to our customers’ operations and apologise for any disruption caused by this incident.
This year we started a company-wide objective to future-proof our systems as we continue along our growth trajectory. One of the projects under this objective was improving the rate at which we process payments in our payment pipeline, and one aspect of that work was increasing the throughput of event publishing.
On the morning of 26th May, we released a change to add extra validation to our event publishing. The validation was stricter than necessary, which caused every payment event to fail validation.
We emit an event every time we change the state of a payment. As a consequence of the failed validation, any request made to a payment-related API resulted in an HTTP 500 error.
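To illustrate the failure mode, here is a minimal Python sketch of how an overly strict validation step in the event publishing path can turn every payment request into an HTTP 500. It is not our actual code: the field names, the `validate_event`, `publish_event` and `handle_payment_request` functions, and the rule that was too strict are all assumptions made for illustration.

```python
class ValidationError(Exception):
    """Raised when an event does not satisfy the validation rules."""


# Hypothetical rules: these field names are assumptions, not the real schema.
REQUIRED_FIELDS = {"resource_type", "resource_id", "action"}
# An overly strict rule set that also demands a field payment events never carry.
OVERLY_STRICT_FIELDS = REQUIRED_FIELDS | {"metadata"}


def validate_event(event, required_fields):
    """Reject the event if any required field is missing."""
    missing = required_fields - set(event)
    if missing:
        raise ValidationError(f"missing fields: {sorted(missing)}")


def publish_event(event, required_fields):
    """Validate the event, then (in a real system) hand it to the publisher."""
    validate_event(event, required_fields)
    return "published"


def handle_payment_request(payment_id, required_fields):
    """Simulate an API handler that changes payment state and emits an event.

    Because the event is emitted as part of the request, a validation
    failure surfaces to the caller as an HTTP 500.
    """
    event = {
        "resource_type": "payments",
        "resource_id": payment_id,
        "action": "confirmed",
    }
    try:
        publish_event(event, required_fields)
    except ValidationError:
        return 500
    return 200


print(handle_payment_request("PM123", REQUIRED_FIELDS))       # 200
print(handle_payment_request("PM123", OVERLY_STRICT_FIELDS))  # 500
```

In this sketch the same well-formed payment event passes the original rules and fails the stricter ones, so every request that emits it fails.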
We started receiving alerts for failures on some core APIs at 09:02 UTC. We reverted the change and manually deployed the fix at 09:34.
Our CI pipeline then released an older revision that had already been in our deployment queue before we manually deployed the fix. This caused a further outage from 09:44 to 09:48 UTC.
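The ordering problem can be shown with a small simulation. This is a simplified sketch, not a model of our real pipeline: the `run_deploys` function and the revision labels are hypothetical, and it assumes the pipeline deploys queued revisions strictly in order without checking what is currently live.

```python
from collections import deque


def run_deploys(queue, manual_deploys):
    """Return the revisions that were live, in the order they went live.

    queue: revisions the pipeline will deploy, oldest first.
    manual_deploys: maps a queue position to a revision deployed by hand
        just before the pipeline processes that position.
    """
    live_history = []
    pending = deque(queue)
    position = 0
    while pending:
        if position in manual_deploys:
            # A manual deploy goes live immediately, bypassing the queue.
            live_history.append(manual_deploys[position])
        # The pipeline deploys whatever is next in its queue, even if a
        # newer revision is already live.
        live_history.append(pending.popleft())
        position += 1
    return live_history


# Hypothetical revision labels for illustration.
history = run_deploys(
    queue=["older-revision-with-bug", "latest-with-fix"],
    manual_deploys={0: "latest-with-fix"},
)
print(history)
# ['latest-with-fix', 'older-revision-with-bug', 'latest-with-fix']
```

The middle entry corresponds to the second outage window: the queued older revision replaced the manually deployed fix until the pipeline reached the latest version.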
As a result of this incident we are making the following changes:
All times in UTC