Incorrect communications distributed to merchants via email
Incident Report for GoCardless
Postmortem

Incorrect communications sent to merchants via email

24th May 2022

Summary

On 24th May a large number of incorrect emails were sent to merchants between 09:00 UTC and 10:38 UTC. The emails contained incorrect notifications of:

  • Mandate failed
  • Mandate cancelled
  • Payment failed
  • Payment cancelled
  • Payment charged back

The incorrect emails caused significant confusion for many of our merchants and resulted in our support teams handling a much larger volume of queries than usual, which in turn led to delays in responding to our customers.

We understand that payment notification emails are a critical part of many customers’ workflows and we apologise for the disruption caused as a result of this incident.

The incorrect emails were stopped at 10:38 UTC on 24th May. We notified all affected merchants of the error and sent corrected notifications by:

  • English locales: 25th May 15:30 UTC
  • French and German locales: 26th May 16:20 UTC

Both our sandbox and production environments were affected.

Payments, uptime and payer emails were unaffected by this incident.

Root Causes

As part of efforts to support increasing transaction volumes, we are making changes to our infrastructure for handling payment events.

On the afternoon of 23rd May, we updated the job responsible for sending daily payment notification emails to use the new events infrastructure. This change contained a bug that caused the dates specifying the notification period to be ignored.

The job next ran on the morning of 24th May. Due to the bug, the job did not retrieve the event counts for the last 24 hours but instead retrieved the total historic counts for each event type. This resulted in merchants receiving incorrect and misleading emails.
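
This report does not describe how the dates came to be ignored. As an illustration only, the Python sketch below assumes one plausible mechanism: a query helper that silently drops filter names it does not recognise. Every name here (`count_events`, `created_after`, `created_before`, the sample events) is hypothetical and not taken from our codebase.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Hypothetical sample data: (event_type, occurred_at) pairs.
NOW = datetime(2022, 5, 24, 9, 0, tzinfo=timezone.utc)
EVENTS = [
    ("payment_failed", NOW - timedelta(days=400)),
    ("payment_failed", NOW - timedelta(days=30)),
    ("payment_failed", NOW - timedelta(hours=3)),
    ("mandate_cancelled", NOW - timedelta(days=90)),
]


def count_events(events, **filters):
    """Count events per type.

    Only 'start' and 'end' are understood. Any other keyword is silently
    ignored, which is the hazard this sketch demonstrates.
    """
    start = filters.get("start")
    end = filters.get("end")
    counts = Counter()
    for event_type, occurred_at in events:
        if start is not None and occurred_at < start:
            continue
        if end is not None and occurred_at >= end:
            continue
        counts[event_type] += 1
    return dict(counts)


# Intended call: only the last 24 hours are counted.
intended = count_events(EVENTS, start=NOW - timedelta(days=1), end=NOW)

# Assumed faulty call: the date filters use names the helper does not know,
# so they are dropped and every historic event is counted.
buggy = count_events(
    EVENTS,
    created_after=NOW - timedelta(days=1),
    created_before=NOW,
)

print(intended)  # {'payment_failed': 1}
print(buggy)     # {'payment_failed': 3, 'mandate_cancelled': 1}
```

In this sketch the faulty call raises no error, so nothing flags the problem until the inflated counts reach an email.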

Remedies

We stopped sending emails at 10:38 UTC. We removed all incorrect pending emails from the queue. These actions prevented any further impact on merchants.

We reverted the change that contained the bug at 11:32 UTC. This ensured the job ran correctly the next day (25th May).

A large number of merchants were affected by this incident and we wanted to be certain we sent the correct data in our follow-up communications. As a result, it took us some time to put together an individual response for each merchant. We notified all affected merchants of the error and sent corrected notifications by:

  • English locales: 25th May 15:30 UTC
  • French and German locales: 26th May 16:20 UTC

As a result of this incident we are making a number of changes:

  • Changing the default behaviour of our events queries to require start and end dates
  • Adding validations so unexpected parameters cause errors rather than being ignored
  • Running the new and old infrastructure in parallel for a period of time to check that they behave identically before switching traffic to the new infrastructure. In the past we have used this approach on a case-by-case basis. As a result of this incident we have made it the standard for migrations at GoCardless.
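
The three changes above can be sketched roughly as follows in Python. This is a minimal illustration under stated assumptions, not our actual implementation: `count_events`, `ALLOWED_FILTERS`, `shadow_compare` and the sample data are all hypothetical names introduced for this example.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Hypothetical set of optional filters the query accepts.
ALLOWED_FILTERS = {"event_type"}


def count_events(events, *, start, end, **filters):
    """Count events per type within [start, end).

    'start' and 'end' are required keyword-only arguments, and any
    unrecognised filter raises instead of being ignored.
    """
    unexpected = set(filters) - ALLOWED_FILTERS
    if unexpected:
        raise TypeError(f"unexpected query parameters: {sorted(unexpected)}")
    if start >= end:
        raise ValueError("start must be earlier than end")
    wanted = filters.get("event_type")
    counts = Counter()
    for event_type, occurred_at in events:
        if wanted is not None and event_type != wanted:
            continue
        if start <= occurred_at < end:
            counts[event_type] += 1
    return dict(counts)


def shadow_compare(old_query, new_query, **params):
    """Run both code paths, keep serving the old result, report any difference."""
    old_result = old_query(**params)
    try:
        new_result = new_query(**params)
    except Exception as exc:  # the new path must never break the old one
        return old_result, f"new path raised {exc!r}"
    if new_result != old_result:
        return old_result, f"mismatch: old={old_result} new={new_result}"
    return old_result, None


# Hypothetical sample data for the usage example.
NOW = datetime(2022, 5, 24, 9, 0, tzinfo=timezone.utc)
EVENTS = [
    ("payment_failed", NOW - timedelta(days=400)),
    ("payment_failed", NOW - timedelta(hours=3)),
    ("mandate_cancelled", NOW - timedelta(days=90)),
]


def old_path(start, end):
    return count_events(EVENTS, start=start, end=end)


def new_path_ignoring_start(start, end):
    # Simulates a new path that drops the start date.
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return count_events(EVENTS, start=epoch, end=end)


served, problem = shadow_compare(
    old_path, new_path_ignoring_start, start=NOW - timedelta(days=1), end=NOW
)
print(served)               # {'payment_failed': 1}
print(problem is not None)  # True
```

With required dates, a call such as `count_events(EVENTS)` fails immediately with a `TypeError` rather than returning all-time counts, and the parallel comparison surfaces a divergence between the two paths before any traffic is switched.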

Timeline

All times in UTC

2022-05-23

  • 15:01 We merged and deployed the change containing the bug

2022-05-24

  • 09:00 The daily payment notification job started running in sandbox
  • 10:00 The daily payment notification job started running in production
  • 10:02 We were alerted that the job was sending incorrect emails
  • 10:38 We stopped sending any further incorrect emails
  • 11:32 We reverted the change that caused the problem

2022-05-25

  • 09:00 The daily payment notification job started running correctly in sandbox
  • 10:00 The daily payment notification job started running correctly in production
  • 15:30 We notified all affected English-speaking merchants

2022-05-26

  • 16:08 We notified all affected German-speaking merchants
  • 16:20 We notified all affected French-speaking merchants
Posted Jun 09, 2022 - 11:28 BST

Resolved
Our monitoring shows that our services have stayed stable following the fix we applied.

We do not expect any additional disruption.

We will be following up by communicating with affected merchants.
Posted May 24, 2022 - 17:29 BST
Monitoring
Our engineers have identified the root cause and applied a fix.

We are continuing to monitor the situation.
Posted May 24, 2022 - 16:14 BST
Identified
We have removed the incorrect emails from our outgoing queues and restarted the regular flow of emails.

Payments, email notifications and communications to payees remain unaffected by this incident.

We are continuing to investigate the root cause and will provide further updates to affected merchants.
Posted May 24, 2022 - 12:50 BST
Investigating
Our engineers have identified an issue that has caused incorrect communications related to payments and mandate cancellations to be sent via email to merchants.

We have stopped further emails from going out until we fully understand the impact of the problem.

Our engineering team is working on identifying the source of the issue and will provide further details to affected merchants.
Posted May 24, 2022 - 12:16 BST
This incident affected: API (Live, Sandbox), Payment Pages (Live, Sandbox), Payment Processing (Reporting), and Dashboard (Live, Sandbox).