This issue began at around midnight UTC on 17th March and was resolved at approximately 4am UTC, with full service restored by 4.45am.
This was an uncharacteristic outage, triggered by a spike in requests shortly after midnight UTC, combined with an unforeseen issue with AWS ElastiCache, which led to critically high database connections and CPU utilisation. At this point, clients began receiving timeouts and retrying their requests at a high frequency, which further compounded the issue [1].
By this stage, our auto-scaling infrastructure was not able to mitigate the issue and manual intervention was required. After investigating the root causes, we rolled out a temporary solution which resolved the immediate bottlenecks: temporarily switching off API usage logging and doubling our overall database capacity (logging has since been restored and clients' usage statistics are fully up to date).
We then identified, implemented and deployed a number of long-term solutions and caching improvements in response to the events that directly preceded this outage and the specific failures they triggered, and continued to monitor throughout the day. We are confident that the improvements we have made today will prevent this issue from happening again.
We sincerely apologise for any inconvenience this outage may have caused you.
Please don't hesitate to contact us at support@openexchangerates.org with any questions, comments or concerns about this incident.
Kind regards,
– Open Exchange Rates Team
[1] Retrying failed requests is a common integration practice, although we strongly recommend waiting at least 1 second between retries to maximise the chance of a successful response.
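
For illustration, a minimal sketch of this retry pattern in Python is shown below. The endpoint URL, function name and parameters are hypothetical and not part of any official client library; adjust them to suit your own integration.

    import time

    import requests  # third-party HTTP client, used here purely for illustration

    def fetch_with_retries(url, max_attempts=3, delay_seconds=1.0):
        """Fetch a URL, pausing at least `delay_seconds` between attempts."""
        for attempt in range(1, max_attempts + 1):
            try:
                response = requests.get(url, timeout=10)
                # Only retry on server-side errors; 4xx responses won't improve on retry.
                if response.status_code < 500:
                    response.raise_for_status()
                    return response.json()
            except (requests.Timeout, requests.ConnectionError):
                pass  # timeouts and dropped connections are worth retrying
            if attempt < max_attempts:
                time.sleep(delay_seconds)  # wait at least 1 second before retrying
        raise RuntimeError(f"Request failed after {max_attempts} attempts: {url}")

    # Hypothetical endpoint, shown only to illustrate the pattern.
    rates = fetch_with_retries("https://example.com/api/latest.json")

A fixed 1-second pause is the minimum we recommend; increasing the delay after each failed attempt (an exponential back-off) is gentler still on a service under load.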