Caching issue caused API timeouts and 500 errors for approximately 4 hours
Incident Report for Open Exchange Rates
Resolved
This issue began at around midnight UTC on 17th March and was resolved at approximately 4am UTC, with full service restored by 4:45am.

This was an uncharacteristic outage, triggered by a spike in requests shortly after midnight UTC, combined with an unforeseen issue with AWS ElastiCache, which led to critically high database connection counts and CPU utilisation. At this point, clients began receiving timeouts and retrying their requests at high frequency, which further compounded the issue [1].

By this stage, our auto-scaling infrastructure was unable to mitigate the issue and manual intervention was required. After investigating the root causes, we rolled out a temporary fix that resolved the immediate bottlenecks: temporarily switching off API usage logging and doubling our overall database capacity (logging has since been restored and clients' usage statistics are fully up to date).

We then identified, implemented and deployed a number of long-term solutions and caching improvements addressing both the events that directly preceded this outage and the specific failures they triggered, and continued to monitor throughout the day. We are confident that the improvements made today will prevent this issue from recurring.

We sincerely apologise for any inconvenience this outage may have caused you.

Please don't hesitate to contact us at support@openexchangerates.org with any questions, comments or concerns about this incident.

Kind regards,

– Open Exchange Rates Team

[1] Retrying failed requests is a common integration practice, although we strongly recommend waiting at least 1 second between retries to maximise the chance of a successful response.
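For illustration only, here is a minimal client-side retry sketch in Python (assuming the `requests` library; the endpoint path, `app_id` parameter and retry limits are placeholders rather than an official client) that waits at least one second between attempts and backs off on repeated failures:

```python
import time
import requests  # assumed HTTP client; any equivalent library works

API_URL = "https://openexchangerates.org/api/latest.json"  # illustrative endpoint
APP_ID = "YOUR_APP_ID"  # placeholder credential

def fetch_latest(max_retries=3, base_delay=1.0):
    """Fetch latest rates, waiting at least 1 second between retries."""
    for attempt in range(max_retries + 1):
        try:
            response = requests.get(API_URL, params={"app_id": APP_ID}, timeout=10)
            if response.status_code == 200:
                return response.json()
            # Non-200 (e.g. 500 or timeout at the gateway): fall through and retry
        except requests.RequestException:
            pass  # network error or client-side timeout: fall through and retry
        if attempt < max_retries:
            # Wait at least 1 second, doubling the delay on each repeated failure
            time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("API request failed after %d retries" % max_retries)
```

Adding a small random jitter to the delay can further reduce the chance of many clients retrying in lockstep.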
Posted Mar 17, 2017 - 19:15 UTC
Monitoring
We have resolved the immediate issue and API service has returned to normal. We are continuing to monitor and apologise for any inconvenience caused.
Posted Mar 17, 2017 - 04:38 UTC
Update
We have identified an issue in our AWS caching infrastructure which began at approximately 00:00 UTC on March 17th and caused a large volume of 500 errors and timeouts on API requests. We have performed a number of maintenance procedures to stabilise the service and are now working to implement a solution.
Posted Mar 17, 2017 - 04:21 UTC
Identified
We are looking into an API platform issue, which is causing intermittent timeouts and 500 errors for API clients. We will update as soon as we have any further information.
Posted Mar 17, 2017 - 02:33 UTC