We experienced an API outage this morning, caused by upstream issues at our infrastructure and cloud services provider, Amazon Web Services (AWS), in their US East (N. Virginia) region.
This incident manifested as a high number of 5XX errors and timeouts for our API clients, lasting for a period of about 2 hours between approximately 06:05 and 08:07 UTC.
Our platform is built on AWS for high availability, with multiple layers of automatic failover and redundancy. As soon as we realised failover was not happening as expected, we worked with AWS support to identify the issue and were told "There was a internal hardware failure, that is supposed to be automatically recovered, but the recovery process failed."
They were able to resolve the issue and service returned to normal shortly after 8am. We are continuing to monitor throughout the day and waiting on further post-mortem from AWS. Going forward we are also investigating further safeguards or precautions we can take against such issues which arise outside of our control.
While our API did not suffer a total outage, during the failure period clients received a high proportion of failures and extremely high latency for any successful responses. We sincerely apologise for any issues caused by this outage.
If you have any questions or concerns about this incident, please don't hesitate to contact us at firstname.lastname@example.org