DNS outage on 54.174.40.213
Incident Report for DNSWatch
Postmortem

Summary

On July 29, 2020, the primary DNSWatch DNS nameserver in the Americas (USA / US-EAST) region experienced a disruption. During this time, US-based partners and customers with DNSWatch enabled on their Fireboxes experienced intermittent issues connecting to the internet. We sincerely apologize for any inconvenience this caused and want to share more details about what happened, what we've already changed, and what we're working on to ensure DNSWatch provides the reliability you expect and deserve.

A full timeline of this incident, as well as other issues and incidents, is tracked on our public DNSWatch statuspage.

What happened?

Following a routine deployment, our primary DNS nameserver in the Americas region stopped responding to DNS requests. As a result, US-based customers and partners with DNSWatch enabled on their Fireboxes experienced intermittent issues connecting to the internet from 4:52PM UTC to 8:10PM UTC.

Customers and partners with GO Clients and protected networks deployed were not impacted during this incident. Only US-based customers with DNSWatch enabled on their Firebox were impacted.

Why did it happen?

The incident resulted from our primary DNS nameserver running out of available disk space. Although we monitor available disk space and have alarms that alert the engineering team, the shortage went unaddressed. During the deployment, processes on these systems could not be restarted, and the systems failed to recover.
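As an illustration only, a deployment could refuse to proceed when the target volume is nearly full. The following is a minimal sketch, not the DNSWatch tooling: the function names and the 20% threshold are assumptions chosen for the example.

```python
import shutil


def free_fraction(total_bytes: int, free_bytes: int) -> float:
    """Return the fraction of the volume that is still free (0.0 to 1.0)."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return free_bytes / total_bytes


def safe_to_deploy(total_bytes: int, free_bytes: int, min_free: float = 0.20) -> bool:
    """True when at least `min_free` of the volume is free.

    The 20% default is an assumed threshold, not a value from the report.
    """
    return free_fraction(total_bytes, free_bytes) >= min_free


def check_path(path: str = "/", min_free: float = 0.20) -> bool:
    """Apply the check to the volume that holds `path`."""
    usage = shutil.disk_usage(path)
    return safe_to_deploy(usage.total, usage.free, min_free)


if __name__ == "__main__":
    # A deployment script could abort here instead of restarting services.
    print("safe to deploy:", check_path("/"))
```

A check like this complements alerting rather than replacing it: an alert can go unaddressed, whereas a gate fails the deployment before processes are stopped.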

Additionally, our team found that during primary DNS nameserver failures, users behind a Firebox experience internet connection issues while the device shifts to using the secondary nameserver.
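To show why a shift to a secondary nameserver can be visible to users, here is a generic sketch of sequential failover. It does not describe the Firebox implementation: `resolve_with_failover`, `fake_send_query`, the host names, and the 2-second timeout are all hypothetical.

```python
from typing import Callable, Sequence, Tuple


def resolve_with_failover(
    name: str,
    nameservers: Sequence[str],
    send_query: Callable[[str, str], str],
    timeout_s: float = 2.0,
) -> Tuple[str, float]:
    """Try each nameserver in order.

    Returns (answer, seconds spent waiting on timeouts before the answer).
    `send_query(server, name)` is a hypothetical transport that raises
    TimeoutError when the server does not respond.
    """
    waited = 0.0
    for server in nameservers:
        try:
            return send_query(server, name), waited
        except TimeoutError:
            # Every unresponsive server costs one full timeout.
            waited += timeout_s
    raise TimeoutError(f"no nameserver answered for {name}")


def fake_send_query(server: str, name: str) -> str:
    """Stand-in transport: the primary is down, the secondary answers."""
    if server == "primary.example":
        raise TimeoutError
    return "192.0.2.10"


if __name__ == "__main__":
    servers = ["primary.example", "secondary.example"]
    print(resolve_with_failover("example.com", servers, fake_send_query))
```

In this model every lookup pays the primary's timeout before the secondary is tried, which is one way an outage of a single nameserver can appear as intermittent connectivity problems.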

How did we mitigate it?

The engineering team allocated more disk space to the impacted systems and restarted services, restoring DNS services.

What are we doing about it?

We are taking three actions moving forward:

  1. Configure logging and other temporary uses of disk space to use a separate disk volume. This ensures that surges in log volume will not impact other services on the same system.
  2. Re-evaluate operational alerts related to disk space and other system metrics to ensure they are addressed properly.
  3. Evaluate how the Firebox uses primary and secondary nameservers when DNSWatch is enabled, with the goal of balancing traffic across both.
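One generic way to balance traffic across two nameservers is to rotate which one is tried first on each query. The sketch below illustrates that idea only. `RotatingServerList` and the host names are hypothetical, and this is not a description of how the Firebox selects nameservers.

```python
from itertools import count
from typing import List, Sequence


class RotatingServerList:
    """Hand out the nameserver list with a different starting point each time."""

    def __init__(self, nameservers: Sequence[str]) -> None:
        if not nameservers:
            raise ValueError("at least one nameserver is required")
        self._servers = list(nameservers)
        self._counter = count()

    def next_order(self) -> List[str]:
        """Return the servers rotated so that successive calls start on successive servers."""
        start = next(self._counter) % len(self._servers)
        return self._servers[start:] + self._servers[:start]


if __name__ == "__main__":
    rotation = RotatingServerList(["primary.example", "secondary.example"])
    for _ in range(3):
        print(rotation.next_order())
```

With two nameservers, roughly half of all queries start on each one, so the loss of a single server delays only the queries that happened to start there.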

Moving forward

You can monitor the status of DNSWatch services and receive email notifications by subscribing to our DNSWatch statuspage.

We sincerely apologize for the impact to our affected customers and value the opportunity to meet your security needs.

Posted Aug 05, 2020 - 15:54 EDT

Resolved
This incident has been resolved.
Posted Jul 31, 2020 - 12:31 EDT
Update
All DNSWatch services are now fully operational and the issue has been resolved.
Posted Jul 31, 2020 - 12:31 EDT
Update
We had an additional outage from 3:55PM EDT to 4:10PM EDT as a result of the fix we implemented for the previous outage. We do not anticipate any subsequent outages and the issue has been resolved.

We will continue to monitor this incident and keep it updated.
Posted Jul 29, 2020 - 16:31 EDT
Monitoring
During a deployment to our production systems, DNSWatch experienced a DNS outage on 54.174.40.213 from 12:52PM EDT to 1:22PM EDT and again from 1:40PM EDT to 1:54PM EDT. While investigating this outage, we identified the underlying issue and have resolved it.

All DNSWatch services are now fully operational.
Posted Jul 29, 2020 - 14:24 EDT
This incident affected: DNS (DNS (US)).