Sorry to borrow the title from the book series, but it seemed to fit.
We’ve been working to improve our automatic monitoring and problem resolution, and this time the effort backfired. The other day we started swapping the Power Distribution Unit (PDU) in Dallas for one that actively monitors servers and can restart one that fails to respond. The one server it was being tested on failed to respond, and after trying for 2 minutes the new PDU reset it. On restart, the server decided the filesystem needed checking and mounted it in read-only mode, but the system came back up.
Our other monitoring systems check that web, SMTP, POP email, and MySQL connections are being accepted, and within minutes of the reset all of them were accepting connections again. The problem was that the filesystem was in read-only mode, so no data could be written. Because connections were being accepted and web pages were being served (although many would have shown errors if they required any writes or database access), the monitoring failed to reflect the true state of the server.
The knock-on effects hit our email, our voicemail (tied to email via a virtual PBX system), and the ticket/helpdesk system, which integrates tightly with email.
Changes we will be making to address this:
- First, the lagniappeinternet.com website and support software will be moved to a virtual machine. The VM will be set up to replicate automatically to the Atlanta data center on a periodic basis. With a DNS update, this would bring the ticket/helpdesk system back up very quickly.
- We’re investigating how to change our monitoring so that it can detect a disk/filesystem fault such as read-only status. Some customers may already be aware that we’ve been developing an ‘active’ or ‘smart’ monitoring system of our own. This experience will feed into that product, since it is exactly the kind of scenario that other monitoring systems don’t catch.
- We’ve added third-party routing for voicemails to support. This will let us get voicemail messages even if email is offline.
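To give a rough idea of the kind of read-only check we have in mind, here is a minimal Python sketch. It is not our actual monitoring code: the function names `mount_is_read_only` and `can_write` are made up for illustration, and it assumes a Linux server where the mount table can be read in `/proc/mounts` format. The first function looks for the `ro` option on a mount point, and the second actively tries to write a small temporary file, which is the test that a connection-only check never performs.

```python
import os
import tempfile


def mount_is_read_only(mount_point, mounts_text):
    """Return True if mounts_text (in /proc/mounts format) shows
    mount_point mounted with the 'ro' option.

    If the mount point is listed more than once, the last entry wins.
    Returns False if the mount point is not listed at all.
    """
    read_only = False
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 4 and fields[1] == mount_point:
            read_only = "ro" in fields[3].split(",")
    return read_only


def can_write(directory):
    """Actively probe a directory by creating, writing, and syncing a
    small temporary file. Returns False on any OS-level failure, which
    includes a read-only filesystem."""
    try:
        with tempfile.NamedTemporaryFile(dir=directory, prefix=".rwcheck-") as probe:
            probe.write(b"ok")
            probe.flush()
            os.fsync(probe.fileno())
        return True
    except OSError:
        return False
```

A monitoring agent could run something like `mount_is_read_only("/", open("/proc/mounts").read())` alongside `can_write("/var/tmp")` on a schedule and raise an alert if either one reports a problem. The write probe is the more useful of the two, since it also catches cases such as a full disk where the mount table still looks healthy.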