We’ve already explored eight items that should be monitored to help you find more flow, but there’s one more that we need to address: the underrated monitoring of your monitoring solution itself.
Unexpected Downtime Might Be Around the Corner
Imagine you’re working a steady, typical day managing the infrastructure of your organization. Suddenly, all systems are down, including your main monitoring system. Alerts and notifications can’t come through, so you and your team frantically search for the source of the problem. Trying to determine the scope of the issue while hoping downtime won’t exceed your Service Level Agreement (SLA) is difficult enough, but without alerts, your team is running in the dark.
Eventually, you realize a random and unusual network hiccup caused the commotion. Despite finding a solution, your team and the organization lost valuable time, and now you’re confronted with frustrated managers and post-incident reports.
Even with thorough testing and planning before deploying patches or updates, these incidents can still happen. System shutdowns can also result from internet outages, server hardware failures, cloud provider outages, or, at worst, a complete power failure.
Regardless of the cause, an outage is extremely stressful: system administrators and directors have limited time to fix the problem before it costs the organization thousands of dollars.
Improving Your Single Source of Truth
You rely on security information and event management (SIEM) tools to protect you from disaster. A secondary monitor outside the primary network that tracks overall system health is the extra support you may never have considered. When you have a backup system (a monitor to monitor the monitor, so to speak), you can proactively find a solution instead of reacting to the problem, which means less costly downtime.
If you install your secondary monitor off premises, you’ll still have a functioning external monitor to alert you when the internal network is down. Installing it in the cloud achieves the same effect: the monitor lives outside your own network, so it keeps working even when your infrastructure doesn’t.
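An external watchdog check can be surprisingly small. The sketch below, in Python, polls the primary monitoring server from outside the network; the URL is a hypothetical placeholder, and a real deployment would trigger an actual alert instead of printing.

```python
# Minimal external watchdog sketch. The status URL is an assumption:
# substitute whatever endpoint your primary monitor actually exposes.
from urllib.request import urlopen
from urllib.error import URLError

def is_reachable(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers successfully within the timeout."""
    try:
        with urlopen(url, timeout=timeout):
            return True
    except (URLError, OSError, ValueError):
        # Connection refused, DNS failure, HTTP error, timeout, bad URL.
        return False

if __name__ == "__main__":
    # Run this from a host *outside* the primary network.
    if not is_reachable("http://primary-monitor.example.com/status"):
        print("ALERT: primary monitoring server is unreachable")
```

Because the check runs from a different vantage point, a failure here tells you the monitor itself (or the path to it) is down, even when no internal alert can get out.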
When you only focus on the applications being used, and not the entire system or network, you’re missing an opportunity to safeguard valuable data and information. This is how you improve a single source of truth: by creating a second source that validates the first. For example, this two-step process lets you determine whether Amazon Web Services (AWS) caused an outage or whether you’re looking at a problem unique to your own network.
The beauty of monitoring your monitoring system is that the secondary monitor doesn’t need to track everything your primary monitor is tracking. The goal is an outside-in picture of overall system health, watching essentials like connectivity so you know whether the system is down.
The Difference Between Test and Disaster Recovery, and Monitoring the Monitor
Many people believe that using the test and disaster recovery method is enough of a failsafe for their monitoring solution. Unfortunately, while those are key ingredients of a healthy network, they still leave a gap. There are two key differences between monitoring your monitor and using the test and disaster recovery approach. First, look at the three steps of test and disaster recovery:
- Production: Regularly occurring checks that all systems are operating correctly.
- Test: Running programs multiple times to validate pending alerts, updates, and patches before sending notifications to your team.
- Disaster Recovery: Open communication between production and test systems that allows course-correcting issues automatically or sending notifications to your team to fix problems before they escalate.
While this solution is thorough, it is a standby approach to monitoring: once an error occurs, you launch into these steps to determine the breadth of the issue. The test and disaster recovery approach may make for an optimized post-incident workflow, but a proactive solution with fewer steps tells you in real time when something is wrong and helps you avoid notification fatigue.
If you had monitored your monitor, you would have known there was trouble before it escalated and impacted additional applications. The secondary monitor is also an active, bird’s-eye view of your systems. Rather than being reactive, the system is constantly pulling, analyzing, and logging data, so you receive alerts immediately when something goes down rather than waiting for the hammer to fall.
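That constant pull-analyze-log cycle can be sketched as a simple loop. Everything here is illustrative: the check function, interval, and alert hook are placeholders, and a real secondary monitor would run this as a long-lived service.

```python
# Hedged sketch of a secondary monitor's polling loop. The `check`
# callable and interval are hypothetical; substitute a real probe.
import time

def watchdog_loop(check, interval_seconds, iterations, log):
    """Run `check` repeatedly, logging each result with a timestamp."""
    for _ in range(iterations):
        status = "up" if check() else "DOWN"
        log.append((time.time(), status))
        if status == "DOWN":
            # Here you would fire an alert (email, SMS, webhook, ...).
            pass
        time.sleep(interval_seconds)
    return log

# Usage with a stub check that fails on the third poll:
results = iter([True, True, False])
log = watchdog_loop(lambda: next(results), 0.01, 3, [])
print([status for _, status in log])  # -> ['up', 'up', 'DOWN']
```

The timestamped log is what later makes post-incident reporting possible: it records exactly when the system went down and came back.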
Protect Your Future with Better Insight into System Disruptions
Once you have resolved any issues, you’ll want the ability to gather data and determine next steps. The findings from the logged information can provide a full story to determine the true source of issues.
Comparing against sites like Google or Facebook, as seen above, confirms whether outside services were also impaired and gives greater clarity about the scale of the incident. You can see that Google.com was not impacted, further clarifying that this was a local network issue.
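The logic behind that comparison can be expressed in a few lines. This is a hedged sketch, not a Nagios feature: the site names are illustrative, and the reachability values would come from checks like the watchdog above.

```python
# Sketch: distinguish a local problem from a wider outage by checking
# well-known reference sites alongside your own service.
def classify_outage(our_service_up: bool, reference_sites_up: dict) -> str:
    """Classify an incident from the watchdog's vantage point."""
    if our_service_up:
        return "no outage detected"
    if all(reference_sites_up.values()):
        # Google, Facebook, etc. respond fine, so the internet is
        # reachable: the problem is specific to our service or network.
        return "local issue: only our service is down"
    if not any(reference_sites_up.values()):
        # Nothing is reachable from here at all.
        return "wider outage: external sites are also unreachable"
    return "partial outage: some external sites are impaired"

print(classify_outage(False, {"google.com": True, "facebook.com": True}))
# -> local issue: only our service is down
```

In the scenario described above, the reference sites staying up is what pins the outage to the local network rather than to a provider.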
Built-in reporting gives insight to data after a disaster like this so that you can fully understand the impact and the best next steps.
Additionally, an SLA report that shows how, or whether, the outage impacted any promise, commitment, or contract you have with another company offers a more meaningful understanding of what happened.
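The arithmetic behind such a report is straightforward. As a sketch, assuming a hypothetical 99.9% monthly availability target over a 30-day month:

```python
# Simple SLA availability calculation. The 99.9% target and 30-day
# month are assumptions for illustration; use your contract's terms.
def availability(downtime_minutes: float, period_minutes: float) -> float:
    """Percentage of the period the service was up."""
    return 100.0 * (period_minutes - downtime_minutes) / period_minutes

MONTH_MINUTES = 30 * 24 * 60   # 43,200 minutes in a 30-day month
SLA_TARGET = 99.9              # percent; allows ~43.2 minutes of downtime

observed = availability(downtime_minutes=50, period_minutes=MONTH_MINUTES)
print(f"availability: {observed:.3f}% (target {SLA_TARGET}%)")
if observed < SLA_TARGET:
    print("SLA breached")
```

Fifty minutes of downtime in that scenario works out to roughly 99.88% availability, just under the 99.9% target, which is exactly the kind of finding an SLA report surfaces.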
More insightful information means more meaningful connections with your clients as you address their concerns and show that your service or product is consistently reliable and stable.
These types of reports create a full story about exactly what went wrong, the cause, and the result. You’re empowered to purchase backup batteries, additional cloud capacity, or additional monitors and servers to reduce the risk of another problem arising. This transparency improves both your company’s bottom line and your reliability in your clients’ eyes.
Implementing a Solution to Monitor Your Monitor with Nagios XI
Though this can be accomplished with any monitoring service, Nagios XI offers a free license for up to seven nodes with 100 total service checks, and it can be installed nearly anywhere.
You can start small with two active monitoring instances, for example a virtual Linux machine costing a couple of dollars, or use Linode to spin up Nagios XI. Hardware and cloud computing costs are negligible compared to the time and money saved by decreasing downtime and increasing client confidence.
The added assurance of Nagios XI visualizations and data keeps you working effectively, remotely or on site. You won’t have all the reporting options of the more extensive Nagios XI Enterprise version, but this solution answers your core concern: Is my system working? A secondary monitoring system may be the most underrated tool with some of the largest payoffs for your business. With Nagios XI you can avoid costly, time-consuming catastrophes and set yourself up for new success.