We have all been there. That moment when your heart sinks, human error meets technology, and your worst nightmare comes true. Some Nagios employees share past IT stories that still haunt them to this day.
One Saturday night while on-call I was 30 minutes away from work when the commercial power failed. Luckily, the diesel generator kicked in, but the automated transfer switch failed!
With only 19 minutes of UPS power, the entire data center suffered an ungraceful hard power loss before I could get there. This resulted in hundreds of downed servers, dozens of ESXi servers, and custom databases with startup scripts that required an intimate knowledge of how to start them. I was the only tech who responded, so bringing the entire HQ infrastructure back up was a very long 30+ hour job. My eyes bled!
We only lost one ESXi server, which luckily turned out to be a WebSphere test environment. It was at this point that we realized some of the backup files were corrupted, so we had to rebuild them from scratch. Yay, more fun! I was very thankful that it was only a test machine.
This happened twice! The next time, the generator didn’t kick in… at least the transfer switch worked and another tech responded with me. That time it was only about 15 hours of overtime.
-Nagios Support Analyst
Eerie Event Processing
I was a new development team member, eager to prove myself. I had already been working in the codebase for a few years, so I was mostly comfortable with jumping in and making changes.
Always on the lookout for performance enhancements, I came up with a way to decrease the memory footprint for the software to handle specific events. This was a massive improvement for large systems!
The old way would process each event individually as they occurred. The new and improved way would queue the events up until a certain number of events were saved or a certain time period had gone by before it started processing them.
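The queue-then-process idea described above can be sketched in a few lines of Python. This is only an illustration, not the actual product code: the `EventBatcher` class, its parameter names, and the thresholds are all hypothetical, and the clock is injectable purely so the time-based flush can be exercised without waiting.

```python
import time
from typing import Callable, List, Optional


class EventBatcher:
    """Hypothetical sketch: hold events until a size or age limit is hit."""

    def __init__(
        self,
        process_batch: Callable[[List[str]], None],
        max_events: int = 100,
        max_wait_seconds: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._process_batch = process_batch
        self._max_events = max_events
        self._max_wait_seconds = max_wait_seconds
        self._clock = clock
        self._pending: List[str] = []
        self._first_queued_at: Optional[float] = None

    def add(self, event: str) -> None:
        """Queue one event, then process the batch if a limit was reached."""
        if not self._pending:
            # Remember when the oldest queued event arrived.
            self._first_queued_at = self._clock()
        self._pending.append(event)
        self._flush_if_due()

    def tick(self) -> None:
        """Call periodically so a quiet system still drains old events."""
        self._flush_if_due()

    def flush(self) -> None:
        """Hand every queued event to the processor and reset the queue."""
        if not self._pending:
            return
        batch, self._pending = self._pending, []
        self._first_queued_at = None
        self._process_batch(batch)

    def _flush_if_due(self) -> None:
        if not self._pending or self._first_queued_at is None:
            return
        is_full = len(self._pending) >= self._max_events
        is_stale = self._clock() - self._first_queued_at >= self._max_wait_seconds
        if is_full or is_stale:
            self.flush()
```

A caller would pass its own processing function, call `add()` for each incoming event, and call `tick()` on a timer. Feeding the queue some events with non-ASCII text during testing is cheap insurance against the kind of surprise described next.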
I tested, then the development team tested, and then the QA department tested.
However, when we released the software, we were getting an unusually large number of bug reports – all for the same thing. Events weren’t being processed!
As it turns out, there were some international characters that were causing an issue during the event queueing process – which explains why we didn’t catch it during the QA process.
Thankfully, we identified the issue almost instantly and were able to issue a patch that same day.
-Nagios Product Development Manager
Back in the early ’90s, my first job was as a systems administrator. My first day on the job, a forklift dropped off a 3′x3′x3′ crate in front of my desk. My boss stopped by and said, “I need you to have this set up for production by Friday.” I replied, “What is it?”
He went on to tell me it was an Ascend Max TNT (http://www.mtmnet.com/PDF_FILES/Lucent_Max_TNT_DataSheet.pdf) remote access concentrator, which contained 960 modems and networking and ran some proprietary operating system, but not to worry, because inside the crate was the manual, which should contain all the information I needed to configure it. (This was before you could Google the answer to almost anything.)
He was right about the crate containing the manual: it contained not one, but four 300+ page manuals with detailed instructions on every use case other than what we wanted to use it for.
One similar manual can still be found today.
Luckily, after many long hours, I was able to figure out how to configure it.
-Nagios DevOps Engineer
Dreadful Data Center
In 2011 I inherited a datacenter and had half a day, the IT Director’s last day of work, to learn as many physical and logical quirks of the multi-site infrastructure as possible. It seemed fairly straightforward at face value: 33 million users/mo, 35k mailboxes, a few petabytes of used storage, etc. Most of the gear was up to the task, but it was all fairly dated and needed to be overhauled.
Back then, if you didn’t properly configure storage, Windows could only see up to 2TB of space. We apparently had our entire flat file mail store located on one of these drives with no clear path to migrate it off any time soon. The drive was 90% full and climbing. Having to put out many fires per hour, I had not yet configured Nagios to monitor this, and no other perf monitoring/alerting was in place.
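For readers who have never written one, a disk check of the kind that was missing here can be very small. The sketch below is an assumption-laden illustration rather than an official Nagios plugin: the function names and the 80%/90% thresholds are made up for this example. The only convention borrowed from Nagios is the plugin exit-code scheme, where 0 means OK, 1 means WARNING, and 2 means CRITICAL.

```python
import shutil
from typing import Tuple

# Nagios plugin exit-code convention.
OK, WARNING, CRITICAL = 0, 1, 2


def classify_usage(
    percent_used: float, warn: float = 80.0, crit: float = 90.0
) -> Tuple[int, str]:
    """Map a fullness percentage to a status code and label.

    The default thresholds are illustrative, not a recommendation.
    """
    if percent_used >= crit:
        return CRITICAL, "CRITICAL"
    if percent_used >= warn:
        return WARNING, "WARNING"
    return OK, "OK"


def check_disk(path: str, warn: float = 80.0, crit: float = 90.0) -> Tuple[int, str]:
    """Measure how full the volume holding `path` is and build a status line."""
    usage = shutil.disk_usage(path)
    percent_used = usage.used / usage.total * 100
    code, label = classify_usage(percent_used, warn, crit)
    return code, f"DISK {label} - {path} is {percent_used:.1f}% full"
```

A wrapper script would print the returned message and exit with the returned code so the monitoring system can pick up the state. With these example thresholds, a drive at 90% would already be reporting CRITICAL.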
One night that remaining 10% suddenly closed the gap and the call centers lit up. The mail server program at the time had no ability to see multiple drives and migrating the data off the maxed drive was projected to take months based on the physical performance limitations. If we tried to move the data any faster than 1 mailbox every 10 minutes, the system would become unstable and take down other items like web services, SQL servers, etc. Basically a domino effect. So that’s what we ended up having to do… One mailbox at a time…
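A rough back-of-the-envelope calculation shows why the projection came out in months. It assumes that all 35k mailboxes mentioned earlier lived on the full drive and that the migration ran around the clock at the safe rate of one mailbox every 10 minutes; neither assumption is confirmed by the story.

```python
MAILBOXES = 35_000          # figure quoted earlier in the story
MINUTES_PER_MAILBOX = 10    # fastest rate that kept the system stable


def migration_days(mailboxes: int, minutes_per_mailbox: float) -> float:
    """Days needed to move every mailbox one at a time, running 24/7."""
    total_minutes = mailboxes * minutes_per_mailbox
    return total_minutes / (60 * 24)


days = migration_days(MAILBOXES, MINUTES_PER_MAILBOX)
print(f"{days:.0f} days, or roughly {days / 30:.0f} months")
# prints "243 days, or roughly 8 months"
```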
Part of me wonders if they are still migrating that data…
-Nagios Operations Specialist
Nagios is there to alleviate all your IT Horrors! Download your free, fully-loaded 60-day trial of all the Nagios products here.
We want to hear from you! Share your IT Horror Story below.