I love data – and the importance of monitoring

I find failure causes interesting.  I enjoy fixing things, but even more I like learning why something is broken.  A large part of fixing any problem is finding the root cause, and one of the things that helps most is data.  The more data you have on something, the easier it is to spot an anomaly and track down the root cause.

Let’s take a real-world example.  Where I work we use a monitoring tool called Nagios XI, and it does all sorts of nifty monitoring and alerting for us.  I recently got a notification that disk space was running low on a particular mounted disk that is used to store various files.  This alert was unusual, as the files on this disk are only kept for a specific time period (45 days in this case) and then purged.  So seeing this alert raised an eyebrow.  First thing I did?  Check our monitoring system.  Let’s see what disk usage activity looks like for the last few months:

Look at that data! Isn’t it wonderful?  No?  Sure it is!  Let me explain… you can see things humming along nicely.  Files are getting deleted regularly, keeping space usage in check… until mid-February.  Well, that’s weird… what changed?  Remember what I said about the 45-day retention period?  Think about what happened about 45 days before mid-February… That’s right, the New Year!  Our year changed from 2016 to 2017.  At this point, things start clicking.  I hop onto the server, and my suspicions are confirmed.  The retention job points to a year-specific main folder, and it was still pointed at the 2016 folder, so files written to the 2017 folder were never being purged.  Simply updating the job to point to the 2017 folder and re-running it fixed the issue.  Problem solved! At least for a year 🙂
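If you want the fix to last longer than a year, one option is to have the job work out the year folder from the current date instead of hardcoding it.  Here is a minimal sketch in Python of that idea.  It is not the actual job from our server: the layout (one folder per year under a root directory), the function names, and the choice to also scan the previous year’s folder are all assumptions made for illustration.

```python
import time
from datetime import date
from pathlib import Path
from typing import List, Optional

RETENTION_DAYS = 45  # retention period mentioned in the post


def year_folders(root: Path, today: date) -> List[Path]:
    """Folders to scan: the current year's and the previous year's.

    Assumes a hypothetical layout of <root>/<YYYY>/... . Scanning the
    previous year too means files written in late December still get
    purged in January and February.
    """
    return [root / str(today.year), root / str(today.year - 1)]


def purge_old_files(root: Path, today: Optional[date] = None) -> List[Path]:
    """Delete files older than RETENTION_DAYS and return what was removed."""
    today = today or date.today()
    cutoff = time.time() - RETENTION_DAYS * 86400  # oldest allowed mtime
    removed = []
    for folder in year_folders(root, today):
        if not folder.is_dir():
            continue  # e.g. the new year's folder does not exist yet
        # Build the list first so we are not deleting while walking the tree.
        for path in list(folder.rglob("*")):
            if path.is_file() and path.stat().st_mtime < cutoff:
                path.unlink()
                removed.append(path)
    return removed
```

Something like `purge_old_files(Path("/path/to/files"))` run from cron would then keep working when the calendar rolls over, with no yearly edit needed.  The path here is a placeholder.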

And thus the importance of monitoring your stuff.  Had we not been monitoring this server, we wouldn’t have known that this cron job had broken and that the disk was approaching full. What happens when a disk maxes out?  Well, new data can’t be written, and it would have been lost. Proactive monitoring once again saves the day, allowing us to fix a problem before it actually becomes a real problem!  Pretty neat, you know, if you’re into that kind of stuff.
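If you are curious what a disk space check boils down to, here is a rough sketch in Python.  It is not how Nagios XI does it internally; it only follows the common Nagios plugin convention of returning 0 for OK, 1 for WARNING and 2 for CRITICAL.  The 80% and 90% thresholds and the function names are assumptions picked for illustration.

```python
import shutil
from typing import Tuple

# Status codes following the Nagios plugin convention.
OK, WARNING, CRITICAL = 0, 1, 2


def classify(used_pct: float, warn_pct: float = 80.0,
             crit_pct: float = 90.0) -> Tuple[int, str]:
    """Map a usage percentage to a status code and label.

    The default thresholds are illustrative assumptions, not values
    taken from the post.
    """
    if used_pct >= crit_pct:
        return CRITICAL, "CRITICAL"
    if used_pct >= warn_pct:
        return WARNING, "WARNING"
    return OK, "OK"


def check_disk(path: str, warn_pct: float = 80.0,
               crit_pct: float = 90.0) -> Tuple[int, str]:
    """Check the filesystem containing `path`; return (code, message)."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    used_pct = usage.used / usage.total * 100
    code, label = classify(used_pct, warn_pct, crit_pct)
    return code, f"DISK {label} - {used_pct:.1f}% used on {path}"
```

A wrapper script would print the message and exit with the returned code, which is what lets a monitoring system turn the result into an alert before the disk actually fills up.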