I find failure causes interesting. I enjoy fixing things, but even more I like learning why something broke. A large part of fixing any problem is finding the root cause, and one of the things that can aid in that is data. The more data you have on something, the easier it is to spot an anomaly and track down the root cause.
Let’s take a real-world example. Where I work, we run a monitoring tool called Nagios XI that does all sorts of nifty monitoring and alerting for us. I recently got a notification that disk space was running low on a particular mounted disk used to store various files. This alert was unusual because the files on this disk are only kept for a specific time period (45 days in this case) and then purged, so seeing it raised an eyebrow. First thing I did? Check our monitoring system. Let’s see what recent disk usage activity looks like for the last few months:
Look at that data! Isn’t it wonderful? No? Sure it is! Let me explain… you can see things humming along nicely. Files are getting deleted regularly, keeping space usage in check… until mid-February. Well, that’s weird… what changed? Remember what I said about the 45-day retention period? Think about what happened roughly 45 days before mid-February… that’s right, the New Year! Our year changed from 2016 to 2017. At this point, things start clicking. I hopped onto the server, and my suspicions were confirmed: the retention job points at a year-specific main folder, and it was still pointed at the 2016 folder. Simply updating the job to point at the 2017 folder and re-running it fixed the issue. Problem solved! At least for a year 🙂
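The actual retention job isn’t shown here, but the lasting fix for this class of bug is to purge by file age across every year folder rather than pointing the job at a single hardcoded year. A minimal sketch of that idea (the mount point is a made-up example, not the real path):

```python
import os
import time

RETENTION_DAYS = 45        # matches the retention period described above
BASE_DIR = "/mnt/archive"  # hypothetical mount point, not the real one

def purge_old_files(base_dir, retention_days):
    """Delete files older than retention_days, walking every year
    folder under base_dir so nothing is missed at the New Year rollover."""
    cutoff = time.time() - retention_days * 86400
    removed = []
    for root, _dirs, files in os.walk(base_dir):
        for name in files:
            path = os.path.join(root, name)
            if os.path.getmtime(path) < cutoff:
                os.remove(path)
                removed.append(path)
    return removed

# Usage: purge_old_files(BASE_DIR, RETENTION_DAYS)
```

Because the walk covers every subfolder, there is no job to re-point each January.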
And thus the importance of monitoring your stuff. Had we not been monitoring this server, we wouldn’t have known this cron job broke and that the disk was approaching full. What happens when a disk maxes out? New data can’t be written and would have been lost. Proactive monitoring once again saves the day, allowing us to fix a problem before it actually becomes a real problem! Pretty neat, you know, if you’re into that kind of stuff.

September 7th, 2016
Taking on a new client is a fairly normal occurrence most of the time. It usually goes decently smoothly: getting domain and hardware passwords transferred over, sharing knowledge collected over time, making notes of any gotchas or unique issues with a client. Every once in a while, though, taking over a client leads to a complete horror of horrors in discovering how many things were done wrong and what a dangerous position the previous company had left their now-former client in.
I’ve been doing this for a decade now and thought I’d seen it all, but a recent case proved to me never to underestimate someone’s ability to royally hose things up.
The original reason we were called in was that their server kept freezing up. They had called their IT people two weeks earlier and kept getting put off, so they were tired of waiting and called us in. What did we find on arrival? A failing hard drive. Something that could have taken down their entire business, and the former IT company put it off for who knows what reason?! The good news was that the disk was in a RAID array, so they had some redundancy, but the failing disk was still causing the server to hang quite frequently. So we replaced it and rebuilt the array.
The next issue we discovered during array maintenance: a completely dead battery on the RAID controller. So, we replaced the battery.
Next up, the server wasn’t even on a UPS. It was plugged into the “surge” side (not the battery side) of a UPS, and the UPS wasn’t even big enough to handle the server anyway. So, we got them an appropriately sized UPS.
So, what if the array had died? What if they had lost power and ended up with corruption from a dead array battery and an absent UPS? Well, they could have restored from backups, right? HAHAHA! No, no they couldn’t have. The “cloud” backup their previous company was charging them for wasn’t even backing up any shared files. All of the business’s proprietary data would’ve been GONE. Their cloud backup was only configured to back up the “Program Files” directory, which would’ve been goddamn useless in a disaster recovery situation.
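A mistake like that is cheap to catch: compare what the backup job actually selects against what the business needs. This is a hypothetical sketch of such a sanity check, not anything the vendor provided; the required paths are invented examples:

```python
import os

# Hypothetical example: paths the business actually needs backed up
REQUIRED_PATHS = [r"D:\Shares\CompanyData", r"D:\Shares\Accounting"]

def verify_backup_selection(selected_paths, required_paths):
    """Return the required paths missing from a backup job's selection
    list -- the kind of check that would have flagged a job backing up
    only 'Program Files' while skipping all of the shared data."""
    normalized = {os.path.normcase(os.path.normpath(p)) for p in selected_paths}
    return [p for p in required_paths
            if os.path.normcase(os.path.normpath(p)) not in normalized]
```

Run it against the job’s selection list whenever the backup config changes; a non-empty result means data is silently unprotected.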
While we’re on the subject of billing for services not being provided, we also found they were being charged for website hosting. The problem? Their IT company wasn’t hosting their website; it was hosted at another provider in town. The ONLY thing the IT company was hosting was public DNS for the site, yet they were billing at the full website-hosting price. Nice little scam they had going there, don’t you think?
I wish I could tell you the horrors stopped here, but they don’t.
Well, it is 2016, and the Linksys E2000 router I’ve been using since 2010, running DD-WRT, was still in service. It was still a fine router for what I was using it for, but it was starting to show its age. For one, it didn’t support dual-band Wi-Fi (2.4 GHz and 5 GHz). A month or so ago I decided to replace its Wi-Fi duties with a Ubiquiti UniFi AC-AP-Pro to get better coverage and better wireless speeds, both of which were accomplished. I was so impressed with it and Ubiquiti’s controller software that I became interested in some of their other products.

Coincidentally, my local Fiber-to-the-Home ISP announced that they would be rolling out Gigabit fiber access. Previously, you could only get up to 250 Mbit. I was on the 50 Mbit package, but for the Gig rollout they’re running a promotion where you can lock in Gigabit speeds, for as long as you have service with them, for only $10 a month more than I was paying for the 50 Mbit fiber. So, 20x the bandwidth for $10 more a month is a no-brainer for me.

This meant I had to upgrade my router, though. My Linksys E2000 running DD-WRT was only capable of about 60 Mbit of throughput on the WAN interface due to its aging CPU. I was already pushing it close with my 50 Mbit connection, but Gigabit would be far too much for it to handle. So I did some research and ended up selecting the Ubiquiti EdgeRouter Lite. These are powerful little machines running Ubiquiti’s EdgeMax OS, a customized fork of the Linux-based Vyatta routing software suite. It seemed to have the best bang-for-the-buck feature set, it was from Ubiquiti, whose products I was already interested in and one of which I already owned, and most importantly it can push FULL Gigabit line speed through the WAN!

May 4th, 2016
Thanks to the Let’s Encrypt project, you can now browse my website in glorious HTTPS using a free Let’s Encrypt certificate. It was fairly simple to get the certificate issued and to set up the automatic renewal job on the web server. Let’s Encrypt is providing an interesting service to the masses by making basic HTTPS (TLS) encryption free and easily accessible. In years past, SSL certificates would cost hundreds of dollars. That has changed in recent years, with basic certificates coming down to just $5. But for sites like this one, which I run for fun and don’t make money from, even $5 seemed like an unnecessary cost and hassle. Well, Let’s Encrypt virtually eliminates both of those final barriers. With free certificates issued through their script, there is practically no reason NOT to be running HTTPS on your site now. Thanks, Let’s Encrypt 🙂

October 18th, 2015
Seriously, it is 2015 now. Every big service provider should be supporting some form of 2-factor authentication. Google is a prime example of the right way to implement this, and everyone should be following their lead. This weekend, an email account I hadn’t used in over a year had its password cracked. The bot then pulled my extremely outdated online address book and sent spam links to everyone in it. Fantastic! So, I changed the password and deleted all of the contacts from the address book. Had this provider (cough… AOL …cough) had a 2FA implementation, this NEVER could have happened. Their service wouldn’t have been used to send out spam, and I wouldn’t look like a doofus with an apparently weak password on that old account.
I’ll also add: if you have a service like Google and you’re NOT using 2FA, you need to go set that shit up right now. It makes your account nearly IMPOSSIBLE to get into unless the hacker also has your physical device (usually your phone with an app; I recommend Authenticator Plus). Knowing your login name and password alone would never get them in.
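If you’re curious what that authenticator app is actually doing, it’s TOTP (RFC 6238): an HMAC-SHA1 over the current 30-second time window, keyed with a secret you and the service share at setup. A minimal sketch (the secret in the usage line is just an illustration, not a real one):

```python
import base64
import hashlib
import hmac
import struct
import time

def totp(secret_b32, timestamp=None, digits=6, period=30):
    """RFC 6238 TOTP: derive the rolling 6-digit code that apps like
    Google Authenticator show, from a shared base32-encoded secret."""
    key = base64.b32decode(secret_b32, casefold=True)
    # Number of whole periods since the Unix epoch
    counter = int((time.time() if timestamp is None else timestamp) // period)
    digest = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
    # RFC 4226 dynamic truncation: take 4 bytes at an offset from the digest
    offset = digest[-1] & 0x0F
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

# Usage (made-up secret): totp("GEZDGNBVGY3TQOJQGEZDGNBVGY3TQOJQ")
```

The code changes every 30 seconds and never travels over the wire at setup time, which is why a stolen password alone isn’t enough.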
Wondering if a service you use supports 2FA or not? Well, check out this nifty website: https://twofactorauth.org/

October 17th, 2015
Google has FINALLY fixed all of the issues with Lollipop. Too bad it took a whole system update to a new version of Android to get there, but alas, WE ARE THERE! The awful memory management issues I originally complained about are completely resolved at last. In addition, the new Doze feature, which puts the phone into a super-deep sleep when it’s idle for a while, has made a tremendous impact on standby battery life. I’ve picked up my phone 6 hours after setting it down and seen maybe 1-2% battery loss. This is a stark improvement, as any previous version of Android would’ve easily lost 5-10% in the same time period. Little added tweaks like FINALLY being able to show the battery percentage in the pull-down shade are nice, and the new permissions system will eventually be nice too. I say eventually because no app can really use it until its devs release an update targeting the new Marshmallow API level. But it’s finally here, and as more and more apps incorporate it, things will just get better and better. Now if Google could just pull its head out of its ass and put Qi charging back in the 2016 Nexus phone(s)…

January 26th, 2015
Android 5.0 Lollipop came out back in mid-November 2014, so it has been out a couple of months as of this writing. There are a lot of really annoying bugs in Lollipop, like caller ID pictures refusing to show up, silent mode completely broken, horrible navigation icons, and the lock screen no longer locking the phone.
But the worst thing about Lollipop is the CONSTANT memory leakage. I have a Nexus 5 and this shit is out of control, and Google hasn’t done shit to fix it yet. Pre-Lolliflop, my Android devices could quite easily rack up MONTHS of uptime, and reboots were usually just due to something like updating the recovery image. In Lolliflop, something at the system level is leaking memory so severely that the phone can’t even make it 2 weeks without getting so low on usable RAM that even the damn keyboard won’t open, so you have to restart your phone if you want to just be able to text people again.
Check this out:

July 27th, 2014
Getting the “Pushbullet Notification Failed” error from your SickBeard notifications? Well, the fix is simple, and here is what you have to do:
1) Find pushbullet.py in your SickBeard install directory. For me, this file was located in:
2) Open that file in your editor of choice (for me that’s vi) and make the following change:
if method == 'POST':
    uri = '/api/pushes'
else:
    uri = '/api/devices'
And change to:
if method == 'POST':
    uri = '/v2/pushes'
else:
    uri = '/api/devices'
Note that the devices URI doesn’t change; you just need to update the pushes URI. I could dig into the ins and outs of the Pushbullet API if I really felt like figuring out why this changed, but I don’t care that much. I just wanted to get my notifications working again.
3) Write and quit your changes, then restart SickBeard so your Python changes get recompiled. Then hop into your notifications and send a test message to make sure it worked.
That fixed it for me! Happy Pushbulleting 🙂
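For the curious, the v2 pushes endpoint that the fix above switches to takes a JSON body. This is a rough sketch of the request shape only; the function and its header choices are my own illustration (not SickBeard’s code), and nothing here is actually sent over the wire:

```python
import json

def build_push_request(api_key, title, body):
    """Assemble the method, URI, headers, and JSON body for a
    Pushbullet v2 'note' push -- the request shape behind the
    uri = '/v2/pushes' change above. Purely illustrative."""
    payload = json.dumps({"type": "note", "title": title, "body": body})
    headers = {
        "Access-Token": api_key,       # assumed auth style; check current API docs
        "Content-Type": "application/json",
    }
    return "POST", "/v2/pushes", headers, payload
```

Seeing the pieces laid out this way makes it easier to test an API change with curl before editing the plugin again.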
UPDATE 9/14/2014: Well, something broke again. I couldn’t figure out what (I think it has to do with the JSON body, but I didn’t feel like rewriting something that’s been fixed elsewhere), so I snagged the latest pushbullet.py from the troubled SickRage project and used that instead. Note that I had to switch the devices URI back to /api/devices because /v2/devices doesn’t work for me for some reason. Hopefully this doesn’t break again :-\